Search is broken. Or rather, keyword search is. It's the go-to method for most systems that help users find relevant information, but it falls flat the moment the wording changes or synonyms are used. In contexts like internship searches, that's a real problem. In this article, I want to give you some insight into the topic of my bachelor's thesis.
Why We Needed a Better Search
Imagine you're a student looking for an internship in AI. You type “AI in Python” into the search bar, but the only hits are those that mention the exact phrase. Offers like “Machine Learning using PyTorch” or “Data Science internship with TensorFlow” go unnoticed, despite being totally relevant.
This happens all the time in internship search portals — including the one I was working with. Most of these systems rely on simple string-matching, making them blind to meaning. That means good matches are missed, and irrelevant results often rise to the top.
I set out to change that.
The Idea: Make Search Semantic
The goal was to build a search engine that understands meaning, not just words. In particular, I wanted to:
- Recognize relevant internships even if the wording doesn't match exactly
- Allow users to search with nuanced queries, like “no customer contact”
- Support future extensions like recommendations and skill-based filtering
To get there, I explored a range of techniques — from traditional keyword methods to state-of-the-art vector search and knowledge graphs.
Step-by-Step: What I Built
This wasn’t just a software project — it followed a rigorous Design Science Research (DSR) process with iterative development, evaluation, and refinement.
1. Start Simple: Keyword Search
As a baseline, I built a classical keyword search engine using a simple tokenizer. It worked… okay. But it missed almost every semantically phrased query.
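To make the baseline concrete, here is a minimal sketch of such a keyword search. The tokenizer and scoring are illustrative, not the exact thesis code: documents are simply ranked by how many query tokens they contain, which is exactly why semantically phrased offers slip through.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_search(query: str, documents: list[str]) -> list[tuple[str, int]]:
    """Rank documents by the number of tokens they share with the query."""
    q_tokens = tokenize(query)
    scored = [(doc, len(q_tokens & tokenize(doc))) for doc in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = [
    "Machine Learning internship using PyTorch",
    "AI internship in Python",
    "Marketing internship with customer contact",
]
# "Machine Learning using PyTorch" scores 0 despite being relevant
print(keyword_search("AI in Python", docs))
```

Note how the PyTorch offer gets a score of zero: no shared tokens, no match — the core weakness this project set out to fix.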
2. Improve with BM25
Next came the BM25 algorithm — a probabilistic ranking model commonly used in search engines. It ranked results slightly better by considering term frequency and document length, but still lacked semantic understanding.
3. Enter Vector Search with ChromaDB
This was the breakthrough. I used ChromaDB, a lightweight open-source vector database, to embed internship descriptions and user queries into a shared semantic space using Sentence Transformers like 'all-MiniLM-L6-v2' and 'all-mpnet-base-v2'.
Queries were transformed into vectors, and results were ranked based on cosine similarity — a measure of closeness in meaning, not just text.
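Cosine similarity itself is simple; the heavy lifting happens in the embedding model. A pure-Python sketch with toy 3-dimensional vectors (real Sentence Transformer embeddings have 384 or 768 dimensions, and the numbers below are made up for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: the ML offer points in nearly the same direction as the query
query_vec = [0.9, 0.1, 0.0]
ml_doc_vec = [0.8, 0.2, 0.1]    # semantically close
abap_doc_vec = [0.0, 0.1, 0.9]  # semantically distant

print(cosine_similarity(query_vec, ml_doc_vec))
print(cosine_similarity(query_vec, abap_doc_vec))
```

Because documents and queries live in the same vector space, "Machine Learning using PyTorch" can end up close to "AI in Python" even with zero shared words.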
4. Go Deep with Knowledge Graphs
To push things further, I built a custom ontology for internships and extracted structured data (entities, skills, technologies) using NER. These were stored in a Neo4j graph database and enriched with embedding-based similarity and LLM-based relation inference.
Graph-based querying allowed for smarter filters like “only internships in frontend development without ABAP,” even if these weren’t explicit keywords in the text.
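The idea behind such a filter can be sketched with an in-memory stand-in for the graph. This is a hypothetical simplification — the real system issues Cypher queries against Neo4j, and the internship and skill names below are invented:

```python
# Hypothetical stand-in for the Neo4j graph: internship -> linked skill nodes
graph = {
    "Frontend internship at WebCo": {"frontend", "javascript", "react"},
    "SAP internship at ERPCorp": {"abap", "erp", "backend"},
    "Fullstack internship at StartupX": {"frontend", "abap", "javascript"},
}

def filter_internships(graph: dict[str, set[str]],
                       require: set[str], exclude: set[str]) -> list[str]:
    """Keep internships linked to all required skills and none of the excluded ones."""
    return [
        title for title, skills in graph.items()
        if require <= skills and not (exclude & skills)
    ]

# "only internships in frontend development without ABAP"
print(filter_internships(graph, require={"frontend"}, exclude={"abap"}))
```

Because the constraints operate on extracted entities rather than raw text, an offer is excluded even if "ABAP" only appears as a linked skill node and never in the description the user would see.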
5. LLM-Based Evaluation
How do you know if a semantic search actually works better? I used a local LLM (Gemma-3B) to simulate human judgment by evaluating how well each internship matched sample queries. The results were ranked pairwise and scored using nDCG — a standard metric in ranked retrieval.
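For readers unfamiliar with nDCG, here is a minimal implementation of the standard formulation (gain divided by log2 of the position, normalized by the ideal ordering); the relevance grades in the example are made up:

```python
import math

def dcg(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(relevances: list[float], k: int) -> float:
    """DCG normalized by the best possible (ideal) ordering -> value in [0, 1]."""
    idcg = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / idcg if idcg > 0 else 0.0

# Graded relevance of the top 4 results for one query (3 = perfect match)
ranking = [3, 2, 0, 1]
print(round(ndcg(ranking, k=5), 4))
```

A perfectly ordered result list scores 1.0; placing a relevant hit further down is penalized logarithmically, which matches how users scan result pages.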
How Did It Perform?
Here’s how different retrieval approaches performed in terms of nDCG@5 and nDCG@10 — higher is better:
| Retriever | nDCG@5 | nDCG@10 |
| --- | --- | --- |
| KeywordRetriever | 0.1732 | 0.2672 |
| BM25Retriever | 0.1796 | 0.2746 |
| ChromaRetriever | 0.1935 | 0.3128 |
| ChromaRetriever (mpnet) | 0.2805 | 0.4237 |
| GraphRetriever | 0.1892 | 0.2495 |
| TripleScoringRetriever | 0.2079 | 0.3788 |
| ContextualTripleRetriever | 0.2371 | 0.3140 |
The best performer? A pure vector search using 'all-mpnet-base-v2' and ChromaDB — simple, fast, and remarkably effective.
Why It Matters
With vector search, the engine starts to understand user intent. It can match synonyms, detect context, and even handle complex filters like "exclude ABAP and customer contact." That's simply not possible with keyword search.
Beyond internship listings, this approach can be used in:
- Job portals
- Knowledge base search
- FAQ systems
- Recommendation engines
Try It Yourself
If you’re building a search function, don’t settle for keyword matching. Try vector search.
You can get started in Python using ChromaDB:
```shell
pip install chromadb sentence-transformers
```

Then embed your data with Sentence Transformers:

```python
import chromadb
from chromadb.utils import embedding_functions

# Use the model that performed best in the evaluation
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"
)

chroma = chromadb.Client()
collection = chroma.create_collection(name="internships", embedding_function=embedding_fn)

# Index a couple of example documents
collection.add(
    documents=["Data Science internship using Python", "ABAP development role in ERP"],
    ids=["doc1", "doc2"],
)

# Query semantically: "AI in Python" should rank the data science offer first
results = collection.query(query_texts=["AI in Python"], n_results=2)
print(results)
```
Final Thoughts
Building a semantic search engine might seem like overkill — until you try it and see how much better it feels. No more “0 results” for reasonable queries. No more manual guesswork. Just answers that make sense.
We’re moving into a new era of search — one where understanding beats matching. And thanks to tools like ChromaDB and Sentence Transformers, it's easier than ever to build something smarter.
Want to see how this was used in a real system? Reach out — happy to share insights, minus the confidential bits.