Search is broken. Or rather, keyword search is. It's the go-to method for most systems that help users find relevant information, but it falls flat the moment the wording changes or synonyms are used. In contexts like internship searches, that's a real problem. In this article, I want to give you some insight into the topic of my bachelor's thesis.
Why We Needed a Better Search
Imagine you're a student looking for an internship in AI. You type “AI in Python” into the search bar, but the only hits are those that mention the exact phrase. Offers like “Machine Learning using PyTorch” or “Data Science internship with TensorFlow” go unnoticed, despite being totally relevant.
This happens all the time in internship search portals — including the one I was working with. Most of these systems rely on simple string-matching, making them blind to meaning. That means good matches are missed, and irrelevant results often rise to the top.
I set out to change that.
The Idea: Make Search Semantic
The goal was to build a search engine that understands meaning, not just words. In particular, I wanted to:
- Recognize relevant internships even if the wording doesn't match exactly
- Allow users to search with nuanced queries, like “no customer contact”
- Support future extensions like recommendations and skill-based filtering
To get there, I explored a range of techniques — from traditional keyword methods to state-of-the-art vector search and knowledge graphs.
Step-by-Step: What I Built
This wasn’t just a software project — it followed a rigorous Design Science Research (DSR) process with iterative development, evaluation, and refinement.
1. Start Simple: Keyword Search
As a baseline, I built a classical keyword search engine using a simple tokenizer. It worked… okay. But it missed almost every semantically phrased query.
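To make the baseline concrete, here is a minimal sketch of such a keyword search. The tokenizer and scoring are illustrative, not the exact thesis code: documents are simply ranked by how many query tokens they contain, which is exactly why semantically phrased offers slip through.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_search(query: str, documents: list[str]) -> list[tuple[str, int]]:
    """Rank documents by the number of tokens they share with the query."""
    q_tokens = tokenize(query)
    scored = [(doc, len(q_tokens & tokenize(doc))) for doc in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = [
    "Machine Learning internship using PyTorch",
    "AI internship in Python",
    "Marketing internship with customer contact",
]
# "Machine Learning using PyTorch" scores 0 despite being relevant
print(keyword_search("AI in Python", docs))
```

Note how the PyTorch offer gets a score of zero: no shared tokens, no match — the core weakness this project set out to fix.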
2. Improve with BM25
Next came the BM25 algorithm — a probabilistic ranking model commonly used in search engines. It ranked results slightly better by considering term frequency and document length, but still lacked semantic understanding.
3. Enter Vector Search with ChromaDB
This was the breakthrough. I used ChromaDB, a lightweight open-source vector database, to embed internship descriptions and user queries into a shared semantic space using Sentence Transformers like 'all-MiniLM-L6-v2' and 'all-mpnet-base-v2'.
Queries were transformed into vectors, and results were ranked based on cosine similarity — a measure of closeness in meaning, not just text.
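Cosine similarity itself is simple; the heavy lifting happens in the embedding model. A pure-Python sketch with toy 3-dimensional vectors (real Sentence Transformer embeddings have 384 or 768 dimensions, and the numbers below are made up for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: the ML offer points in nearly the same direction as the query
query_vec = [0.9, 0.1, 0.0]
ml_doc_vec = [0.8, 0.2, 0.1]    # semantically close
abap_doc_vec = [0.0, 0.1, 0.9]  # semantically distant

print(cosine_similarity(query_vec, ml_doc_vec))
print(cosine_similarity(query_vec, abap_doc_vec))
```

Because documents and queries live in the same vector space, "Machine Learning using PyTorch" can end up close to "AI in Python" even with zero shared words.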
4. Go Deep with Knowledge Graphs
To push things further, I built a custom ontology for internships and extracted structured data (entities, skills, technologies) using NER. These were stored in a Neo4j graph database and enriched with embedding-based similarity and LLM-based relation inference.
Graph-based querying allowed for smarter filters like “only internships in frontend development without ABAP,” even if these weren’t explicit keywords in the text.
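The idea behind such a filter can be sketched with an in-memory stand-in for the graph. This is a hypothetical simplification — the real system issues Cypher queries against Neo4j, and the internship and skill names below are invented:

```python
# Hypothetical stand-in for the Neo4j graph: internship -> linked skill nodes
graph = {
    "Frontend internship at WebCo": {"frontend", "javascript", "react"},
    "SAP internship at ERPCorp": {"abap", "erp", "backend"},
    "Fullstack internship at StartupX": {"frontend", "abap", "javascript"},
}

def filter_internships(graph: dict[str, set[str]],
                       require: set[str], exclude: set[str]) -> list[str]:
    """Keep internships linked to all required skills and none of the excluded ones."""
    return [
        title for title, skills in graph.items()
        if require <= skills and not (exclude & skills)
    ]

# "only internships in frontend development without ABAP"
print(filter_internships(graph, require={"frontend"}, exclude={"abap"}))
```

Because the constraints operate on extracted entities rather than raw text, an offer is excluded even if "ABAP" only appears as a linked skill node and never in the description the user would see.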
5. LLM-Based Evaluation
How do you know if a semantic search actually works better? I used a local LLM (Gemma-3B) to simulate human judgment by evaluating how well each internship matched sample queries. The results were ranked pairwise and scored using nDCG — a standard metric in ranked retrieval.
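For readers unfamiliar with nDCG, here is a minimal implementation of the standard formulation (gain divided by log2 of the position, normalized by the ideal ordering); the relevance grades in the example are made up:

```python
import math

def dcg(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(relevances: list[float], k: int) -> float:
    """DCG normalized by the best possible (ideal) ordering -> value in [0, 1]."""
    idcg = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / idcg if idcg > 0 else 0.0

# Graded relevance of the top 4 results for one query (3 = perfect match)
ranking = [3, 2, 0, 1]
print(round(ndcg(ranking, k=5), 4))
```

A perfectly ordered result list scores 1.0; placing a relevant hit further down is penalized logarithmically, which matches how users scan result pages.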
How Did It Perform?
Here’s how different retrieval approaches performed in terms of nDCG@5 and nDCG@10 — higher is better:
| Retriever | nDCG@5 | nDCG@10 |
| --- | --- | --- |
| KeywordRetriever | 0.1732 | 0.2672 |
| BM25Retriever | 0.1796 | 0.2746 |
| ChromaRetriever | 0.1935 | 0.3128 |
| ChromaRetriever (mpnet) | 0.2805 | 0.4237 |
| GraphRetriever | 0.1892 | 0.2495 |
| TripleScoringRetriever | 0.2079 | 0.3788 |
| ContextualTripleRetriever | 0.2371 | 0.3140 |
The best performer? A pure vector search using 'all-mpnet-base-v2' and ChromaDB — simple, fast, and remarkably effective.
Why It Matters
With vector search, the engine starts to understand user intent. It can match synonyms, detect context, and even handle complex filters like "exclude ABAP and customer contact." That's simply not possible with keyword search.
Beyond internship listings, this approach can be used in:
- Job portals
- Knowledge base search
- FAQ systems
- Recommendation engines
Try It Yourself
If you’re building a search function, don’t settle for keyword matching. Try vector search.
You can get started in Python using ChromaDB:
```shell
pip install chromadb sentence-transformers
```

Then embed your data with Sentence Transformers:

```python
import chromadb
from chromadb.utils import embedding_functions

# Use the model that performed best in the evaluation
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"
)

chroma = chromadb.Client()
collection = chroma.create_collection(name="internships", embedding_function=embedding_fn)

# Index a couple of example documents
collection.add(
    documents=["Data Science internship using Python", "ABAP development role in ERP"],
    ids=["doc1", "doc2"],
)

# Query semantically: "AI in Python" should rank the data science offer first
results = collection.query(query_texts=["AI in Python"], n_results=2)
print(results)
```
Final Thoughts
Building a semantic search engine might seem like overkill — until you try it and see how much better it feels. No more “0 results” for reasonable queries. No more manual guesswork. Just answers that make sense.
We’re moving into a new era of search — one where understanding beats matching. And thanks to tools like ChromaDB and Sentence Transformers, it's easier than ever to build something smarter.
Want to see how this was used in a real system? Reach out — happy to share insights, minus the confidential bits.