When a user query comes into a RAG system (for example, via a chatbot UI or an API call), the system must retrieve relevant documents and prepare them to feed into the LLM. This involves several sub-steps: query preprocessing, embedding, vector search, and possibly re-ranking of results.

Query preprocessing
This can be minimal: often you just take the raw user question as-is. However, depending on the use case, you might do light cleanup or enhancement:

  • You might remove or correct obvious typos in the query to improve retrieval (though semantic models are somewhat robust to minor typos).
  • If your system supports natural language and keyword queries interchangeably, you could run keyword extraction and later highlight those terms in the retrieved documents (useful for re-ranking). But typically, for pure semantic search, no special tokenization or stopword removal is needed – the embedding model handles that.
  • If the query is part of a conversation, you might prepend relevant context (e.g., the conversation topic or the last user turn) to give more context to retrieval. Some advanced systems form a combined search query from the conversation history.
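
For the conversational case, a minimal sketch might look like the following (the message format and the build_search_query helper are illustrative, not a specific library’s API):

history = [
    {"role": "user", "content": "How do I reset my company email password?"},
    {"role": "assistant", "content": "Use the 'Forgot Password' link on the Outlook login page."},
]

def build_search_query(history, user_question, max_turns=2):
    # Keep only the last few user turns so the combined query stays short and on-topic.
    user_turns = [t["content"] for t in history if t["role"] == "user"][-max_turns:]
    return " ".join(user_turns + [user_question])

search_query = build_search_query(history, "What if I never receive the verification code?")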

Query embedding
Using the same embedding model that was used for indexing, encode the user query into a vector. This is usually fast since it’s a single sentence or question. If using Databricks Vector Search with an endpoint, you would call the endpoint or use its client to embed the query (some systems let you query by raw text and embed it internally). For example, if using FAISS in code:

query = "How can I reset my company email password?"
q_vec = model.encode([query], normalize_embeddings=True)
D, I = index.search(np.array(q_vec, dtype='float32'), k=5)

Now I[0] gives the indices of the top 5 chunks most similar to the query, and D[0] gives their corresponding scores.
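
If you query a Databricks Vector Search index instead of a local FAISS index, a rough equivalent is sketched below (this assumes the databricks-vectorsearch Python client; the endpoint and index names are placeholders, and a managed index with an embedding endpoint can embed the raw query text for you):

from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()
vs_index = client.get_index(endpoint_name="rag_endpoint", index_name="main.docs.support_chunks_index")
results = vs_index.similarity_search(
    query_text=query,              # raw text; the service embeds it for a managed index
    columns=["chunk_id", "text"],  # columns to return from the backing Delta table
    num_results=5,
)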

Retrieve top k documents
The vector index returns the nearest chunks, each with a similarity score (the D from FAISS above, or some distance measure). At this stage, you fetch the actual text of those chunks (and possibly their metadata) to pass forward. With Databricks Vector Search you might get back the actual records directly (since the index knows its underlying table); with a self-managed index like FAISS, you would use the returned indices to join back to the Delta table (or wherever the text lives) and retrieve it.

One subtle point is choosing k, the number of chunks to retrieve. A common setting is k=5 or k=10. Too low and you might miss relevant information (one of the top chunks may be only partially helpful); too high and you risk including irrelevant or redundant text that can confuse the LLM, and you might overflow its context window. A common strategy is to retrieve, say, 10 candidates and then keep only a subset (such as the top 3), based on a score threshold or re-ranking.
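
As a small sketch of that retrieve-wide, use-narrow approach (assuming an inner-product FAISS index over the normalized embeddings from above, so the scores in D are cosine similarities; the 0.3 threshold is purely illustrative):

D, I = index.search(np.array(q_vec, dtype='float32'), k=10)  # retrieve a wider candidate set
candidates = [(score, idx) for score, idx in zip(D[0], I[0]) if score >= 0.3]  # drop weak matches
top_ids = [idx for _, idx in candidates[:3]]  # FAISS returns results sorted by score, so keep the best few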

Document re-ranking
In some cases, the initial vector similarity may not correlate perfectly with relevance, especially if the embedding model isn’t tuned for your specific domain. An optional step is to apply a more expensive cross-encoder re-ranker. This is typically a smaller BERT-style model that takes the query and each retrieved chunk together as input and outputs a relevance score; cross-encoders trained on the MS MARCO passage-ranking dataset (such as the sentence-transformers ms-marco cross-encoders) are common examples. This step can reorder the top-k results or filter out false positives. For instance, if the embedding search returned a chunk that shares vocabulary with the query but is actually about a different topic, a cross-encoder can catch that by reading the full query and chunk text together.
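
A sketch of that re-ranking step using the CrossEncoder class from sentence-transformers (here candidate_chunks is assumed to hold the retrieved chunk texts, and the model name is one publicly available MS MARCO cross-encoder):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, chunk) for chunk in candidate_chunks]
scores = reranker.predict(pairs)  # one relevance score per (query, chunk) pair
reranked = [c for _, c in sorted(zip(scores, candidate_chunks), key=lambda p: p[0], reverse=True)]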

However, re-ranking adds extra latency (another model runs for each candidate). If latency is critical and the embedding model is good, many RAG systems skip this step. Another approach is to rely on the LLM itself: provide it maybe 5-6 chunks and let it figure out which parts to use. Some systems even retrieve more chunks and instruct the LLM in the prompt to pick out the most relevant information. That leads to prompt-engineering solutions like giving each chunk a number and asking the LLM “which of these are relevant?” before answering – but that’s more complex and can usually be avoided by getting retrieval right.
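
If you do go down that route, the prompt-side pattern is easy to sketch (it reuses the retrieved chunk texts; the exact wording of the instruction is just one option):

numbered = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(candidate_chunks))
selection_prompt = (
    f"Question: {query}\n\n"
    f"Candidate passages:\n{numbered}\n\n"
    "List the numbers of the passages that are relevant to answering the question."
)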

Filtering results
Depending on the query or user permissions, you may need to filter out some retrieved documents. For example, if a user doesn’t have access to confidential document A, it should not be surfaced even if it is highly similar to the query. In a unified platform like Databricks, such security filtering can be enforced through catalog permissions before the query runs, or by restricting which data that user’s query is allowed to search. Always enforce these checks; otherwise RAG could inadvertently expose data (the LLM might quote from a document the user isn’t supposed to see!).
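
A hypothetical sketch of that check, assuming each retrieved record carries an allowed_groups field in its metadata (the field name and group values are illustrative):

user_groups = {"employees", "it_helpdesk"}  # groups of the querying user
visible_chunks = [
    r for r in retrieved_records  # retrieved_records: the records returned by the search
    if user_groups & set(r["metadata"].get("allowed_groups", []))
]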

After these steps, we have a set of relevant text chunks that will serve as context.
A simple representation could be:

top_chunks = [chunks[i] for i in I[0]]  # get the actual text of top results

Each chunks[i] might be a few sentences or a paragraph.

For concreteness, imagine the query is “How do I reset my company email password?” and our top 3 retrieved chunks are:

  • Chunk 1 (from an IT FAQ): “...To reset your company email password, go to the Outlook login page and click ‘Forgot Password’. Enter your company username, then follow the verification steps. You will receive a temporary code...”
  • Chunk 2 (from an internal policy): “...Employees must update their passwords every 90 days. The IT department provides a self-service portal for password resets at it.example.com/reset...”
  • Chunk 3 (from a troubleshooting guide): “...If you cannot reset the password via the portal, contact support at ext. 1234. Ensure you have your employee ID ready...”

These are short, relevant pieces of information.