Now we come to the generative part: using a Large Language Model to produce the final answer, augmented with the retrieved context. To do this, we need to integrate an LLM into our pipeline. There are a few ways to do this, depending on your resources and requirements:

  • Hosted API (external)
    The simplest option is to use a hosted API such as OpenAI’s GPT-4 or Azure OpenAI. You’d send the prompt (including the retrieved documents) to the API and get back a completion (see the sketch after this list). This offers access to very powerful models, but has drawbacks: data leaves your environment (which may be a compliance issue), there are costs per call, and latency can vary. You also must trust the external service with potentially sensitive context.
  • Open-source or proprietary model (self-hosted)
    Many organizations choose to host an LLM themselves, especially with high-quality open models available (Llama 2, Falcon, MPT, etc., or custom models). NVIDIA’s NeMo framework, for example, provides pretrained LLMs (like the Megatron family) that you can deploy. Databricks released Dolly 2.0 (a smaller instruction-tuned model). You could run these on an AWS GPU instance (like g5 or p4d instances) or on Databricks’ GPU clusters. Hosting yourself gives more control over data and the ability to customize the model.
  • Databricks Model Serving
    Databricks allows hosting models as REST endpoints. You could package your LLM (perhaps a HuggingFace transformer) and serve it behind an endpoint within your workspace. This is similar to self-hosting but managed by Databricks (scales with your cluster and integrates with MLflow for versioning). If your model is small enough or you have enough GPUs attached to the endpoint, this can serve real-time queries.
  • NVIDIA Triton Inference Server
    NVIDIA offers Triton, an inference server optimized for deploying AI models (including LLMs). Triton can be run in a Docker container on AWS (on a GPU instance or in Kubernetes). It supports dynamic batching and concurrent model execution, and is optimized to work with TensorRT (we will discuss optimization soon). Triton could host the LLM and even the embedding model, exposing an API for both.
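
To make the hosted-API route concrete, here is a minimal sketch using the OpenAI Python SDK (v1+). It assumes an OPENAI_API_KEY in the environment and a prompt string that already contains the retrieved context (we build such a prompt below); the same request shape applies to Azure OpenAI with a different client configuration.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4",  # or your Azure OpenAI deployment name
    messages=[{"role": "user", "content": prompt}],  # prompt already includes the retrieved context
    temperature=0,  # deterministic answers for consistency
)
print(response.choices[0].message.content)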

Whichever deployment method you choose, the interaction pattern remains the same: construct a prompt that includes the retrieved texts and ask the model to answer the query using them.

A typical prompt template for RAG is:

"You are an expert assistant with access to the following context information. 
Context:
[Document 1 text]

[Document 2 text]

[Document 3 text]

Using ONLY the provided context, answer the question at the end. If the context does not have the answer, say you don't know.

Question: <<<USER QUESTION>>>
Answer:"

We instruct the model to use only the context (to reduce hallucinations) and handle unknowns gracefully. How you format the context is important – you should delineate documents clearly (perhaps with numbering or titles if available). Ensure the prompt isn’t too close to the model’s token limit.
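
One way to keep the formatting consistent and stay within the token budget is to assemble the prompt with a small helper. The sketch below is illustrative only: build_prompt and max_prompt_tokens are not part of any library, and the tokenizer is the same Hugging Face tokenizer we load in the next snippet. It numbers each chunk and drops the lowest-ranked chunk if the prompt runs over budget.

def build_prompt(question, docs, tokenizer, max_prompt_tokens=1800):
    # Number each retrieved chunk so the model can tell documents apart
    context = "\n\n".join(f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(docs))
    prompt = (
        "You are an expert assistant with access to the following context information.\n"
        f"Context:\n{context}\n\n"
        "Using ONLY the provided context, answer the question at the end. "
        "If the context does not have the answer, say you don't know.\n\n"
        f"Question: {question}\nAnswer:"
    )
    # If the prompt exceeds the budget, drop the lowest-ranked chunk and retry
    if len(tokenizer.encode(prompt)) > max_prompt_tokens and len(docs) > 1:
        return build_prompt(question, docs[:-1], tokenizer, max_prompt_tokens)
    return prompt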

Let’s illustrate with a Python snippet using an open-source model via Hugging Face Transformers. For demonstration, we’ll use a small instruct model (note: in practice we’d use a much larger model for quality):

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "databricks/dolly-v2-3b"  # Dolly v2 3B as an example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

context_docs = "\n\n".join(top_chunks)  # join retrieved chunks with double newlines
prompt = (f"You are a helpful assistant with access to the following context:\n{context_docs}\n"
          f"Using only this information, answer the question: {user_question}\nAnswer:")

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
# Decode only the newly generated tokens (generate() returns the prompt followed by the answer)
answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)

In this code, we load the Dolly v2 3B model (which is instruction-tuned for Q&A-style responses). We then build the prompt by inserting the retrieved documents and the user question. Finally, we call model.generate and decode only the newly generated tokens (the output of generate includes the prompt tokens, so we slice them off before decoding). We turn off sampling (do_sample=False) to get a deterministic answer (helpful for consistency).

In practice, Dolly 3B may not produce a perfect answer for a complex question – it’s just an example. If we were using Llama-2 70B or GPT-4 via API, the quality would be higher. The integration logic stays the same regardless of model.

Using NVIDIA NeMo models

If you opt for NVIDIA’s stack, you might use a model like NeMo Megatron 20B or 70B (if available), which has been trained on broad data. You could fine-tune it on your domain if needed, though RAG often reduces the need to fine-tune the base LLM since we supply the facts at query time; fine-tuning may still be useful for style or for understanding domain questions better. NeMo also offers the NeMo Service framework to serve models via microservices – for instance, the NeMo inference server can expose an endpoint for the LLM that your application calls, similar to Triton.
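
As a rough illustration of what the application-side call might look like, the sketch below posts a prompt to a Triton-style HTTP endpoint. The route follows Triton’s generate extension, and the text_input / text_output / max_tokens fields match the TensorRT-LLM backend’s default schema; your actual host, model name, and payload fields depend on how the model was deployed, so treat them as assumptions.

import requests

# Hypothetical endpoint; replace the host and model name with your deployment's values
url = "http://triton-host:8000/v2/models/llm/generate"
payload = {"text_input": prompt, "max_tokens": 200}

resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["text_output"])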

Customization and Alignment

One advantage of hosting your own LLM is that you can customize it. Using NeMo Customizer, enterprises can fine-tune and align LLMs efficiently via Low-Rank Adaptation (LoRA) or prompt-tuning techniques. LoRA adapters can, for example, teach the model to better follow your prompt format or adhere to company style guidelines, without updating all of the model’s weights. If you find the model isn’t integrating the retrieved evidence well, you could fine-tune on a few examples of question + relevant docs -> answer, effectively teaching it how to do RAG-style Q&A. Similarly, guardrails can be implemented at the model level (NeMo Guardrails or OpenAI’s system messages) to avoid certain content.
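
For example, with the Hugging Face PEFT library you could attach LoRA adapters to the Dolly model loaded earlier and train only those low-rank matrices on your question + retrieved-docs -> answer examples. This is only a configuration sketch (the training loop itself is omitted), and the target_modules value is an assumption matching GPT-NeoX-style models such as Dolly; other architectures use different module names.

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the LoRA updates
    lora_dropout=0.05,
    target_modules=["query_key_value"],   # attention projections in GPT-NeoX-style models (assumption)
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)  # wraps the base model; only adapter weights are trainable
peft_model.print_trainable_parameters()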

At runtime, once the LLM generates an answer, we may want to post-process it (next part) before returning it to the user. But at this stage, we have essentially the heart of RAG: the LLM has produced an answer using external data. For example, with our earlier scenario, the answer might be: “To reset your company email password, visit the Outlook login page and click ‘Forgot Password’. If that doesn’t work, use the IT self-service portal (it.example.com/reset) or contact IT support at ext. 1234.” – notice it combines info from multiple retrieved chunks to form a helpful answer.