
How to Build a Local RAG with PDFs Using Ollama, LangChain, Chroma, and Gradio

Apr 17, 2026 · by Dionisio


Ollama + LangChain + Chroma

A private AI assistant for your PDFs, running on your machine with no usage fees

If you want to move beyond generic chat and build something that feels like a real AI application, this is a great project to start with. In less than 30 minutes you can assemble a local RAG pipeline that reads PDFs, indexes their contents, and answers questions with actual context.


Everything here runs 100% locally, with no external API, no token costs, and full privacy.

The result is an AI assistant that reads ebooks, documents, technical articles, manuals, or any folder of PDFs and answers questions as if it were focused on that material alone.

This architecture is called RAG (Retrieval-Augmented Generation), and it remains one of the most useful foundations for real AI applications in 2026. The best part is that you do not just end up with something portfolio-worthy; you also learn the fundamentals along the way.

Why this project is worth building:

  • RAG is one of the most common AI application architectures in real products, not just chat demos
  • running locally gives you privacy, control, and zero marginal cost
  • you learn embeddings, vector search, chunking, retrieval, and prompting in practice
  • the final result is a strong project for AI, automation, or intelligent backend portfolios

What you will learn:

  • how to use Ollama to run local models
  • how to build a RAG pipeline with LangChain
  • how to use Chroma as a vector database
  • how to generate embeddings with nomic-embed-text
  • how to launch a simple and polished interface with Gradio
  • how to connect all of that into an application that feels useful beyond the tutorial

What you will need:

  • Python 3.10+
  • Ollama installed from ollama.com
  • at least 8 GB of free RAM, ideally 16 GB
  • one or more PDFs to test with

After installing Ollama, open the terminal and run:

Terminal window
ollama pull nomic-embed-text
ollama pull llama3.2

This usually takes a few minutes.

If your machine can handle it, try qwen2.5:7b later. It often performs better.
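
If you decide to try it later, pull it the same way and change the model name passed to the chat model in the app code further down. Keep in mind that a 7B model needs noticeably more RAM than llama3.2:

Terminal window
ollama pull qwen2.5:7b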

Terminal window
mkdir rag-local-pdf
cd rag-local-pdf
mkdir pdfs

Then place your files inside the pdfs folder.

Terminal window
python -m venv venv
# Windows
venv\Scripts\activate
# Mac/Linux
source venv/bin/activate

If everything worked, you should see something like this in the terminal:

Terminal window
(venv) PS C:\Projects\rag-local-pdf>
Terminal window
pip install langchain langchain-community langchain-ollama langchain-chroma pypdf gradio

This can take a little while because there is quite a bit to download.
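
Before writing the full app, a quick sanity check can save debugging time later. This optional snippet is not part of the tutorial's app.py; it assumes Ollama is running in the background and that both models were pulled. It embeds a test string and asks the chat model for a one-word reply:

from langchain_ollama import ChatOllama, OllamaEmbeddings

# Embed a test string with the local embedding model
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vector = embeddings.embed_query("hello world")
print(f"Embedding dimension: {len(vector)}")

# Ask the local chat model for a short reply
llm = ChatOllama(model="llama3.2")
print(llm.invoke("Reply with the single word: ok").content)

If both prints come back without errors, the environment is ready.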

Save this code as app.py:

import os

import gradio as gr
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

PERSIST_DIR = "./chroma_db"
PDF_FOLDER = "./pdfs"

# Local models served by Ollama: one for embeddings, one for answering
embeddings = OllamaEmbeddings(model="nomic-embed-text")
llm = ChatOllama(model="llama3.2", temperature=0.3)

prompt_template = """You are a helpful and precise assistant.
Answer ONLY based on the context below. If you do not know, say "I do not have enough information".

Context:
{context}

Question: {question}

Answer:"""
prompt = ChatPromptTemplate.from_template(prompt_template)

# Index the PDFs only once; delete ./chroma_db to force a re-index after adding new files
if not os.path.exists(PERSIST_DIR):
    print("Indexing PDFs for the first time...")
    loader = PyPDFDirectoryLoader(PDF_FOLDER)
    docs = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
    splits = text_splitter.split_documents(docs)
    vectorstore = Chroma.from_documents(
        documents=splits,
        embedding=embeddings,
        persist_directory=PERSIST_DIR,
    )
    print(f"{len(splits)} chunks indexed!")
else:
    # Reuse the existing vector database on subsequent runs
    vectorstore = Chroma(
        persist_directory=PERSIST_DIR,
        embedding_function=embeddings,
    )
    print("Vector database already exists.")

# Retrieve the 4 most relevant chunks for each question
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


# Retrieval -> prompt -> local LLM -> plain text answer
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)


def chat(message, history):
    response = rag_chain.invoke(message)
    return response


with gr.Blocks(
    title="My Local AI - Chat with PDFs",
    theme=gr.themes.Soft(),
) as demo:
    gr.Markdown(
        "# Local RAG with PDFs\nAsk anything about the PDFs in the `./pdfs` folder"
    )
    gr.ChatInterface(
        fn=chat,
        title="Chat with your documents",
        description="Everything runs on your machine. Private and free.",
        examples=[
            "What is the main point of the document?",
            "Summarize the text in 3 sentences",
            "What does it say about [topic from your PDF]?",
        ],
    )

if __name__ == "__main__":
    demo.launch(share=False)
Terminal window
python app.py

On the first run, it will load and index every PDF in the folder. Depending on the file sizes, this can take a few minutes and use a fair amount of CPU and memory.

If resource usage spikes, that is expected.

When indexing finishes, open the link printed in the terminal. It usually looks like this:

http://127.0.0.1:7860

This opens a friendly chat-style interface in the browser, except it is running on your own machine.
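
If port 7860 is already in use, or you want to reach the interface from another device on your network, Gradio's launch() accepts optional host and port arguments. A small variation (the values here are just examples):

demo.launch(
    share=False,
    server_name="0.0.0.0",  # listen on all interfaces so other LAN devices can connect
    server_port=7861,       # use a different port if 7860 is busy
)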

From there, the workflow is simple: ask questions about the PDFs and watch the model answer using retrieved context.

Response time depends a lot on your hardware. It will be slower than a paid cloud LLM, but in exchange you get privacy, predictable cost, and a much more concrete understanding of the architecture.

At a high level, the flow is:

  1. the PDFs are loaded
  2. the text is split into chunks
  3. each chunk becomes an embedding
  4. the embeddings are stored in Chroma
  5. when you ask a question, the system retrieves the most relevant chunks
  6. the model answers using that context

That is the core of almost every serious RAG application.
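
If you want to see steps 4 and 5 in isolation, you can query the vector store directly, without the LLM. A small exploratory sketch, meant to be run from a Python shell in the project folder after the index exists (it reuses ./chroma_db and the same embedding model):

from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings

# Open the index that app.py persisted to disk
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

# Retrieve the 4 chunks most similar to a question, along with distance scores
results = vectorstore.similarity_search_with_score("What is the main point?", k=4)
for doc, score in results:
    print(f"score={score:.3f}  {doc.metadata.get('source')}  page {doc.metadata.get('page')}")
    print(doc.page_content[:200])
    print()

With Chroma's default settings, a lower score means a closer match, so the first results are the chunks the model would see as context.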

Once this is working, you can evolve the project with:

  • PDF upload through the interface
  • conversation memory
  • citations with source and page number (see the sketch after this list)
  • support for multiple collections
  • model switching for performance comparison
  • more refined index persistence
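
The citation idea is mostly a metadata question: PyPDFDirectoryLoader already records the file path and page number on every chunk. A minimal sketch, assuming the chain from app.py, where format_docs_with_sources is a hypothetical replacement for format_docs:

def format_docs_with_sources(docs):
    # Each chunk carries the originating file and page in doc.metadata
    parts = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown file")
        page = doc.metadata.get("page", "?")
        parts.append(f"[{source}, page {page}]\n{doc.page_content}")
    return "\n\n".join(parts)

# Then build the chain with it instead of format_docs:
# {"context": retriever | format_docs_with_sources, "question": RunnablePassthrough()}

You would also want to adjust the prompt so the model is asked to cite the bracketed sources in its answer.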

If you want a straightforward project for learning applied AI without relying on a paid API, this is one of the best places to start.

And when you open the project in the browser, the interface looks like this:

Local RAG chat UI running in the browser with a question about an algorithm and the generated answer.