Local RAG Pipeline for AI Research Papers
March 2025
What it does
Given a question like “What is the attention mechanism?” or “How does chain-of-thought prompting work?”, this pipeline retrieves the most relevant excerpts from a curated set of AI research papers and generates a grounded answer — citing its sources, refusing to speculate beyond them.
The five papers indexed:
- Attention Is All You Need — Vaswani et al., 2017
- Language Models Are Few-Shot Learners (GPT-3) — Brown et al., 2020
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Lewis et al., 2020
- Chain-of-Thought Prompting Elicits Reasoning in LLMs — Wei et al., 2022
- Sequence to Sequence Learning with Neural Networks — Sutskever et al., 2014
Architecture
The pipeline has two phases:
Ingest (one-time)
PDFs are loaded, split into 500-character chunks with 50-character overlap, embedded using all-MiniLM-L6-v2, and stored in a local ChromaDB vector database. The result: 607 chunks, persisted to disk.
Query (interactive)
A question is embedded with the same model, then used to retrieve the top 4 most semantically similar chunks from ChromaDB. Those chunks are injected into a prompt and sent to a locally-running LLM via LM Studio's OpenAI-compatible API. The answer is returned with source attribution.
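The prompt-injection step can be sketched in plain Python. The function name and template below are illustrative, not the project's actual code; the real pipeline's chunk format may differ.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded RAG prompt from retrieved chunks.

    Each chunk is assumed (for this sketch) to be a dict with
    'text' and 'source' keys.
    """
    context = "\n\n".join(f"[{c['source']}]\n{c['text']}" for c in chunks)
    return (
        "Answer the question using ONLY the excerpts below. "
        "Cite the source filenames you used. If the excerpts do not "
        "contain the answer, say so rather than speculating.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The explicit refusal instruction is what keeps the model from speculating beyond its sources.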
Nothing leaves the machine at any point.
Key decisions
Chunking: 500 chars, 50-char overlap.
RecursiveCharacterTextSplitter splits on paragraph and sentence boundaries before character boundaries. The 10% overlap ensures sentences spanning chunk edges aren’t silently dropped.
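The overlap behaviour can be illustrated with a deliberately simplified fixed-window chunker; the real RecursiveCharacterTextSplitter prefers paragraph and sentence boundaries before falling back to raw character positions like this.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking: each chunk repeats the last
    `overlap` characters of the previous one, so a sentence
    straddling a boundary survives in at least one chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With the defaults, consecutive chunks share exactly 50 characters.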
Embedding model: all-MiniLM-L6-v2.
22 MB, runs on CPU, trained on over a billion sentence pairs. Sufficient for semantic search over curated academic text, with no API keys and no network egress.
Retrieval: k=4 chunks. Four chunks is roughly 2,000 characters — enough context for a coherent answer without overwhelming the LLM’s context window or diluting retrieval precision.
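Conceptually, retrieval reduces to cosine-similarity top-k over the stored embedding vectors. A minimal pure-Python sketch (ChromaDB's HNSW index computes an approximate version of this, far faster):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 4) -> list[int]:
    """Return indices of the k chunks most similar to the query."""
    order = sorted(range(len(chunk_vecs)),
                   key=lambda i: cosine(query_vec, chunk_vecs[i]),
                   reverse=True)
    return order[:k]
```

The returned indices map back to chunk text and source metadata stored alongside the vectors.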
Temperature: 0.2. Low randomness keeps answers factual and reproducible. A Q&A system over research papers has no use for creativity.
LLM runtime: LM Studio + GGUF. Avoids CUDA dependencies. Any quantised model that fits in RAM works; the project was tested with DeepSeek-R1-Distill-Llama-8B.
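A request to LM Studio's OpenAI-compatible endpoint is an ordinary chat-completion payload. This sketch builds the payload without sending it; the model name and port shown are LM Studio defaults and may differ in your setup.

```python
import json

def build_request(prompt: str, model: str = "deepseek-r1-distill-llama-8b") -> dict:
    """Chat-completion payload for a local /v1/chat/completions endpoint.

    Send with any HTTP client, e.g.:
    requests.post("http://localhost:1234/v1/chat/completions", json=payload)
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # low randomness: factual, reproducible answers
    }

payload = build_request("What is the attention mechanism?")
print(json.dumps(payload, indent=2))
```

Because the API shape matches OpenAI's, the official `openai` client also works by pointing its `base_url` at localhost.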
Examples
You: What is the attention mechanism?
Answer: The attention mechanism allows the model to weigh the relevance of
each input token when producing each output token. Rather than compressing
the entire input into a fixed-length vector, attention computes a weighted
sum over all encoder hidden states, using learned query, key, and value
projections to determine relevance.
Sources: NIPS-2017-attention-is-all-you-need-Paper.pdf
You: What is chain-of-thought prompting and why does it help?
Answer: Chain-of-thought prompting encourages a language model to produce
intermediate reasoning steps before giving a final answer. Wei et al. show
that this significantly improves performance on arithmetic and commonsense
reasoning tasks, particularly in larger models — the benefit is minimal
below ~100B parameters.
Sources: chain-of-thought-prompting.pdf
Stack
- Python 3.10+
- LangChain + LangChain-Community
- ChromaDB (SQLite-backed, HNSW index)
- sentence-transformers (all-MiniLM-L6-v2)
- LM Studio (local OpenAI-compatible API)
- PyPDF