Problem
AI assistants can hallucinate or answer unsupported questions when they respond without checking whether the available evidence actually supports the answer.
Prototype / Academic Project
A small explainable Retrieval-Augmented Generation prototype that demonstrates grounded answer behavior, retrieval thresholds, and refusal when local evidence is insufficient.
At a glance

- Context: Prototype / Academic Project
- Current state: Local RAG prototype
- Role: Sole builder for ingestion, retrieval, thresholding, generation modes, and benchmark notes
Screenshot placeholder
Actual screenshots are not included in this repository yet. This placeholder avoids inventing visuals while reserving space for dashboard, terminal, or demo evidence.
- Local source documents: 7
- Searchable chunks: 15
- TF-IDF matrix shape: 15 × 772
- Retrieval method: TF-IDF + cosine similarity
I built a local RAG prototype that ingests text files, chunks them, indexes the chunks with TF-IDF, retrieves evidence with cosine similarity, checks a minimum relevance threshold, and only answers when the retrieved context clears that threshold. When the local evidence is insufficient, the assistant refuses instead of guessing.
Indexed 7 local source documents into 15 searchable chunks and produced 7/7 useful retrieval/refusal decisions on a small sanity benchmark using TF-IDF + cosine similarity with a default 0.12 threshold.
These numbers describe project artifacts and sanity checks. They are not client ROI, deployment adoption, actuarial accuracy, or broad model-accuracy claims.
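The thresholded retrieve-or-refuse loop described above can be sketched in plain Python. This is an illustrative stand-in, not the repository's code: the tokenizer, the smoothed IDF variant, and the two sample chunks are assumptions, while the 0.12 default threshold comes from the project's own numbers.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # naive lowercase word tokenizer (illustrative assumption)
    return re.findall(r"[a-z0-9]+", text.lower())

class TfidfIndex:
    """Minimal TF-IDF index over text chunks, stdlib only."""

    def __init__(self, chunks):
        self.chunks = chunks
        docs = [Counter(tokenize(c)) for c in chunks]
        n = len(docs)
        df = Counter()
        for d in docs:
            df.update(d.keys())
        # smoothed IDF, one common TF-IDF variant
        self.idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
        self.vecs = [self._vec(d) for d in docs]

    def _vec(self, counts):
        # L2-normalized tf-idf vector; unknown terms are dropped
        vec = {t: tf * self.idf[t] for t, tf in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {t: w / norm for t, w in vec.items()}

    def query(self, text, k=3):
        # cosine similarity reduces to a dot product of unit vectors
        q = self._vec(Counter(tokenize(text)))
        scores = [(sum(q.get(t, 0.0) * w for t, w in v.items()), i)
                  for i, v in enumerate(self.vecs)]
        scores.sort(reverse=True)
        return [(s, self.chunks[i]) for s, i in scores[:k]]

def answer_or_refuse(index, question, threshold=0.12):
    """Answer only when the best chunk clears the relevance threshold."""
    hits = index.query(question)
    if not hits or hits[0][0] < threshold:
        return None  # refuse: local evidence is insufficient
    return hits[0][1]  # grounded answer context
```

An in-domain question clears the threshold and returns evidence; an off-topic question scores near zero and is refused, which is the refusal boundary the prototype is built around.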
- Local source documents: 7
- Searchable chunks: 15
- TF-IDF matrix shape: 15 × 772
- Retrieval method: TF-IDF + cosine similarity
- Default minimum relevance threshold: 0.12
- Sanity benchmark: 7/7 useful retrieval/refusal decisions
The pipeline is shown as explicit stages so the system boundary is inspectable.

1. Seven local source documents provide the bounded knowledge corpus.
2. Text files are loaded into the prototype for local processing.
3. Documents are split into 15 searchable chunks.
4. The chunk corpus becomes a 15 × 772 TF-IDF matrix.
5. Queries retrieve the top-k chunks by lexical similarity.
6. The assistant answers only when the retrieved context clears the minimum relevance threshold.
7. In-domain questions are answered from local evidence; unsupported questions are refused.
Where the pattern matters
- Available artifacts are labeled directly. Missing visuals stay as placeholders until real screenshots are added.
- The prototype demonstrates both grounded answers and explicit refusal when the local corpus does not support a question.
- The flow is intentionally small and inspectable: local files become chunks, chunks become TF-IDF features, retrieval is thresholded, and generation depends on retrieved evidence.
- The main applied lesson is the refusal boundary, not broad RAG accuracy.
- The evidence is intentionally modest and quantified as a sanity benchmark.
What this does not claim
Reasonable next steps
A staged view of how RAGeATM could grow from simple lexical retrieval into a more measurable, semantic, context-aware, and eventually multimodal research-assistant harness.
RAGeATM is currently best understood as a small but useful RAG prototype: enough to demonstrate retrieval, grounding, and evaluation discipline, but not yet a production research platform. The next work is not simply to make it bigger. The stronger path is to make retrieval more measurable, reproducible, semantic, and context-aware while avoiding overclaims about what current AI systems truly understand.
TF-IDF and BM25 retrieve based primarily on lexical overlap, while embedding-based and LLM-assisted retrieval can better capture semantic similarity, paraphrase, and conceptual relevance. This makes them more capable of retrieving documents related to the user’s underlying intent, although they should not be described as fully understanding the ‘question beneath the question’ in a human sense.
LLMs can approximate deeper intent by modeling semantic context, conversational history, and inferred goals, but this remains probabilistic pattern-based reasoning rather than true human understanding.
| Level | System type | What it compares | Meaning captured | Question under the question ability | Personal context ability | Real-world grounding | Best use case | Fatal weakness |
|---|---|---|---|---|---|---|---|---|
| 1 | Exact keyword search | Literal word/string overlap | 5% | 0% | 0% | 0% | Finding exact names, IDs, phrases, codes | Misses anything phrased differently |
| 2 | TF-IDF | Weighted term overlap | 10-20% | 0-5% | 0% | 0% | Simple document retrieval where vocabulary matches | No real semantics; treats text as bag-of-words |
| 3 | BM25 | Improved keyword relevance with saturation/length normalization | 20-35% | 5% | 0% | 0% | Strong classic search baseline | Still mostly lexical; synonyms and paraphrases are weak |
| 4 | Static embeddings | Word/document vectors learned from language patterns | 35-50% | 10-20% | 0-5% | 0% | Finding semantically related text | Limited context sensitivity |
| 5 | Modern embedding models | Query/document meaning vectors | 55-75% | 25-45% | 5-15% | 0-5% | RAG retrieval, semantic search, paraphrase matching | Can retrieve conceptually similar but wrong context |
| 6 | Hybrid search | BM25 + embeddings | 65-85% | 30-50% | 5-15% | 0-5% | Serious RAG systems | More complex; requires tuning and evaluation |
| 7 | Reranked retrieval | Initial retrieval + LLM/cross-encoder relevance judgment | 75-90% | 40-60% | 10-20% | 0-5% | High-quality RAG retrieval | Slower/costlier; still depends on retrieved candidates |
| 8 | LLM reading retrieved context | Retrieved docs + generated reasoning | 80-95% for answer synthesis | 50-70% | 15-35% | 0-10% | Answering from documents with explanation | Can hallucinate, overgeneralize, or sound more certain than it is |
| 9 | LLM with memory/user profile | Query + history + user goals + documents | 80-95% | 65-80% | 50-75% | 5-15% | Personalized assistants, tutoring, coaching, project guidance | Risk of assuming too much about the user |
| 10 | Agentic AI with tools | Text + memory + documents + actions + external systems | 85-95% | 70-85% | 60-80% | 20-45% | Research assistants, workflow automation, coding agents | Tool errors, bad planning, weak verification |
| 11 | Multimodal grounded AI | Text + vision + audio + environment + actions | 85-98% | 75-90% | 70-85% | 50-75% | Real-world assistance, robotics, field analysis | Still not human lived experience |
| 12 | Human-level social/contextual understanding | Language + memory + embodiment + relationships + lived experience | 95-100% | 90-100% | 90-100% | 90-100% | Real relational discernment | Current AI does not truly have this |
These percentages are heuristic gauges, not universal benchmark results. They are meant to communicate increasing capability scope, not claim exact measured performance.
| Method | What it really knows |
|---|---|
| TF-IDF | "These documents share important words with the query." |
| BM25 | "These documents share important words in a more search-optimized way." |
| Embeddings | "These documents are conceptually close to the query." |
| Hybrid retrieval | "These documents match both the words and the meaning." |
| Reranking | "Of the retrieved documents, these are probably most relevant to the user’s actual question." |
| LLM + memory | "Given this user’s history, goals, and wording, this may be what they are really asking." |
| Grounded AI | "Given the person’s behavior, environment, constraints, and history, this is probably the deeper issue." |
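The step from level 2 to level 3 of the ladder is a scoring change, not an architectural one. A minimal Okapi BM25 scorer, with the standard `k1`/`b` defaults (tuning values are corpus-dependent assumptions), shows the term-saturation and length normalization that TF-IDF lacks:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against a query with Okapi BM25.

    k1 controls term-frequency saturation; b controls document-length
    normalization. Defaults are the commonly cited starting values.
    """
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))  # document frequency per term
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Comparing this against the current TF-IDF retriever on the same sanity benchmark would be a low-cost first rung of the ladder before moving to embeddings or hybrid search.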
The practical future work for RAGeATM is to climb this ladder carefully: first by improving reproducibility and evaluation, then by comparing lexical, embedding, hybrid, and reranked retrieval, then by testing whether memory, user goals, and multimodal inputs actually improve retrieval quality without creating unjustified confidence.
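Making that climb measurable starts with a repeatable version of the 7/7 sanity check. A tiny harness in this spirit (the `decide` callable and labeled cases are hypothetical, not the project's benchmark data) only needs each question paired with whether the system should answer or refuse:

```python
def run_sanity_benchmark(decide, cases):
    """Score retrieval/refusal decisions against expected behavior.

    decide: callable taking a question and returning True if the
        system answers, False if it refuses.
    cases: list of (question, should_answer) pairs.
    Returns (num_correct, total), e.g. (7, 7) on a perfect run.
    """
    correct = sum(1 for question, should_answer in cases
                  if decide(question) == should_answer)
    return correct, len(cases)
```

Because the harness only depends on the `decide` boundary, the same labeled cases can score TF-IDF, BM25, embedding, and hybrid retrievers side by side.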
The public code link is provided for review of the prototype and technical approach. This does not represent paid deployment, production adoption, or client ROI unless stated elsewhere on the page.
More portfolio context.
A Minnesota severe-weather analytics dashboard that turns large NOAA weather datasets into county-level risk views, cleaned analytics layers, and decision-support reporting surfaces.
Workflow orchestration layer in active development for managing state, decision flow, and human review inside StormIQ.