rag-core is an opinionated but modular RAG (Retrieval-Augmented Generation) pipeline that handles the full journey from raw documents to grounded LLM responses. It ships with smart defaults while keeping every component swappable.
Ingest text, markdown, PDF, or CSV files into a standard Document format
Split documents into overlapping chunks using recursive, semantic, or fixed-size strategies
Convert chunks to vector embeddings via local models or the OpenAI API
Index embeddings for fast similarity search using in-memory numpy or ChromaDB
Find the top-k relevant chunks and re-rank them with metadata boosters
Assemble a grounded prompt with retrieved context and source citations
Recursive chunking tries the largest natural boundaries first (double newlines, then single newlines, then sentences) before falling back to character splits. This preserves semantic coherence better than fixed-size chunking while still guaranteeing a maximum chunk size. In contrast, fixed-size chunks can split mid-sentence, degrading retrieval precision. Semantic chunking works well for structured content like markdown, but can produce unpredictable chunk sizes. Recursive chunking balances precision, context preservation, and compute cost.
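The fallback behavior described above can be sketched in a few lines. This is a simplified illustration of the recursive strategy, not rag-core's actual implementation: try each separator in order, merge pieces up to the size limit, and hard-split on characters only as a last resort.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ")):
    """Split text at the largest natural boundary that fits, recursively."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            # Greedily merge adjacent parts back together up to chunk_size.
            chunks, current = [], ""
            for part in parts:
                candidate = f"{current}{sep}{part}" if current else part
                if len(candidate) <= chunk_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            # Any piece still over the limit is re-split with finer separators.
            return [piece
                    for chunk in chunks
                    for piece in recursive_split(chunk, chunk_size, separators)]
    # No separator present at all: fall back to a hard character split.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Paragraph boundaries are tried first, so most chunks end at a double newline; the character-split fallback only fires on long unbroken runs of text.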
Cosine similarity measures the angle between two vectors, making it magnitude-invariant. This matters for text embeddings because document length varies dramatically: two paragraphs about the same topic point in similar directions in embedding space even if one is twice as long. Dot product is faster but sensitive to magnitude. Euclidean distance works well for normalized vectors but can be misleading for raw embeddings of different magnitudes.
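The magnitude-invariance point is easy to see directly with numpy. The vectors here are made-up toy values, not real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b: dot product of the unit vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v = np.array([1.0, 2.0, 3.0])
w = 2.0 * v  # same direction, twice the magnitude

print(cosine_similarity(v, w))  # 1.0 -- identical direction, magnitude ignored
print(float(np.dot(v, w)))      # 28.0 -- dot product scales with magnitude
print(float(np.linalg.norm(v - w)))  # ~3.74 -- Euclidean distance is nonzero
```

Scaling a vector leaves its cosine similarity to itself at exactly 1.0, while the dot product and Euclidean distance both shift with magnitude.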
The library supports both sentence-transformers (local, free, private) and OpenAI embeddings (higher quality, API cost). Local models like all-MiniLM-L6-v2 produce 384-dimensional embeddings with reasonable quality for most use cases. OpenAI's text-embedding-3-small produces 1536 dimensions with stronger semantic understanding. The choice depends on your privacy requirements, budget, and quality needs. For development and testing, local embeddings eliminate API dependencies entirely.
The embedding cache uses a content hash (text + model name) as the key, stored as a numpy .npz file on disk. This means re-ingesting unchanged documents skips the expensive embedding step entirely. For large document collections, this can reduce pipeline run time from minutes to seconds. The cache invalidates automatically when document content changes because the hash changes.
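A minimal sketch of how a cache like this can work. The function names and cache location are hypothetical, not rag-core's actual internals:

```python
import hashlib
from pathlib import Path

import numpy as np

CACHE_DIR = Path("./.embedding_cache")  # hypothetical on-disk location

def cache_key(text: str, model_name: str) -> str:
    """Content hash over text + model name; changing either invalidates the entry."""
    return hashlib.sha256(f"{model_name}:{text}".encode("utf-8")).hexdigest()

def get_or_embed(text: str, model_name: str, embed_fn) -> np.ndarray:
    """Return a cached embedding if present, otherwise compute and store it."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{cache_key(text, model_name)}.npz"
    if path.exists():  # unchanged content: skip the expensive embedding call
        return np.load(path)["embedding"]
    vec = embed_fn(text)
    np.savez(path, embedding=vec)
    return vec
```

Because the key is derived from the content itself, there is no explicit invalidation step: edited text hashes to a new key, and the stale entry is simply never read again.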
from rag_core import RAGPipeline
from rag_core.loaders import MarkdownLoader
from rag_core.embeddings import LocalEmbeddings
# Initialize pipeline with smart defaults
pipeline = RAGPipeline(
    embedding_provider=LocalEmbeddings(),
    chunk_strategy="recursive",
    chunk_size=512,
    chunk_overlap=50,
)
# Ingest documents
docs = MarkdownLoader.load_directory("./docs/")
pipeline.ingest(docs)
# Query
response = pipeline.query(
    "What are the key benefits of event-driven architecture?"
)
print(response.answer)
print(f"Sources: {response.sources}")
print(f"Confidence: {response.confidence_score}")

from rag_core.chunkers import FixedSizeChunker, SemanticChunker, RecursiveChunker

# Fixed-size: simple but can split mid-sentence
fixed = FixedSizeChunker(chunk_size=500, overlap=50)
# Semantic: splits at paragraph/header boundaries
semantic = SemanticChunker(max_chunk_size=800)
# Recursive (default): tries natural boundaries first
recursive = RecursiveChunker(
    chunk_size=512,
    overlap=50,
    separators=["\n\n", "\n", ". ", " "]
)
Clone the repo, run the tests, and try it with your own documents.