Interface

VechordRegistry

class vechord.registry.VechordRegistry(namespace, url)[source]

Create a registry for the given namespace and PostgreSQL URL.

Parameters:
  • namespace (str) – the namespace for this registry; it is used as the prefix for every registered table.

  • url (str) – the PostgreSQL URL to connect to.

register(tables, create_index=True)[source]

Register the given tables to the registry.

This will create the tables in the database if they do not already exist.

Parameters:
  • tables (list[type[Table]]) – a list of Table classes to be registered.

  • create_index (bool) – whether to create the index if it does not already exist.

set_pipeline(pipeline)[source]

Set the pipeline to be executed in the run method.

run(*args, **kwargs)[source]

Execute the pipeline in a transactional manner.

All args and kwargs are passed to the first function in the pipeline. The pipeline runs in a single transaction, and every inject-decorated function can only see the data inserted in this transaction (this guarantees that only newly inserted data is processed by the pipeline).

This will also return the final result of the last function in the pipeline.
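The calling convention described above can be pictured with a small in-memory sketch. This is not vechord's implementation: `run_pipeline`, `load` and `shout` are hypothetical names, no database or transaction is involved, and the sketch assumes that every step after the first is called without arguments and finds its own input (in vechord, through inject).

```python
from typing import Any, Callable


def run_pipeline(steps: list[Callable[..., Any]], *args: Any, **kwargs: Any) -> Any:
    """Call the first step with the caller's arguments, the rest with none.

    Assumption: later steps locate their own inputs, so only the first
    step receives ``args`` and ``kwargs``. The last step's result is returned.
    """
    if not steps:
        raise ValueError("the pipeline is empty")
    result = steps[0](*args, **kwargs)
    for step in steps[1:]:
        result = step()
    return result


# A plain list stands in for the rows inserted during this run.
new_rows: list[str] = []


def load(text: str) -> int:
    """First step: receives the arguments given to run_pipeline."""
    new_rows.extend(text.split())
    return len(new_rows)


def shout() -> list[str]:
    """Last step: reads what the earlier step stored and returns a result."""
    return [row.upper() for row in new_rows]


final = run_pipeline([load, shout], "hello vector search")
```

Here `final` holds the return value of `shout`, the last step, while the string argument only reaches `load`.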

select_by(obj, fields=None, limit=None)[source]

Retrieve the requested fields for the given object stored in the DB.

Parameters:
  • obj (Table) – the object to be retrieved. This should be a Table.partial_init() instance; the values it was given are used for filtering.

  • fields (Optional[Sequence[str]]) – the fields to be retrieved. If not set, all fields are retrieved.

  • limit (Optional[int]) – the maximum number of results to be returned. If not set, all results are returned.

Return type:

list[Table]
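The filtering semantics of select_by (match only on the values that were given, optionally project fields, optionally cap the result count) can be sketched over plain dictionaries. `select_by_sketch`, the `rows` data and the `given` argument are hypothetical stand-ins; the real method takes a partially initialised Table and queries PostgreSQL.

```python
from typing import Any, Optional, Sequence


def select_by_sketch(
    rows: list[dict[str, Any]],
    given: dict[str, Any],
    fields: Optional[Sequence[str]] = None,
    limit: Optional[int] = None,
) -> list[dict[str, Any]]:
    """Keep rows whose values equal every entry in ``given``."""
    matched = [
        row for row in rows
        if all(row.get(key) == value for key, value in given.items())
    ]
    if limit is not None:
        matched = matched[:limit]
    if fields is None:
        return matched
    # Project each matching row onto the requested fields only.
    return [{name: row[name] for name in fields} for row in matched]


rows = [
    {"uid": 1, "lang": "en", "text": "hello"},
    {"uid": 2, "lang": "fr", "text": "salut"},
    {"uid": 3, "lang": "en", "text": "bye"},
]
english_texts = select_by_sketch(rows, {"lang": "en"}, fields=["text"], limit=1)
```

Fields left out of `given` do not take part in the filter, which mirrors the role of the unset attributes of a partial instance.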

search_by_vector(cls, vec, topk=10, return_fields=None)[source]

Search the vector for the given Table class.

Parameters:
  • cls (type[Table]) – the Table class to be searched.

  • vec (ndarray) – the vector to be searched.

  • topk (int) – the number of results to be returned.

  • return_fields (Optional[Sequence[str]]) – the fields to be returned. If not set, all fields except the vector and keyword fields are returned.

Return type:

list[Table]
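What topk means can be shown with an exact, brute-force nearest-neighbour search. This is only an illustration: `topk_l2_sketch` is a hypothetical helper, it assumes L2 distance (the default listed for VectorIndex), and it scans every row, whereas the registry delegates the search to an index inside PostgreSQL.

```python
import heapq
import math


def topk_l2_sketch(query, rows, topk=10):
    """Return the ``topk`` (key, vector) rows closest to ``query`` in L2 distance."""
    return heapq.nsmallest(topk, rows, key=lambda row: math.dist(query, row[1]))


rows = [("a", [0.0, 0.0]), ("b", [1.0, 1.0]), ("c", [3.0, 3.0])]
nearest = topk_l2_sketch([1.0, 0.9], rows, topk=2)
nearest_keys = [key for key, _ in nearest]
```

The result is ordered from the closest row to the farthest, and contains fewer than topk rows only when the table itself is smaller.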

search_by_multivec(cls, multivec, topk=10, return_fields=None, max_maxsim_tuples=1000, probe=None)[source]

Search the multivec for the given Table class.

Parameters:
  • cls (type[Table]) – the Table class to be searched.

  • multivec (ndarray) – the multivec to be searched.

  • topk (int) – the number of results to be returned.

  • max_maxsim_tuples (int) – the maximum number of tuples to be considered for each vector in the multivec.

  • probe (Optional[int]) – TODO

  • return_fields (Optional[Sequence[str]]) – the fields to be returned. If not set, all fields except the vector and keyword fields are returned.

Return type:

list[Table]
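The parameter name max_maxsim_tuples suggests that multivec search ranks rows with the MaxSim (late interaction) operator. Under that assumption, the score can be sketched in pure Python: for each query vector take the best dot product against the document's vectors, then sum those maxima. `maxsim_sketch` and the sample vectors are hypothetical; the actual scoring happens in the database.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_sketch(query_vecs, doc_vecs):
    """Sum over query vectors of the maximum similarity to any document vector."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 1.0], [0.0, 0.9]]

score_a = maxsim_sketch(query, doc_a)
score_b = maxsim_sketch(query, doc_b)
ranked = sorted(["doc_a", "doc_b"], key={"doc_a": score_a, "doc_b": score_b}.get, reverse=True)
```

In this sketch `doc_a` ranks first because it has a good match for both query vectors, while `doc_b` matches only the second one.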

search_by_keyword(cls, keyword, topk=10, return_fields=None)[source]

Search the keyword for the given Table class.

Parameters:
  • cls (type[Table]) – the Table class to be searched.

  • keyword (str) – the keyword to be searched.

  • topk (int) – the number of results to be returned.

  • return_fields (Optional[Sequence[str]]) – the fields to be returned. If not set, all fields except the vector and keyword fields are returned.

Return type:

list[Table]

remove_by(obj)[source]

Remove the given object from the DB.

Parameters:

obj (Table) – the object to be removed. This should be a Table.partial_init() instance; the values it was given are used for filtering.

insert(obj)[source]

Insert the given object to the DB.

inject(input=None, output=None)[source]

Decorator to inject the data for the function arguments & return value.

Parameters:
  • input (Optional[type[Table]]) – the input table to be retrieved from the DB. If not set, the function will require the input to be passed in the function call.

  • output (Optional[type[Table]]) – the output table to store the return value. If not set, the return value will be returned to the caller in a list.
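The two parameters can be illustrated with an in-memory decorator. This is a sketch under stated assumptions, not the library's code: `inject_sketch` and `STORE` are hypothetical, tables are plain lists keyed by name, the keyword names `input_table` and `output_table` are used instead of `input` and `output` to avoid shadowing the built-in, and the decorated function is assumed to be called once per input row.

```python
from functools import wraps
from typing import Any, Optional

# A dict of lists stands in for the database tables.
STORE: dict[str, list[Any]] = {"Document": ["a b", "c d e"], "Chunk": []}


def inject_sketch(input_table: Optional[str] = None, output_table: Optional[str] = None):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if input_table is None:
                # No input table: the caller must pass the arguments.
                results = [func(*args, **kwargs)]
            else:
                # Input table set: call the function once per stored row.
                results = [func(row) for row in STORE[input_table]]
            if output_table is None:
                # No output table: hand the results back in a list.
                return results
            for result in results:
                rows = result if isinstance(result, list) else [result]
                STORE[output_table].extend(rows)
            return None
        return wrapper
    return decorator


@inject_sketch(input_table="Document", output_table="Chunk")
def split(doc: str) -> list[str]:
    return doc.split()


@inject_sketch(input_table="Chunk")
def lengths(chunk: str) -> int:
    return len(chunk)


split()                    # reads "Document", writes "Chunk"
chunk_lengths = lengths()  # reads "Chunk", returns a list
```

Neither call passes data explicitly: `split` stores its output for `lengths` to read, which is the pattern that lets decorated functions be chained in a pipeline.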

clear_storage(drop_table=False)[source]

Clear the storage of the registry.

Parameters:

drop_table (bool) – whether to drop the table after removing all the data.

Types

class vechord.spec.Vector(*args, **kwargs)[source]

Vector type with fixed dimension.

Accepts either an np.ndarray of dtype np.float32 or a list[float].

class vechord.spec.ForeignKey(*args, **kwargs)[source]

Reference to another table’s attribute as a foreign key.

This should be used in the Annotated[] type hint.

class vechord.spec.PrimaryKeyAutoIncrease[source]

Primary key with auto-increment ID type.

class vechord.spec.Keyword[source]

Keyword type for text search.

Accepts a str value, which is tokenized and converted to bm25vector in PostgreSQL.

classmethod with_model(model)[source]

TODO: test this

Return type:

Type

class vechord.spec.IndexColumn(name: str, index: BaseIndex)[source]

class vechord.spec.VectorIndex(distance=VectorDistance.L2, lists=1)[source]

class vechord.spec.MultiVectorIndex(lists=None)[source]

class vechord.spec.KeywordIndex(model='bert_base_uncased')[source]

class vechord.spec.Table[source]

Base class for table definition.

classmethod table_schema()[source]

Generate the table schema from the class attributes’ type hints.

Return type:

Sequence[tuple[str, str]]
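Deriving a schema from type hints can be sketched with the standard library. Everything here is an assumption made for illustration: `table_schema_sketch`, the `SQL_TYPES` mapping and the `Document` class are hypothetical, and the real method also has to handle vechord's own types such as Vector and Keyword.

```python
from typing import get_type_hints

# Hypothetical mapping from Python types to PostgreSQL column types.
SQL_TYPES = {
    int: "BIGINT",
    str: "TEXT",
    float: "DOUBLE PRECISION",
    bool: "BOOLEAN",
}


class Document:
    uid: int
    title: str
    score: float


def table_schema_sketch(cls) -> list[tuple[str, str]]:
    """Return (column name, SQL type) pairs in attribute definition order."""
    return [(name, SQL_TYPES[hint]) for name, hint in get_type_hints(cls).items()]


schema = table_schema_sketch(Document)
```

The resulting pairs have the same shape as the documented return type, Sequence[tuple[str, str]].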

classmethod vector_column()[source]

Get the vector column name.

Return type:

Optional[IndexColumn]

classmethod multivec_column()[source]

Get the multivec column name.

Return type:

Optional[IndexColumn]

classmethod keyword_column()[source]

Get the keyword column name.

Return type:

Optional[IndexColumn]

classmethod non_vec_columns()[source]

Get the column names that are not vector or keyword.

Return type:

Sequence[str]

classmethod keyword_tokenizer()[source]

Get the keyword tokenizer.

Return type:

Optional[str]

classmethod primary_key()[source]

Get the primary key column name.

Return type:

Optional[str]

todict()[source]

Convert the table instance to a dictionary.

This will ignore the default values.

Return type:

dict[str, Any]
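One way to picture "ignore the default values" is a conversion that drops every field still equal to its declared default. The sketch below uses a dataclass as a stand-in; `Chunk` and `todict_sketch` are hypothetical, and Table is not claimed to be a dataclass or to compare values this way.

```python
from dataclasses import MISSING, dataclass, fields
from typing import Any, Optional


@dataclass
class Chunk:
    text: str
    lang: str = "en"
    uid: Optional[int] = None


def todict_sketch(obj: Any) -> dict[str, Any]:
    """Convert a dataclass instance to a dict, skipping fields left at their default."""
    result = {}
    for field in fields(obj):
        value = getattr(obj, field.name)
        if field.default is not MISSING and value == field.default:
            continue
        result[field.name] = value
    return result


only_text = todict_sketch(Chunk("hi"))
with_lang = todict_sketch(Chunk("hi", lang="fr"))
```

Skipping defaults keeps values such as an unassigned auto-increment key out of the dictionary, so the database can fill them in.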

Augment

class vechord.augment.BaseAugmenter[source]

Bases: ABC

abstractmethod reset(doc)[source]

Cache the document for augmentation.

class vechord.augment.GeminiAugmenter(model='models/gemini-1.5-flash-001', ttl_sec=600)[source]

Bases: BaseAugmenter

Gemini Augmenter.

Context caching is only available for stable models with fixed versions. The minimum input size for caching is 32768 tokens.

reset(doc)[source]

Reset the document.

augment_context(chunks)[source]

Generate the contextual chunks.

Return type:

list[str]

augment_query(chunks)[source]

Generate the queries for chunks.

Return type:

list[str]

summarize_doc()[source]

Summarize the document.

Return type:

str

Chunk

class vechord.chunk.BaseChunker[source]

Bases: ABC

class vechord.chunk.RegexChunker(size=1536, overlap=200, separator='[\\n\\r\\f\\v\\t?!.;]{1,}', concat='. ')[source]

Bases: BaseChunker

A simple regex-based chunker.
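How the four parameters might interact can be sketched with the standard `re` module. This is a guess at the general technique rather than the class's actual algorithm: `regex_chunks_sketch` is hypothetical, the separator is assumed to be a character class of whitespace control characters and sentence punctuation, pieces are packed greedily up to `size` characters, and the overlap is taken as the last `overlap` characters of the previous chunk (so a chunk that starts with an overlap can exceed `size`).

```python
import re


def regex_chunks_sketch(text, size=1536, overlap=200,
                        separator=r"[\n\r\f\v\t?!.;]{1,}", concat=". "):
    """Split on ``separator`` and greedily pack the pieces into chunks."""
    pieces = [p.strip() for p in re.split(separator, text) if p.strip()]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + concat + piece
        if current and len(candidate) > size:
            chunks.append(current)
            # Carry the tail of the finished chunk into the next one.
            tail = current[-overlap:] if overlap > 0 else ""
            current = tail + concat + piece if tail else piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


plain = regex_chunks_sketch("One. Two! Three? Four", size=10, overlap=0)
overlapped = regex_chunks_sketch("One. Two! Three? Four", size=10, overlap=3)
```

With `overlap=0` the chunks partition the pieces; with `overlap=3` each later chunk begins with the last three characters of the one before it.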

class vechord.chunk.SpacyChunker(model='en_core_web_sm')[source]

Bases: BaseChunker

A semantic sentence chunker based on spaCy.

This guarantees the generated chunks are sentences.

class vechord.chunk.WordLlamaChunker(size=1536)[source]

Bases: BaseChunker

A semantic chunker based on WordLlama.

This doesn’t guarantee the generated chunks are sentences.

class vechord.chunk.GeminiChunker(model='gemini-2.0-flash', size=1536)[source]

Bases: BaseChunker

A semantic chunker based on Gemini.

Embedding

class vechord.embedding.VecType(*values)[source]

Bases: Enum

class vechord.embedding.BaseEmbedding[source]

Bases: ABC

class vechord.embedding.SpacyDenseEmbedding(model='en_core_web_sm', dim=96)[source]

Bases: BaseEmbedding

Spacy Dense Embedding.

class vechord.embedding.GeminiDenseEmbedding(model='models/text-embedding-004', dim=768)[source]

Bases: BaseEmbedding

Gemini Dense Embedding.

class vechord.embedding.OpenAIDenseEmbedding(model='text-embedding-3-large', dim=3072)[source]

Bases: BaseEmbedding

OpenAI Dense Embedding.

class vechord.embedding.SpladePPSparseEmbedding(url, dim=30522, timeout_sec=10)[source]

Bases: BaseEmbedding

Evaluate

class vechord.evaluate.BaseEvaluator[source]

Bases: ABC

evaluate(chunk_ids, retrieves, measures=('map', 'ndcg', 'recall'))[source]

Evaluate the retrieval results for multiple queries.

static evaluate_one(truth_id, resp_ids, measures=('map', 'ndcg', 'recall'))[source]

Evaluate the retrieval results for a single query.
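For a query with exactly one relevant chunk, the three default measures have simple closed forms: average precision is 1/rank, nDCG is 1/log2(rank + 1), and recall is 1 when the chunk is retrieved. The sketch below computes them directly. `evaluate_one_sketch` and `evaluate_sketch` are hypothetical, they assume a single ground-truth ID per query and no rank cutoff, and averaging across queries is an assumption about what the multi-query method reports.

```python
import math

MEASURES = ("map", "ndcg", "recall")


def evaluate_one_sketch(truth_id, resp_ids):
    """Score one ranked result list against a single relevant ID."""
    if truth_id not in resp_ids:
        return {measure: 0.0 for measure in MEASURES}
    rank = resp_ids.index(truth_id) + 1  # 1-based position of the hit
    return {
        "map": 1.0 / rank,
        "ndcg": 1.0 / math.log2(rank + 1),
        "recall": 1.0,
    }


def evaluate_sketch(truth_ids, retrieves):
    """Average the per-query scores over all queries."""
    per_query = [evaluate_one_sketch(t, r) for t, r in zip(truth_ids, retrieves)]
    return {m: sum(q[m] for q in per_query) / len(per_query) for m in MEASURES}


one = evaluate_one_sketch(7, [3, 7, 9])
overall = evaluate_sketch([7, 1], [[3, 7, 9], [4, 5]])
```

In `one`, the relevant chunk sits at rank 2, so the average precision is 0.5; in `overall`, the second query misses entirely and halves every measure.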

class vechord.evaluate.GeminiEvaluator(model='gemini-2.0-flash')[source]

Bases: BaseEvaluator

Evaluator using Gemini model to generate search queries.

Extract

class vechord.extract.BaseHTMLParser(*, convert_charrefs=Ellipsis)[source]

Bases: HTMLParser

A simple HTML parser to extract text content.

class vechord.extract.BaseExtractor[source]

Bases: ABC

class vechord.extract.SimpleExtractor[source]

Bases: BaseExtractor

Local extractor for text files.

extract_pdf(doc)[source]

Extract text from PDF using pypdfium2.

Return type:

str

extract_html(text)[source]

Extract text from HTML.

Parameters:

text (str) – HTML text.

Return type:

str
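Text extraction on top of the standard library's HTMLParser can be sketched as follows. `TextOnlyParser` and `extract_html_sketch` are hypothetical names, and skipping the contents of script and style elements is a design choice of this sketch, not documented behaviour of the extractor.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect visible text, ignoring script and style contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_html_sketch(text: str) -> str:
    parser = TextOnlyParser()
    parser.feed(text)
    parser.close()
    return "\n".join(parser.parts)


page = (
    "<html><head><style>p{}</style></head>"
    "<body><h1>Title</h1><p>Hello &amp; bye</p><script>x=1</script></body></html>"
)
extracted = extract_html_sketch(page)
```

Character references such as `&amp;` are decoded by the parser because `convert_charrefs` is enabled.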

class vechord.extract.GeminiExtractor(model='gemini-2.0-flash')[source]

Bases: SimpleExtractor

Extract text with Gemini model.

extract_pdf(doc)[source]

Extract text from PDF page by page.

Return type:

str

Load

class vechord.load.BaseLoader[source]

Bases: ABC

class vechord.load.LocalLoader(path, include=None)[source]

Bases: BaseLoader

Load documents from local file system.

class vechord.load.S3Loader(bucket, prefix, include=None)[source]

Bases: BaseLoader

Rerank

class vechord.rerank.BaseReranker[source]

Bases: ABC

abstractmethod rerank(query, chunks)[source]

Return the indices of the reranked chunks.

Return type:

list[str]

class vechord.rerank.CohereReranker(model='rerank-v3.5')[source]

Bases: BaseReranker

Rerank chunks using Cohere API (requires env COHERE_API_KEY).

rerank(query, chunks)[source]

Return the indices of the reranked chunks.

Return type:

list[int]

class vechord.rerank.ReciprocalRankFusion(k=60)[source]

Bases: object

Fuse chunks using reciprocal rank.
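Reciprocal rank fusion is commonly defined as scoring each item by the sum of 1 / (k + rank) over all the ranked lists it appears in, with rank starting at 1. A minimal sketch of that general technique follows; `rrf_sketch` is a hypothetical function, and the class's method names and input types are not shown in this reference.

```python
from collections import defaultdict


def rrf_sketch(rankings, k=60):
    """Fuse several ranked lists into one, best item first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=lambda item: scores[item], reverse=True)


vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "c", "a"]
fused = rrf_sketch([vector_hits, keyword_hits])
```

A larger `k` flattens the difference between neighbouring ranks, so no single list dominates the fused order.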

Service

vechord.service.create_web_app(registry)[source]

Create a Falcon WSGI application for the given registry.

This includes:

  • health check [GET](/)

  • tables [GET/POST/DELETE](/api/table/{table_name})

  • pipeline in a transaction [POST](/api/pipeline)

  • OpenAPI spec and Swagger UI [GET](/openapi/swagger)

Return type:

App