API Reference

VechordRegistry

class vechord.registry.VechordPipeline(client, steps)[source]

Set up the pipeline to run multiple functions in a transaction.

Parameters:
  • client (VechordClient) – the VechordClient to be used for the transaction.

  • steps (list[Callable]) – a list of functions to be run in the pipeline. The first function will be used to accept the input, and the last function will be used to return the output. The rest of the functions will be used to process the data in between. The functions will be run in the order they are defined in the list.

run(*args, **kwargs)[source]

Execute the pipeline in a transactional manner.

All the args and kwargs will be passed to the first function in the pipeline. The pipeline runs in a single transaction, and every function decorated with inject can only see the data inserted in this transaction (this guarantees that only the newly inserted data is processed by this pipeline).

This will also return the final result of the last function in the pipeline.

Return type:

Any
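
The ordering and all-or-nothing behaviour of run() can be illustrated with a small in-memory analogue. This is only a sketch and not vechord's implementation: ToyPipeline, its committed list, and the helper functions are hypothetical names. It also simplifies the data flow by handing each return value straight to the next step, whereas the real steps exchange data through tables inside one PostgreSQL transaction.

```python
from typing import Any, Callable


class ToyPipeline:
    """Hypothetical in-memory stand-in for a transactional pipeline."""

    def __init__(self, steps: list[Callable[..., Any]]):
        if not steps:
            raise ValueError("at least one step is required")
        self.steps = steps
        self.committed: list[Any] = []  # results that survived a full run

    def run(self, *args: Any, **kwargs: Any) -> Any:
        staged: list[Any] = []  # plays the role of the open transaction
        # The first step receives the caller's arguments.
        result = self.steps[0](*args, **kwargs)
        staged.append(result)
        # Every later step runs in list order on the previous result.
        for step in self.steps[1:]:
            result = step(result)
            staged.append(result)
        # Only reached when no step raised: "commit" the staged results.
        self.committed.extend(staged)
        return result


def load(text: str) -> list[str]:
    return text.split()


def upper(words: list[str]) -> list[str]:
    return [w.upper() for w in words]


pipeline = ToyPipeline([load, upper, len])
count = pipeline.run("vector search in postgres")  # -> 4
```

If any step raises, nothing is appended to committed, which mirrors the rollback of a failed transaction.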

class vechord.registry.VechordRegistry(namespace, url)[source]

Create a registry for the given namespace and PostgreSQL URL.

Parameters:
  • namespace (str) – the namespace for this registry, will be the prefix for all the tables registered.

  • url (str) – the PostgreSQL URL to connect to.

register(tables, create_index=True)[source]

Register the given tables to the registry.

This will create the tables in the database if they do not exist.

Parameters:
  • tables (list[type[Table]]) – a list of Table classes to be registered.

  • create_index (bool) – whether to create the index if it does not exist.

create_pipeline(steps)[source]

Create the VechordPipeline to run multiple functions in a transaction.

Parameters:

steps (list[Callable]) – a list of functions to be run in the pipeline.

Return type:

VechordPipeline

select_by(obj, fields=None, limit=None)[source]

Retrieve the requested fields for the given object stored in the DB.

Parameters:
  • obj (TypeVar(T, bound= Table)) – the object to be retrieved. This should be generated from Table.partial_init(); the given values will be used for filtering (= or is).

  • fields (Optional[Sequence[str]]) – the fields to be retrieved, if not set, all the fields will be retrieved.

  • limit (Optional[int]) – the maximum number of results to be returned, if not set, all the results will be returned.

Return type:

list[TypeVar(T, bound= Table)]

search_by_vector(cls, vec, topk=10, return_fields=None, probe=None)[source]

Search the vector for the given Table class.

Parameters:
  • cls (type[TypeVar(T, bound= Table)]) – the Table class to be searched.

  • vec (ndarray) – the vector to be searched.

  • topk (int) – the number of results to be returned.

  • return_fields (Optional[Sequence[str]]) – the fields to be returned, if not set, all the non-[vector,keyword] fields will be returned.

  • probe (Optional[int]) – how many K-means clusters to probe for the vec.

Return type:

list[TypeVar(T, bound= Table)]
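
The effect of probe can be illustrated with a toy inverted-file (IVF) search: vectors are grouped by their nearest K-means centroid, and a query only scans the probe closest groups. This is a pure-Python sketch under that assumption, not the index implementation used by the database; ivf_search, centroids, and lists are hypothetical names.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ivf_search(query, centroids, lists, topk=10, probe=1):
    # Rank clusters by centroid distance and keep the `probe` closest ones.
    order = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))
    candidates = [item for i in order[:probe] for item in lists[i]]
    # Exact ranking inside the probed clusters only.
    candidates.sort(key=lambda item: l2(query, item[1]))
    return [item_id for item_id, _ in candidates[:topk]]


centroids = [(0.0, 0.0), (10.0, 10.0)]
lists = [
    [("a", (0.1, 0.2)), ("b", (1.0, 1.0))],  # members of cluster 0
    [("c", (9.0, 9.5)), ("d", (5.2, 5.2))],  # members of cluster 1
]
query = (4.9, 4.9)

near_only = ivf_search(query, centroids, lists, topk=1, probe=1)  # -> ["b"]
both = ivf_search(query, centroids, lists, topk=1, probe=2)       # -> ["d"]
```

Probing more clusters finds the true nearest neighbour "d" here, at the cost of scanning more candidates.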

search_by_multivec(cls, multivec, topk=10, return_fields=None, maxsim_refine=1000, probe=None)[source]

Search the multivec for the given Table class.

Parameters:
  • cls (type[TypeVar(T, bound= Table)]) – the Table class to be searched.

  • multivec (ndarray) – the multivec to be searched.

  • topk (int) – the number of results to be returned.

  • maxsim_refine (int) – the maximum number of document vectors to be computed with full precision for each vector in the multivec. 0 means all the distances are computed with bit quantization.

  • probe (Optional[int]) – how many K-means clusters to probe for each vector in the multivec.

  • return_fields (Optional[Sequence[str]]) – the fields to be returned, if not set, all the non-[vector,keyword] fields will be returned.

Return type:

list[TypeVar(T, bound= Table)]
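
The maxsim_refine parameter name suggests late-interaction (MaxSim) scoring, where each query vector is matched against its best document vector and the per-vector maxima are summed. The following is a minimal pure-Python sketch of that scoring rule only, assuming dot-product similarity; it does not model quantization or the refine step, and maxsim and search are hypothetical helper names.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs, doc_vecs):
    """Sum over query vectors of the best similarity to any document vector."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def search(query_vecs, docs, topk=10):
    # Rank documents by descending MaxSim score.
    ranked = sorted(docs, key=lambda doc_id: maxsim(query_vecs, docs[doc_id]), reverse=True)
    return ranked[:topk]


query = [(1.0, 0.0), (0.0, 1.0)]
docs = {
    "x": [(1.0, 0.0), (0.0, 1.0)],  # score 1.0 + 1.0 = 2.0
    "y": [(1.0, 0.0), (0.5, 0.5)],  # score 1.0 + 0.5 = 1.5
    "z": [(0.0, 1.0)],              # score 0.0 + 1.0 = 1.0
}
ranked = search(query, docs, topk=3)  # -> ["x", "y", "z"]
```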

search_by_keyword(cls, keyword, topk=10, return_fields=None)[source]

Search the keyword for the given Table class.

Parameters:
  • cls (type[TypeVar(T, bound= Table)]) – the Table class to be searched.

  • keyword (str) – the keyword to be searched.

  • topk (int) – the number of results to be returned.

  • return_fields (Optional[Sequence[str]]) – the fields to be returned, if not set, all the non-[vector,keyword] fields will be returned.

Return type:

list[TypeVar(T, bound= Table)]

remove_by(obj)[source]

Remove the given object from the DB.

Parameters:

obj (Table) – the object to be removed. This should be a Table.partial_init() instance, so the given values will be used for filtering.

insert(obj)[source]

Insert the given object to the DB.

Parameters:

obj (Table) – the object to be inserted

copy_bulk(objs)[source]

Insert the given list of objects to the DB.

This is more efficient than calling insert for each object.

Parameters:

objs (list[Table]) – the list of objects to be inserted. They must all be of the same class and have the same fields filled in. The class should be a subclass of Table.

inject(input=None, output=None)[source]

Decorator to inject the data for the function arguments & return value.

Parameters:
  • input (Optional[type[Table]]) – the input table to be retrieved from the DB. If not set, the function will require the input to be passed in the function call.

  • output (Optional[type[Table]]) – the output table to store the return value. If not set, the return value will be returned to the caller in a list.
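
The input/output behaviour described above can be mimicked with an in-memory decorator. This sketch is an illustration only: STORE is a hypothetical dict standing in for the database, dataclasses stand in for Table classes, and matching function argument names to row attributes is an assumption about how rows are fed to the function.

```python
import functools
import inspect
from dataclasses import dataclass
from typing import Optional

STORE: dict[type, list] = {}  # hypothetical in-memory replacement for DB tables


def inject(input: Optional[type] = None, output: Optional[type] = None):
    """Toy decorator; `input` mirrors the documented parameter name."""

    def decorator(func):
        params = list(inspect.signature(func).parameters)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if input is None:
                # No input table: the caller must pass the arguments.
                results = [func(*args, **kwargs)]
            else:
                # One call per stored row, using the attributes named by the arguments.
                rows = list(STORE.get(input, []))
                results = [func(**{p: getattr(row, p) for p in params}) for row in rows]
            if output is None:
                return results  # returned to the caller in a list
            STORE.setdefault(output, []).extend(results)
            return None

        return wrapper

    return decorator


@dataclass
class Document:
    text: str


@dataclass
class Chunk:
    text: str
    size: int


@inject(output=Document)
def add_document(text: str) -> Document:
    return Document(text=text)


@inject(input=Document, output=Chunk)
def make_chunk(text: str) -> Chunk:
    return Chunk(text=text, size=len(text))


@inject(input=Chunk)
def chunk_sizes(size: int) -> int:
    return size


add_document("hello")
add_document("vechord")
make_chunk()
sizes = chunk_sizes()  # -> [5, 7]
```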

clear_storage(drop_table=False)[source]

Clear the storage of the registry.

Parameters:

drop_table (bool) – whether to drop the table after removing all the data.

VechordClient

class vechord.client.VechordClient(namespace, url)[source]

A PostgreSQL client to access the database.

Parameters:
  • namespace (str) – used as a prefix for the table name.

  • url (str) – the database connection URL. e.g. “postgresql://user:password@localhost:5432/dbname”

transaction()[source]

Create a transaction context manager (when there is no transaction).
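
One common way to realise "only when there is no transaction" is a context manager that opens a transaction at the outermost level and simply joins it when nested. The sketch below shows that pattern with the standard-library sqlite3 module instead of PostgreSQL; ToyClient and its _active flag are hypothetical and say nothing about how VechordClient is implemented.

```python
import contextlib
import sqlite3


class ToyClient:
    def __init__(self):
        # isolation_level=None disables implicit transactions, so BEGIN/COMMIT are explicit.
        self.conn = sqlite3.connect(":memory:", isolation_level=None)
        self.conn.execute("CREATE TABLE items (name TEXT)")
        self._active = False

    @contextlib.contextmanager
    def transaction(self):
        if self._active:
            # A transaction is already open: join it instead of nesting.
            yield self.conn
            return
        self._active = True
        self.conn.execute("BEGIN")
        try:
            yield self.conn
        except BaseException:
            self.conn.execute("ROLLBACK")
            raise
        else:
            self.conn.execute("COMMIT")
        finally:
            self._active = False

    def count(self):
        return self.conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]


client = ToyClient()
with client.transaction():
    client.conn.execute("INSERT INTO items VALUES ('kept')")
    with client.transaction():  # reuses the outer transaction
        client.conn.execute("INSERT INTO items VALUES ('also kept')")
kept = client.count()  # -> 2

try:
    with client.transaction():
        client.conn.execute("INSERT INTO items VALUES ('lost')")
        raise RuntimeError("boom")
except RuntimeError:
    pass
after_failure = client.count()  # -> 2, the failed insert was rolled back
```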

get_cursor()[source]

Get the current cursor or create a new one.

select(name, raw_columns, kvs=None, from_buffer=False, limit=None)[source]

Select from db table with optional key-value condition or from un-committed transaction buffer.

  • from_buffer: this ensures the select query only returns the rows that are inserted in the current transaction.

Types

class vechord.spec.DefaultDocument(*, uid: ~vechord.spec.PrimaryKeyUUID = <factory>, title: str = '', text: str, created_at: ~datetime.datetime = <factory>)[source]

Default Document table class.

class vechord.spec.ForeignKey(*args, **kwargs)[source]

Reference to another table’s attribute as a foreign key.

This should be used in the Annotated[] type hint.

class vechord.spec.Keyword[source]

Keyword type for text search. (wrap str)

Users can assign a str value; it will be tokenized and converted to bm25vector in PostgreSQL.

class vechord.spec.KeywordIndex(model='bert_base_uncased')[source]

class vechord.spec.MultiVectorIndex(lists=None)[source]

class vechord.spec.PrimaryKeyAutoIncrease[source]

Primary key with auto-increment ID type. (wrap int)

class vechord.spec.PrimaryKeyUUID(hex=None, bytes=None, bytes_le=None, fields=None, int=None, version=None, *, is_safe=SafeUUID.unknown)[source]

Primary key with UUID type. (wrap UUID)

This doesn’t come with auto-generation, because PostgreSQL doesn’t support UUID v7, while v4 is purely random and not sortable.

Choose this one over PrimaryKeyAutoIncrease when you need universal uniqueness.

We suggest using:

class MyTable(Table):
    uid: PrimaryKeyUUID = msgspec.field(default_factory=PrimaryKeyUUID.factory)

class vechord.spec.Table[source]

Base class for table definition.

classmethod table_schema()[source]

Generate the table schema from the class attributes’ type hints.

Return type:

Sequence[tuple[str, str]]

classmethod table_psql_types()[source]

Generate the corresponding PostgreSQL types for each column.

Return type:

Sequence[tuple[str, str]]

classmethod vector_column()[source]

Get the vector column name.

Return type:

Optional[IndexColumn[VectorIndex]]

classmethod multivec_column()[source]

Get the multivec column name.

Return type:

Optional[IndexColumn[MultiVectorIndex]]

classmethod keyword_column()[source]

Get the keyword column name.

Return type:

Optional[IndexColumn[KeywordIndex]]

classmethod non_vec_columns()[source]

Get the column names that are not vector or keyword.

Return type:

Sequence[str]

classmethod keyword_tokenizer()[source]

Get the keyword tokenizer.

Return type:

Optional[str]

classmethod primary_key()[source]

Get the primary key column name.

Return type:

Optional[str]

todict()[source]

Convert the table instance to a dictionary.

This will ignore values such as:

  • msgspec.UNSET

  • default value is None and the value is also None (mainly for PrimaryKeyAutoIncrease)

Return type:

dict[str, Any]

class vechord.spec.Vector(*args, **kwargs)[source]

Vector type with fixed dimension.

Users can assign an np.ndarray of dtype np.float32 or a list[float].

class vechord.spec.VectorIndex(distance=VectorDistance.L2, lists=None)[source]

vechord.spec.create_chunk_with_dim(dim)[source]

Create a chunk table class with a specific vector dimension.

This comes with vector and keyword column. It also has a foreign key to the DefaultDocument table. (If this is used, the DefaultDocument table must be registered too.)

Parameters:

dim (int) – vector dimension.

Return type:

Type[_DefaultChunk]

Augment

class vechord.augment.BaseAugmenter[source]

Bases: ABC

abstractmethod reset(doc)[source]

Cache the document for augmentation.

class vechord.augment.GeminiAugmenter(model='models/gemini-1.5-flash-001', ttl_sec=600)[source]

Bases: BaseAugmenter

Gemini Augmenter.

Context caching is only available for stable models with fixed versions. The minimum token count for a cache is 32768.

reset(doc)[source]

Reset the document.

augment_context(chunks)[source]

Generate the contextual chunks.

Return type:

list[str]

augment_query(chunks)[source]

Generate the queries for chunks.

Return type:

list[str]

summarize_doc()[source]

Summarize the document.

Return type:

str

Chunk

class vechord.chunk.BaseChunker[source]

Bases: ABC

class vechord.chunk.RegexChunker(size=1536, overlap=200, separator='[\\n\\r\\f\\v\\t?!.;]{1,}', concat='. ')[source]

Bases: BaseChunker

A simple regex-based chunker.
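
The constructor parameters suggest a split-then-merge strategy. The sketch below is one plausible reading of them, not the library's code: text is split on separator, pieces are joined with concat until size characters would be exceeded, and the last overlap characters of a finished chunk are carried into the next one. regex_chunks is a hypothetical name, and a chunk may exceed size when a single piece (plus its carried overlap) is longer than size.

```python
import re


def regex_chunks(text, size=1536, overlap=200,
                 separator=r"[\n\r\f\v\t?!.;]{1,}", concat=". "):
    # Split on the separator and drop empty pieces.
    pieces = [p.strip() for p in re.split(separator, text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + concat + piece
        if current and len(candidate) > size:
            # Current chunk is full: emit it and start a new one.
            chunks.append(current)
            tail = current[-overlap:] if overlap > 0 else ""
            current = tail + concat + piece if tail else piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "One. Two! Three? Four; Five"
chunks = regex_chunks(text, size=12, overlap=0)
# -> ["One. Two", "Three. Four", "Five"]
overlapped = regex_chunks(text, size=12, overlap=3)
```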

class vechord.chunk.SpacyChunker(model='en_core_web_sm')[source]

Bases: BaseChunker

A semantic sentence Chunker based on SpaCy.

This guarantees the generated chunks are sentences.

class vechord.chunk.WordLlamaChunker(size=1536)[source]

Bases: BaseChunker

A semantic chunker based on WordLlama.

This doesn’t guarantee the generated chunks are sentences.

class vechord.chunk.GeminiChunker(model='gemini-2.0-flash', size=1536)[source]

Bases: BaseChunker

A semantic chunker based on Gemini.

Embedding

class vechord.embedding.VecType(*values)[source]

Bases: Enum

class vechord.embedding.BaseEmbedding[source]

Bases: ABC

class vechord.embedding.SpacyDenseEmbedding(model='en_core_web_sm', dim=96)[source]

Bases: BaseEmbedding

Spacy Dense Embedding.

class vechord.embedding.GeminiDenseEmbedding(model='models/text-embedding-004', dim=768)[source]

Bases: BaseEmbedding

Gemini Dense Embedding.

class vechord.embedding.OpenAIDenseEmbedding(model='text-embedding-3-large', dim=3072)[source]

Bases: BaseEmbedding

OpenAI Dense Embedding.

class vechord.embedding.SpladePPSparseEmbedding(url, dim=30522, timeout_sec=10)[source]

Bases: object

Splade++ Sparse Embedding.

Evaluate

class vechord.evaluate.BaseEvaluator[source]

Bases: ABC

evaluate(chunk_ids, retrieves, measures=('map', 'ndcg', 'recall'))[source]

Evaluate the retrieval results for multiple queries.

static evaluate_one(truth_id, resp_ids, measures=('map', 'ndcg', 'recall'))[source]

Evaluate the retrieval results for a single query.
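
For a single query with exactly one relevant chunk, the three default measures reduce to simple functions of the rank at which that chunk appears. The sketch below assumes binary relevance, one ground-truth ID, and no rank cutoff; the library may compute these measures differently (for example through a dedicated evaluation package), so treat it as an illustration of the metrics rather than of evaluate_one itself.

```python
import math


def evaluate_one(truth_id, resp_ids, measures=("map", "ndcg", "recall")):
    # 1-based rank of the relevant chunk, or None when it was not retrieved.
    rank = resp_ids.index(truth_id) + 1 if truth_id in resp_ids else None
    scores = {
        # Average precision with one relevant item is the reciprocal rank.
        "map": 0.0 if rank is None else 1.0 / rank,
        # DCG is 1/log2(rank + 1); the ideal DCG is 1, so nDCG equals DCG.
        "ndcg": 0.0 if rank is None else 1.0 / math.log2(rank + 1),
        # Recall is 1 when the single relevant item was retrieved.
        "recall": 0.0 if rank is None else 1.0,
    }
    return {name: scores[name] for name in measures}


hit = evaluate_one(7, [3, 7, 9])  # relevant chunk found at rank 2
miss = evaluate_one(1, [2, 3])    # relevant chunk not retrieved
```

Here hit["map"] is 0.5 and hit["recall"] is 1.0, while every measure in miss is 0.0.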

class vechord.evaluate.GeminiEvaluator(model='gemini-2.0-flash')[source]

Bases: BaseEvaluator

Evaluator using Gemini model to generate search queries.

Extract

class vechord.extract.BaseHTMLParser[source]

Bases: HTMLParser

A simple HTML parser to extract text content.
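
Text extraction on top of the standard-library HTMLParser typically means collecting handle_data() calls while skipping non-content tags. The following is a minimal sketch of that approach; TextOnlyParser and its choice of skipped tags are assumptions and may differ from what BaseHTMLParser does.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    SKIP = {"script", "style"}  # tags whose content is not visible text

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped tag.
        if not self._skip_depth and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return "\n".join(self._parts)


parser = TextOnlyParser()
parser.feed(
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script></body></html>"
)
parser.close()
extracted = parser.text()  # -> "Title\nHello\nworld"
```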

class vechord.extract.BaseExtractor[source]

Bases: ABC

class vechord.extract.SimpleExtractor[source]

Bases: BaseExtractor

Local extractor for text files.

extract_pdf(doc)[source]

Extract text from PDF using pypdfium2.

Return type:

str

extract_html(text)[source]

Extract text from HTML.

Parameters:

text (str) – HTML text.

Return type:

str

class vechord.extract.GeminiExtractor(model='gemini-2.0-flash')[source]

Bases: SimpleExtractor

Extract text with Gemini model.

extract_pdf(doc)[source]

Extract text from PDF page by page.

Return type:

str

Load

class vechord.load.BaseLoader[source]

Bases: ABC

class vechord.load.LocalLoader(path, include=None)[source]

Bases: BaseLoader

Load documents from local file system.

class vechord.load.S3Loader(bucket, prefix, include=None)[source]

Bases: BaseLoader

Rerank

class vechord.rerank.BaseReranker[source]

Bases: ABC

abstractmethod rerank(query, chunks)[source]

Return the indices of the reranked chunks.

Return type:

list[int]

class vechord.rerank.CohereReranker(model='rerank-v3.5')[source]

Bases: BaseReranker

Rerank chunks using Cohere API (requires env COHERE_API_KEY).

rerank(query, chunks)[source]

Return the indices of the reranked chunks.

Return type:

list[int]

class vechord.rerank.ReciprocalRankFusion(k=60)[source]

Bases: object

Fuse chunks using reciprocal rank.
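
Reciprocal rank fusion scores each item by summing 1 / (k + rank) over every ranked list it appears in, with ranks starting at 1. The function below is a standalone sketch of that formula; the name reciprocal_rank_fusion and its list-of-lists input are assumptions, not the method signature of this class.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            # Items near the top of any list contribute the most.
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=lambda item: scores[item], reverse=True)


vector_hits = ["c1", "c2", "c3"]
keyword_hits = ["c3", "c1", "c4"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# -> ["c1", "c3", "c2", "c4"]
```

Items that appear in both lists ("c1" and "c3") outrank items found by only one retriever.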

Service

vechord.service.create_web_app(registry, pipeline)[source]

Create a Falcon WSGI application for the given registry.

This includes:

  • health check [GET](/)

  • tables [GET/POST/DELETE](/api/table/{table_name})

  • pipeline in a transaction [POST](/api/pipeline)

  • OpenAPI spec and Swagger UI [GET](/openapi/swagger)

Return type:

App