Guide¶
Define the table¶
Inherit from the Table class and define the columns as attributes with type hints. Advanced configuration can be done with typing.Annotated.
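To illustrate how type hints and typing.Annotated can carry column configuration, here is a minimal standard-library sketch. The names IndexHint, Row and describe_columns are hypothetical stand-ins invented for this example; this is not how vechord itself is implemented, only a picture of the general mechanism.

```python
from dataclasses import dataclass
from typing import Annotated, get_args, get_origin, get_type_hints


@dataclass(frozen=True)
class IndexHint:
    """Hypothetical metadata object standing in for a library option."""

    lists: int


class Row:
    """Hypothetical table-like class: each annotated attribute is a column."""

    uid: int
    vec: Annotated[list[float], IndexHint(lists=128)]
    text: str


def describe_columns(cls: type) -> dict[str, tuple[object, tuple]]:
    """Map each column name to (base type, Annotated metadata)."""
    columns = {}
    # include_extras=True keeps the Annotated wrapper instead of stripping it.
    for name, hint in get_type_hints(cls, include_extras=True).items():
        if get_origin(hint) is Annotated:
            base, *meta = get_args(hint)
            columns[name] = (base, tuple(meta))
        else:
            columns[name] = (hint, ())
    return columns


columns = describe_columns(Row)
```

Here columns["vec"] holds both the base type and the IndexHint, which is the kind of information a table definition can use to decide on column types and indexes.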
Choose a primary key¶
PrimaryKeyAutoIncrease: generate an auto-incrementing integer as the primary key
PrimaryKeyUUID: use uuid7 as the primary key, suitable for distributed systems or general purposes
int or str: insert the key manually
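As background on why uuid7 suits distributed systems: UUID version 7 (RFC 9562) puts a 48-bit Unix millisecond timestamp in the most significant bits, so keys sort roughly by creation time without a central counter. The sketch below builds such a value by hand with the standard library; uuid7_sketch is a hypothetical helper for illustration, not the generator that PrimaryKeyUUID uses.

```python
import os
import time
import uuid


def uuid7_sketch() -> uuid.UUID:
    """Build a UUIDv7-style value: timestamp, version, variant, random bits."""
    ts_ms = time.time_ns() // 1_000_000          # 48-bit Unix time in ms
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits, 74 are used
    value = (ts_ms & ((1 << 48) - 1)) << 80       # bits 127..80: timestamp
    value |= 0x7 << 76                            # bits 79..76: version 7
    value |= ((rand >> 62) & 0xFFF) << 64         # bits 75..64: rand_a
    value |= 0b10 << 62                           # bits 63..62: RFC variant
    value |= rand & ((1 << 62) - 1)               # bits 61..0: rand_b
    return uuid.UUID(int=value)


key = uuid7_sketch()
```

Because the timestamp leads, keys generated later compare greater than keys generated in an earlier millisecond, which keeps inserts close together in a primary key index.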
Vector and Keyword search¶
Vector: define a vector column with dimensions. It is recommended to define an alias such as DenseVector = Vector[768] and use it in all tables. This accepts list[float] or numpy.ndarray as the input. For now, only the f32 type is supported. For multivector, use list[DenseVector] as the type hint.
Keyword: define a keyword column whose str value will be tokenized and stored as the bm25vector type. This accepts str as the input.
Configure the Index¶
The default index is suitable for small datasets (fewer than 100k rows). For larger datasets, you can customize the index configuration by using typing.Annotated with:
VectorIndex: configure the lists and distance operators.
MultiVectorIndex: configure the lists.
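from typing import Annotated

import msgspec
from vechord.spec import PrimaryKeyUUID, Table, Vector, VectorIndex

# reuse one vector alias across all tables
DenseVector = Vector[768]

class MyTable(Table, kw_only=True):
    uid: PrimaryKeyUUID = msgspec.field(default_factory=PrimaryKeyUUID.factory)
    vec: Annotated[DenseVector, VectorIndex(lists=128)]
    text: str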
Tip
If you need a customized tokenizer, refer to the VectorChord-bm25 documentation.
Use the foreign key to link tables¶
By default, the foreign key will add REFERENCES ON DELETE CASCADE.
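from uuid import UUID
from vechord.spec import ForeignKey

class SubTable(Table, kw_only=True):
    uid: PrimaryKeyUUID = msgspec.field(default_factory=PrimaryKeyUUID.factory)
    text: str
    # rows here are deleted when the referenced MyTable row is deleted
    mytable_uid: Annotated[UUID, ForeignKey[MyTable.uid]]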
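To make the ON DELETE CASCADE behavior concrete, here is a small self-contained sketch. It uses SQLite from the Python standard library instead of PostgreSQL so that it runs without a server, and the table and column names are only illustrative; the cascade semantics shown are the standard SQL ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE mytable (uid TEXT PRIMARY KEY, text TEXT)")
conn.execute(
    "CREATE TABLE subtable ("
    " uid TEXT PRIMARY KEY,"
    " text TEXT,"
    " mytable_uid TEXT REFERENCES mytable (uid) ON DELETE CASCADE)"
)

conn.execute("INSERT INTO mytable VALUES ('a', 'parent')")
conn.executemany(
    "INSERT INTO subtable VALUES (?, ?, 'a')",
    [("s1", "child 1"), ("s2", "child 2")],
)

before = conn.execute("SELECT COUNT(*) FROM subtable").fetchone()[0]
# Deleting the parent row also removes every child row that references it.
conn.execute("DELETE FROM mytable WHERE uid = 'a'")
after = conn.execute("SELECT COUNT(*) FROM subtable").fetchone()[0]
```

In other words, deleting a parent row leaves no orphaned rows in the linked table.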
JSONB¶
To store a JSONB column, define it like this:
from psycopg.types.json import Jsonb

class MyJsonTable(Table, kw_only=True):
    uid: PrimaryKeyUUID = msgspec.field(default_factory=PrimaryKeyUUID.factory)
    json: JSONB

item = MyJsonTable(json=Jsonb({"key": "value"}))
Inject with decorator¶
The inject() decorator can be used to load the function arguments from the database and dump the return values to the database. To use this decorator, you need to specify at least one of input or output with a table class you have defined.
input=Type[Table]: loads the specified columns from the database and injects the data into the decorated function's arguments. If input=None, the arguments must be passed to the function manually.
output=Type[Table]: dumps the return values to the database (the return type must also be annotated with the provided table class or a list of the table class). If output=None, you can get the return value from the function call.
The following example uses the pre-defined tables:
from uuid import UUID

import httpx

from vechord.registry import VechordRegistry
from vechord.extract import SimpleExtractor
from vechord.embedding import GeminiDenseEmbedding
from vechord.spec import DefaultDocument, create_chunk_with_dim

DefaultChunk = create_chunk_with_dim(768)
vr = VechordRegistry(namespace="test", url="postgresql://postgres:postgres@127.0.0.1:5432/")
vr.register([DefaultDocument, DefaultChunk])
extractor = SimpleExtractor()
emb = GeminiDenseEmbedding()

# output only: the url is passed manually, the returned document is stored
@vr.inject(output=DefaultDocument)
def add_document(url: str) -> DefaultDocument:
    with httpx.Client() as client:
        resp = client.get(url)
    text = extractor.extract_html(resp.text)
    return DefaultDocument(title=url, text=text)

# input and output: uid and text are loaded from the stored documents
@vr.inject(input=DefaultDocument, output=DefaultChunk)
def add_chunk(uid: UUID, text: str) -> list[DefaultChunk]:
    chunks = text.split("\n")
    return [DefaultChunk(doc_id=uid, vec=emb.vectorize_chunk(t), text=t) for t in chunks]

for url in ["https://paulgraham.com/best.html", "https://paulgraham.com/read.html"]:
    add_document(url)
# called without arguments: they are injected from the DefaultDocument table
add_chunk()
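The load and dump behavior described above can be pictured with a toy, in-memory decorator. Everything in this sketch (ToyRegistry, Doc, Chunk, and the choice to return None when an output table is set) is a hypothetical simplification written for illustration; it is not vechord's implementation and it does not touch a database.

```python
import inspect
from dataclasses import dataclass
from functools import wraps
from typing import Callable, Optional


@dataclass
class Doc:
    uid: int
    text: str


@dataclass
class Chunk:
    doc_id: int
    text: str


class ToyRegistry:
    """In-memory stand-in for a registry: one list of rows per table class."""

    def __init__(self) -> None:
        self.rows: dict[type, list] = {}

    def inject(self, input: Optional[type] = None, output: Optional[type] = None) -> Callable:
        def decorator(func: Callable) -> Callable:
            names = list(inspect.signature(func).parameters)

            @wraps(func)
            def wrapper(*args, **kwargs):
                if input is None:
                    # No input table: the caller passes the arguments manually.
                    results = [func(*args, **kwargs)]
                else:
                    # Input table: call once per stored row, filling each
                    # argument from the row attribute with the same name.
                    results = [
                        func(**{n: getattr(row, n) for n in names})
                        for row in list(self.rows.get(input, []))
                    ]
                if output is None:
                    # No output table: hand the return value(s) to the caller.
                    return results[0] if input is None else results
                for res in results:
                    items = res if isinstance(res, list) else [res]
                    self.rows.setdefault(output, []).extend(items)
                return None

            return wrapper

        return decorator


reg = ToyRegistry()


@reg.inject(output=Doc)
def add_doc(uid: int, text: str) -> Doc:
    return Doc(uid=uid, text=text)


@reg.inject(input=Doc, output=Chunk)
def add_chunks(uid: int, text: str) -> list[Chunk]:
    return [Chunk(doc_id=uid, text=line) for line in text.split("\n")]


add_doc(1, "first line\nsecond line")
add_chunks()  # arguments come from the stored Doc rows
```

After these two calls, reg.rows[Chunk] holds one chunk per line of the stored document, mirroring how add_chunk() in the example above is called without arguments.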
Select/Insert/Delete¶
We also provide some functions to select, insert and delete the data from the database.
docs = vr.select_by(DefaultDocument.partial_init())
vr.insert(DefaultDocument(text="hello world"))
vr.copy_bulk([DefaultDocument(text="hello world"), DefaultDocument(text="hello vector")])
vr.remove_by(DefaultDocument.partial_init())
Transaction¶
Use the VechordPipeline to run multiple functions in a transaction.
This also guarantees that the decorated functions will only load the data from the current transaction instead of the whole table. So users can focus on the data processing part.
pipeline = vr.create_pipeline([add_document, add_chunk])
pipeline.run("https://paulgraham.com/best.html")
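As a rough picture of what running several steps in one transaction buys you, the sketch below uses SQLite from the standard library rather than PostgreSQL. The functions and tables are hypothetical, and it only illustrates the all-or-nothing part; it does not model how VechordPipeline restricts loading to the current transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (uid INTEGER PRIMARY KEY, text TEXT)")
conn.execute("CREATE TABLE chunk (doc_id INTEGER, text TEXT)")
conn.commit()


def add_document(text: str) -> int:
    cur = conn.execute("INSERT INTO document (text) VALUES (?)", (text,))
    return cur.lastrowid


def add_chunks(doc_id: int, text: str) -> None:
    if not text.strip():
        raise ValueError("empty document")
    conn.executemany(
        "INSERT INTO chunk VALUES (?, ?)",
        [(doc_id, line) for line in text.split("\n")],
    )


def run_pipeline(text: str) -> None:
    # The connection context manager commits if the block succeeds
    # and rolls back every statement in it if any step raises.
    with conn:
        doc_id = add_document(text)
        add_chunks(doc_id, text)


run_pipeline("a\nb")
try:
    run_pipeline("   ")  # the second step fails, so the document insert is undone
except ValueError:
    pass

doc_count = conn.execute("SELECT COUNT(*) FROM document").fetchone()[0]
chunk_count = conn.execute("SELECT COUNT(*) FROM chunk").fetchone()[0]
```

Only the first run leaves rows behind: the failed run does not leave a document without chunks.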
Search¶
We provide a search interface for different types of queries:
vr.search_by_vector(DefaultChunk, emb.vectorize_query("hey"), topk=10)
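Conceptually, a vector search with topk returns the rows whose vectors are closest to the query vector. The sketch below does this by brute force in plain Python, using L2 distance chosen only for illustration; search_topk and the sample rows are hypothetical, and an indexed search in the database is approximate and far faster than this exact scan.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search_topk(rows: list[tuple[str, list[float]]], query: list[float], topk: int = 10) -> list[str]:
    """Return the texts of the topk rows closest to the query vector."""
    ranked = sorted(rows, key=lambda row: l2_distance(row[1], query))
    return [text for text, _ in ranked[:topk]]


rows = [
    ("north", [0.0, 1.0]),
    ("east", [1.0, 0.0]),
    ("north-east", [0.7, 0.7]),
]
result = search_topk(rows, [0.9, 0.1], topk=2)
```

With this query, "east" is the nearest row and "north-east" the second nearest, so those two texts are returned in that order.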
Access the cursor¶
If you need to change some settings or use the cursor directly:
vr.client.get_cursor().execute("SET vchordrq.probes = 100;")