- `embed_openai()`: Embeds text via the OpenAI API; a helper for vector-store pipelines. If called without `x`, it returns a closure that can be passed directly to `insert_vectors(embed_fun = ...)`.
- `create_vectorstore()`: Initializes a DuckDB database connection for storing embedded documents, with optional support for the experimental `vss` extension.
- `insert_vectors()`: Chunks long text rows, generates embeddings when needed, and inserts `(page_content, embedding)` rows into the `vectors` table.
- `build_vector_index()`: Builds HNSW (`vss`) and/or full-text (`fts`) indexes on the `vectors` table.
- `search_vectors()`: Embeds `query_text`, computes vector distances against stored embeddings, and returns the nearest matches.
Usage
embed_openai(
x,
model = "text-embedding-ada-002",
base_url = "https://api.openai.com/v1",
api_key = Sys.getenv("OPENAI_API_KEY"),
batch_size = 20L,
embedding_dim = 1536
)
create_vectorstore(
db_path = ":memory:",
overwrite = FALSE,
embedding_dim = 1536,
load_vss = identical(Sys.getenv("_R_CHECK_PACKAGE_NAME_"), "")
)
insert_vectors(
con,
df,
embed_fun = embed_openai(),
chunk_chars = 12000,
embedding_dim = 1536
)
build_vector_index(store, type = c("vss", "fts"))
search_vectors(
con,
query_text,
top_k = 5,
embed_fun = embed_openai(),
embedding_dim = 1536
)

Arguments
- x
Character vector of texts, or a data frame with a `page_content` column.
- model
OpenAI embedding model name.
- base_url
Base URL for an OpenAI-compatible API.
- api_key
API key; defaults to `Sys.getenv("OPENAI_API_KEY")`.
- batch_size
Batch size for embedding requests.
- embedding_dim
Integer; the dimensionality of the vector embeddings to store.
- db_path
Path to the DuckDB file. Use `":memory:"` to create an in-memory database.
- overwrite
Logical; if `TRUE`, deletes any existing DuckDB file or table.
- load_vss
Logical; whether to load the experimental `vss` extension. This defaults to `TRUE`, but is forced to `FALSE` during CRAN checks.
- con
Active DuckDB DBI connection.
- df
Data frame containing `page_content` (or `content`) text.
- embed_fun
Function used to convert text into numeric embeddings.
- chunk_chars
Approximate maximum chunk size in characters before a row's text is split.
- store
Active DuckDB DBI connection or vector-store handle.
- type
Index types to build; any of `"vss"` and/or `"fts"`.
- query_text
Query text to embed and search.
- top_k
Number of nearest matches to return.
Value

`embed_openai()`: for character input, a numeric matrix of embeddings; for data-frame input, the same data frame with an added `embedding` column. If `x` is missing, a configured embedding function is returned.

`create_vectorstore()`: a live DuckDB connection object. Be sure to disconnect manually with:

DBI::dbDisconnect(con, shutdown = TRUE)
Details
These functions form the vector-store utilities for:
- Embedding text via the OpenAI API
- Storing and chunking documents in DuckDB
- Building HNSW (`vss`) and full-text (`fts`) indexes
- Running nearest-neighbour search over vector embeddings
Core helpers like `embed_openai()`, `insert_vectors()`, `build_vector_index()`, and `search_vectors()` are also exported to support composable workflows.
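The `embed_fun` contract is what makes these helpers composable: any function that maps a character vector to a numeric matrix (one row per input text) can stand in for `embed_openai()`. A minimal sketch, assuming the package is attached; `dummy_embed()` is a hypothetical stand-in that returns random vectors purely for illustration:

```r
# Hypothetical stand-in embedder: maps texts to random numeric vectors of the
# expected dimensionality. For illustration only -- the vectors are meaningless,
# but they satisfy the "character vector in, numeric matrix out" contract.
dummy_embed <- function(texts, embedding_dim = 1536L) {
  matrix(rnorm(length(texts) * embedding_dim),
         nrow = length(texts), ncol = embedding_dim)
}

# Works anywhere embed_openai() would, without network access or an API key
con <- create_vectorstore(":memory:", overwrite = TRUE)
insert_vectors(
  con,
  data.frame(page_content = c("first document", "second document")),
  embed_fun = dummy_embed
)
DBI::dbDisconnect(con, shutdown = TRUE)
```

This pattern is also handy in unit tests, where deterministic or offline embedders avoid hitting the OpenAI API.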
Examples
if (FALSE) { # \dontrun{
# Create vector store
con <- create_vectorstore("tests/testthat/test-data/my_vectors.duckdb", overwrite = TRUE)
# Assume response is output from fetch_data()
docs <- data.frame(head(response))
# Insert documents with embeddings
insert_vectors(
con = con,
df = docs,
embed_fun = embed_openai(),
chunk_chars = 12000
)
# Build vector + FTS indexes
build_vector_index(con, type = c("vss", "fts"))
# Perform vector search
hits <- search_vectors(con, query_text = "Tell me about R?", top_k = 5)
} # }
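As noted in the description, calling `embed_openai()` without `x` returns a configured closure. A hedged sketch, assuming `OPENAI_API_KEY` is set and network access is available:

```r
if (FALSE) { # \dontrun{
# Configure the embedder once, then reuse it across calls
embed_fun <- embed_openai(model = "text-embedding-ada-002", batch_size = 10L)

# The closure embeds character vectors on demand,
# returning a numeric matrix with one row per input text
emb <- embed_fun(c("hello", "world"))

# ...and plugs directly into insert_vectors() and search_vectors()
insert_vectors(con, docs, embed_fun = embed_fun)
search_vectors(con, "greetings", embed_fun = embed_fun)
} # }
```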