Skip to main content

Datastores

API for storing, indexing and retrieving document corpora, powered by Elasticsearch and FAISS.

Uploading Documents

If you want to upload your collection of documents to UKP-SQuARE, you need to create an empty Datastore first and then upload your documents. This automatically creates a BM25 index for your documents. You can do this with simple REST API calls. Please follow this notebook tutorial for more details: https://colab.research.google.com/drive/1YZkrDOSaJVxrphTx-M22bpKZLHSAZs99?usp=sharing.

In summary, the Datastore API provides several methods for uploading documents. Documents are expected to be uploaded as .jsonl files, i.e. one JSON object per line.
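To illustrate the .jsonl format, the following sketch writes and reads back a small file whose fields match the id/title/text schema used below; the document values themselves are made up:

```python
import json

# Illustrative documents following the id/title/text schema; the values are made up.
docs = [
    {"id": 1, "title": "Example A", "text": "First example document."},
    {"id": 2, "title": "Example B", "text": "Second example document."},
]

# Write one JSON object per line -- this is the .jsonl format.
with open("docs.jsonl", "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

# Reading it back line by line recovers the original documents.
with open("docs.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["title"])  # -> Example A
```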

We first create a demo datastore to upload some documents to:

curl -X 'PUT' \
  'http://localhost:7000/datastores/demo' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '[
    {
      "name": "id",
      "type": "long"
    },
    {
      "name": "title",
      "type": "text"
    },
    {
      "name": "text",
      "type": "text"
    }
  ]'

Some example documents adhering to the required format can be found at https://github.com/UKP-SQuARE/square-core/blob/master/datastore-api/tests/fixtures/0.jsonl. We can upload these documents to the Datastore API as follows:

curl -X 'POST' \
  'http://localhost:7000/datastores/demo/upload' \
  -H 'Authorization: abcdefg' \
  -F 'file=@tests/fixtures/0.jsonl'
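For larger corpora it can help to split the .jsonl file into smaller batches and upload them one by one. A minimal sketch of such a batching helper (the helper itself is hypothetical, not part of the Datastore API):

```python
import json

def split_jsonl(lines, batch_size):
    """Split an iterable of .jsonl lines into batches of at most batch_size lines."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Five illustrative documents, split into batches of two.
lines = [json.dumps({"id": i, "title": f"Doc {i}", "text": "..."}) for i in range(5)]
batches = list(split_jsonl(lines, 2))
print([len(b) for b in batches])  # -> [2, 2, 1]
```

Each batch could then be written to a temporary file and POSTed to the upload endpoint shown above.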

Configure dense retrieval with FAISS

To enable dense document retrieval with FAISS, the Datastore API relies on FAISS web service containers that provide FAISS indices for the documents in a datastore. Each index in each datastore corresponds to one FAISS web service container. The document embedding computation and FAISS index creation are performed offline, i.e. not via the Datastore API itself.
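Conceptually, dense retrieval ranks documents by the similarity between a query embedding and precomputed document embeddings. A minimal, self-contained sketch of inner-product search (what a FAISS `IndexFlatIP` computes), using toy 4-dimensional vectors in place of real 768-dimensional DPR embeddings:

```python
import numpy as np

# Toy document embeddings (in practice: 768-d DPR vectors stored in FAISS).
doc_embeddings = np.array([
    [0.1, 0.9, 0.0, 0.0],
    [0.8, 0.1, 0.1, 0.0],
    [0.0, 0.0, 0.9, 0.1],
], dtype=np.float32)

# Toy query embedding (in practice: output of the query encoder model).
query = np.array([0.9, 0.0, 0.1, 0.0], dtype=np.float32)

# Inner-product search: score every document, take the highest-scoring one.
scores = doc_embeddings @ query
best = int(np.argmax(scores))
print(best)  # -> 1
```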

Let's see how to add a dense retrieval index to an existing datastore ("wiki"). The new index should use Facebook's DPR model and should be called "dpr".

  1. Embed the document corpus using the document encoder model & create a FAISS index in the correct format. Refer to https://github.com/kwang2049/faiss-instant for more on this.

  2. Register the new index with its name via the Datastore API:

    curl -X 'PUT' \
      'http://localhost:7000/datastores/wiki/indices/dpr' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
        "doc_encoder_model": "facebook/dpr-ctx_encoder-single-nq-base",
        "query_encoder_model": "facebook/dpr-question_encoder-single-nq-base",
        "embedding_size": 768
      }'
  3. Specify the FAISS web service container for the new index: open docker-compose.yml and, in the section for FAISS service containers, add the following:

    faiss-wiki-dpr:
      image: kwang2049/faiss-instant:latest
      volumes:
        - /local/path/to/index:/opt/faiss-instant/resources
      labels:
        - "traefik.enable=true"
        - "traefik.http.services.faiss-wiki-dpr.loadbalancer.server.port=5000"
        - "square.datastore=/wiki/dpr"
  4. Restart the Docker Compose setup:

    docker compose up -d
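The `square.datastore` label above is what ties a FAISS container to its datastore and index. A hypothetical helper that parses such a label into its parts (illustrative only; the actual routing is done by Traefik based on the container labels):

```python
def parse_datastore_label(label):
    """Parse a 'square.datastore=/<datastore>/<index>' label into its two parts.

    Illustrative helper only -- not part of the actual codebase.
    """
    key, _, value = label.partition("=")
    if key != "square.datastore":
        raise ValueError(f"not a datastore label: {label!r}")
    _, datastore, index = value.split("/")
    return datastore, index

print(parse_datastore_label("square.datastore=/wiki/dpr"))  # -> ('wiki', 'dpr')
```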

Local Deployment (Advanced users)

You can deploy the Datastore service on your own machine. The Datastore API depends on the following services:

  • Required (automatically via Docker):
    • Elasticsearch (for storing documents and sparse retrieval)
    • Traefik (for routing search requests)
  • Optional (manual setup required):
    • FAISS web service containers (for storing dense document embeddings): see the section on dense retrieval for how to set these up.
    • SQuARE Model API (for dense document retrieval)

Quick (production) setup

  1. Open the docker-compose.yml. Find the service declaration for datastore_api and uncomment it. In the environment section, optionally set an API key and the connection to the Model API.

  2. Run the Docker setup:

    docker compose up -d

    Check http://localhost:7000/docs for interactive documentation.

  3. Upload documents.

  4. For dense retrieval, configure a FAISS container per datastore index.

Development setup

Requirements

  • Python 3.7+
  • Docker
  • Make (optional)

Python requirements via pip (ideally with virtualenv):

pip install -r requirements.txt

... or via conda:

conda env create -f environment.yml

Docker containers

We use Docker containers for:

  • Elasticsearch
  • Traefik

Additionally, the FAISS storage for each datastore index requires its own container. Check the FAISS configuration section for more.

Everything can be started via Docker Compose:

docker compose up -d

And torn down again after use:

docker compose down

API server

Configuration: Before starting the server, a few configuration options can be set via environment variables or a .env file. See .env for an example configuration and app/core/config.py for all available options.
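The .env file holds simple KEY=VALUE pairs. The following sketch parses such a file (a tiny subset of what the python-dotenv library does); the variable names in the example are illustrative only — the actual option names live in app/core/config.py:

```python
def load_env_file(text):
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config

# Illustrative .env content; the real option names are in app/core/config.py.
example = """
# Datastore API settings (names are examples only)
API_KEY=abcdefg
MODEL_API_URL=http://localhost:8000
"""
cfg = load_env_file(example)
print(cfg["API_KEY"])  # -> abcdefg
```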

Running:

make run

By default, the server will run at port 7000.

Check http://localhost:7000/docs for interactive documentation. See below for uploading documents and embeddings.

Tests

Run integration tests:

make test

Run API tests (these do not require the dependency services):

make test-api