Set stopwords in a Weaviate collection
tl;dr: Here’s the solution: update the collection configuration with
collection.config.update(
    # Note, use Reconfigure here (not Configure)
    inverted_index_config=wvc.Reconfigure.inverted_index(
        stopwords_additions=["le", "la", "il", "elle"]
    )
)
and check with
collection.config.get()
Specify stopwords in a Weaviate collection with the Python V4 API
I work on a French corpus composed of Molière's plays and use Weaviate as a vector database to store the embeddings. Weaviate also handles the vectorization of the text, provided you specify a vectorizer.
An important element when embedding text is of course the presence of stopwords. This is particularly important when dealing with long chunks of text, where stopwords tend to obfuscate the meaning, the signal contained in that text.
Weaviate recently released V4 of its Python API, and the documentation is not yet finished (as of 12/2023).
In terms of stopwords, the default set is the predefined English one. There is no set available for French, Spanish or any other non-English language.
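To make the effect concrete, here is a minimal plain-Python sketch of stopword filtering on a French sentence. It is only an illustration: the tiny FRENCH_STOPWORDS set and the strip_stopwords helper are hypothetical, and the regex tokenization is an assumption that does not reproduce Weaviate's own tokenizer.

```python
import re

# Hypothetical, deliberately tiny stopword set, for illustration only.
FRENCH_STOPWORDS = {"le", "la", "les", "il", "elle", "de", "et", "que", "qui", "un", "une"}


def strip_stopwords(text, stopwords):
    """Lowercase the text, split it into word tokens, and drop the stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [tok for tok in tokens if tok not in stopwords]


line = "Il faut manger pour vivre et non pas vivre pour manger"
content_words = strip_stopwords(line, FRENCH_STOPWORDS)
print(content_words)
```

With this small set only "il" and "et" are dropped; a fuller French list would also remove words such as "pour" and "pas".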
The way to define a specific list of stopwords is to set it explicitly with the stopwords_additions
parameter when creating the collection with the Python V4 API. The Python V3 API documentation on collection / class creation is quite exhaustive.
So the other day I found myself stuck trying to add that list of French stopwords, unable to find the related documentation.
I found the solution after a good night’s sleep.
Creating a collection in Weaviate goes as follows:
collection = client.collections.create(
    name=<collection_name>,
    vectorizer_config=vectorizer,
    generative_config=wvc.Configure.Generative.openai(),
    properties=[
        <list of properties>
    ]
)
where vectorizer can be, for example, OpenAI's ada-002 model:
import weaviate.classes as wvc
vectorizer = wvc.Configure.Vectorizer.text2vec_openai(vectorize_class_name=False)
I'm using a local install of Weaviate, so the client is instantiated with
import weaviate
client = weaviate.connect_to_local(
    port=8080,
    grpc_port=50051,
    headers={<specify your keys to OpenAI, Huggingface etc>}
)
Specify French Stopwords
To specify stopwords you need to use the wvc.Configure.inverted_index(**params)
function with the right parameters.
All the values below are the defaults, except for the stopwords_additions
key:
# list_stopwords is your own list of French stopwords (a list of strings)
params = {
    "bm25_b": 0.75,
    "bm25_k1": 1.2,
    "cleanup_interval_seconds": 60,
    "index_timestamps": False,
    "index_property_length": False,
    "index_null_state": False,
    "stopwords_preset": None,
    "stopwords_additions": list_stopwords,
    "stopwords_removals": None,
}
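The params above rely on a list_stopwords variable. As a minimal sketch of one way to prepare it, the hypothetical build_stopword_list helper below lowercases, trims and de-duplicates a raw word list. Whether Weaviate needs this normalization is an assumption on my part; it simply keeps the list tidy before it is passed as stopwords_additions.

```python
def build_stopword_list(raw_words):
    """Return lowercased, stripped, de-duplicated words, keeping first-seen order."""
    seen = set()
    cleaned = []
    for word in raw_words:
        normalized = word.strip().lower()
        # Skip empty strings and words already collected.
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


# Raw input with mixed case, stray whitespace and a duplicate.
raw = ["Il", "elle", "je", "tu ", "nous", "vous", "ils", "elles", "il"]
list_stopwords = build_stopword_list(raw)
print(list_stopwords)
```

The resulting list can then be used as the "stopwords_additions" value in params.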
Then create your collection, setting inverted_index_config = wvc.Configure.inverted_index(**params):
collection = client.collections.create(
    name=<collection_name>,
    vectorizer_config=vectorizer,
    generative_config=wvc.Configure.Generative.openai(),
    inverted_index_config=wvc.Configure.inverted_index(**params),
    properties=[
        <list of properties>
    ]
)
You can also omit the inverted_index_config
parameter when creating the collection and update the config later with
collection.config.update(
    # Note, use Reconfigure here (not Configure)
    inverted_index_config=wvc.Reconfigure.inverted_index(
        stopwords_additions=["le", "la", "il", "elle"]
    )
)
To check that the config has been updated, you can retrieve and inspect the collection config with
collection.config.get()
In my case, with a small list of French stopwords this returns
_CollectionConfig(name='Test', ...,
    inverted_index_config=_InvertedIndexConfig(
        bm25=_BM25Config(b=0.75, k1=1.2),
        ...
        stopwords=_StopwordsConfig(
            preset=<StopwordsPreset.EN: 'en'>,
            additions=['il', 'elle', 'je', 'tu', 'nous', 'vous', 'ils', 'elles'],
            removals=None
        )),
    <some other info>
)
Notice that although I specified "stopwords_preset": None
in the params, the stopwords preset is still set to <StopwordsPreset.EN: 'en'>.
I could not find a way to set the preset to None, although the documentation lists it as a valid value. So I expect that both the predefined set of English stopwords and the added list of French stopwords will be removed from the text to embed.
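Rather than reading the printed config by eye, the check can be scripted. The sketch below assumes the returned config exposes inverted_index_config.stopwords.additions as attributes, which is inferred from the printed output above and not verified against the client. The missing_additions helper is hypothetical, and a SimpleNamespace stand-in replaces the real config so the snippet runs without a Weaviate instance.

```python
from types import SimpleNamespace


def missing_additions(config, expected):
    """Return the expected stopwords that are absent from the config's additions."""
    # Attribute path assumed from the printed config shown above.
    additions = config.inverted_index_config.stopwords.additions or []
    return [word for word in expected if word not in additions]


# Stand-in object mimicking the shape of the printed config (not a real Weaviate object).
fake_config = SimpleNamespace(
    inverted_index_config=SimpleNamespace(
        stopwords=SimpleNamespace(preset="en", additions=["il", "elle"], removals=None)
    )
)

missing = missing_additions(fake_config, ["il", "elle", "nous"])
print(missing)
```

With a live collection, the stand-in would be replaced by the object returned when fetching the collection config; an empty result means every expected stopword was registered.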