RAG is a simple technique to explore a large corpus of documents that the LLM does not know about. But RAG makes a bold assumption: the answer to the query is present in the matching chunks of text. This is not always the case when answers are spread out across the entire text.
Another issue with RAG is the difficulty of evaluating its performance. Multiple elements of the pipeline have a significant impact on the relevance of the answers to our queries. As in the standard experiment loop in machine learning, we need to be able to evaluate a RAG pipeline with a single, simple score.
In this post, we add some steps to the RAG pipeline to tackle both the assumption that answers are contained within a narrow perimeter of the text and the lack of an evaluation method. We use Langchain and Weaviate.
RAG stands for Retrieval Augmented Generation, a technique used to question and explore large collections of documents by feeding relevant extracts of those documents to an LLM.
Given a large corpus composed of hundreds or thousands of text files, we want to explore the corpus in natural language. We have a question Q.
A standard RAG process goes as follows.
The embedding / preparation part: split the documents into chunks, compute an embedding for each chunk, and store chunks and embeddings in a vector database.
The retrieval part, given a query: embed the query and retrieve the chunks whose embeddings are closest to the query embedding.
And finally the generative part: build a prompt that contains the retrieved chunks as context plus the query, and ask the generative model to answer.
Note that the process requires two models: an embedding one and a generative one.
In short, we use a short extract of one of the documents in the corpus as context within the generative prompt to help the generative model answer the question.
For example, asked "Who is Harpagon's son?", the retrieval step returns a paragraph summarizing the play, and the generative model answers "Cléante" based on that paragraph.
We therefore make the assumption that the answer to the query is contained in the set of matching chunks given to the generative prompt as context. However, that is not always the case.
Call it the coherence assumption: the information needed to answer the query sits in one coherent, cohesive span of text.
Consider, for instance, a report on a public debate from some institution. The debate happened very recently and the generative model cannot know its content. The available report comes down to a 50-page PDF file, too long to be given as a whole to the generative prompt.
Now, consider the question: who are the contributors to the debate?
A simple query that can be solved with a simple python script. If the document is properly structured, finding the answer does not even require any NER (named entity recognition) task. A simple regex could work.
The RAG system will find a set of matching chunks or paragraphs from the pdf report. And assuming the prompt is efficient, the output answer will consist of the contributors mentioned in that set of chunks, obviously missing the contributors that happen to be mentioned in other parts of the report.
Now consider more abstract queries related to ideas, arguments or opinions that can be spread out over multiple chunks. The problem becomes more critical.
The power of machine learning comes from improving the model through a tight, fast try-and-score loop. The data scientist establishes a baseline, then experiments with different settings, models or data and scores each experiment. Iteration is central to the process. Having a single score to evaluate the performance of the model is key for this process of fast iterations and improvements to take place.
In the context of LLMs, with RAG or other applications, evaluating the relevance of the output answer is not so straightforward. A manual evaluation is time consuming and possibly biased.
To remedy this problem and to be able to implement concise experiment loops, we need to add an extra step to the RAG pipeline: generate a baseline set of question and answer pairs directly from the chunks with an LLM, embed these pairs and compute the distance between each question and its answer.
This gives us a testing set on which we can evaluate any RAG setup and experiment.
We can expand on this benchmarking step with:
The assumption made here is that the distance, the score, between the Q and A embeddings is a good proxy for the relevance of the RAG-generated answer to a question over that corpus. This is, after all, the main feature of embeddings: their ability to capture meaning.
Consider a corpus composed of texts on Molière, his life and some of his plays. The texts are in French and related to the play L'Avare, or The Miser in English.
The corpus is composed of 5 documents obtained from these urls
We used trafilatura to download the text version of these pages and edited each file so that paragraphs are separated by a double line return “\n\n”.
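As an illustration, the download step can be sketched as follows. This is a minimal sketch with a hypothetical url list; trafilatura exposes fetch_url and extract for exactly this purpose.

``` python
import trafilatura

# hypothetical list of pages on Molière and l'Avare
urls = ["https://example.org/moliere-l-avare-analyse"]

for i, url in enumerate(urls):
    downloaded = trafilatura.fetch_url(url)   # raw HTML
    text = trafilatura.extract(downloaded)    # main text content, boilerplate removed
    with open(f"corpus/doc_{i:02d}.txt", "w", encoding="utf-8") as f:
        f.write(text)
```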
Although chunking seems like a simple task, it comes in many flavors. Langchain implements several text splitting methods: tiktoken via OpenAI, NLTKTextSplitter, a GPT2TokenizerFast via Huggingface and a Recursive Text Splitter. https://api.python.langchain.com/en/stable/text_splitter/langchain.text_splitter.CharacterTextSplitter.html#langchain.text_splitter.CharacterTextSplitter.from_huggingface_tokenizer
The respective behavior of these text splitters, and in particular the impact of the chunk_size and chunk_overlap parameters, can be difficult to grasp.
Here’s a good Stack Overflow explanation on that topic.
And here is a streamlit based demo app that compares some of the different chunking methods.
In the end, we found that the separator parameter set by default to “\n\n” is what matters most in how the chunks are produced.
To establish the baseline Q&A, we need chunks with enough information for the LLM to be able to generate questions and associated answers.
We choose the tiktoken splitter with chunks of 600 tokens, overlapping by 100 tokens and we set the separator to the default value, a double line return “\n\n”.
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=600,
    chunk_overlap=100,
    separator="\n\n",
)
This gives us between 5 and 25 chunks for the documents.
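For reference, here is a minimal sketch of how these chunk counts can be obtained with the splitter defined above (the corpus folder name is hypothetical):

``` python
from pathlib import Path

for path in sorted(Path("corpus").glob("*.txt")):
    text = path.read_text(encoding="utf-8")
    # text_splitter is the CharacterTextSplitter configured above
    chunks = text_splitter.split_text(text)
    print(path.name, len(chunks))
```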
For each chunk, we ask the LLM to write a question and answer pair based on the text. We could directly use the langchain QAGenerationChain class, but the langchain prompts are in English and the chain can be unstable when trying to produce valid JSON. So we'll implement our own chain.
We take inspiration from the langchain prompts for QA generation and use the following French one instead:
template = """
Tu es un professeur de français au lycée.
Tu dois écrire une paire de question réponse pour une interrogation écrite afin de tester les connaissances de tes élèves.
Ta réponse doit suivre le format JSON suivant
```
{{
    "question": "$LA_QUESTION_ICI",
    "réponse": "$LA_REPONSE_ICI"
}}
```
Tout ce qui est entre les ``` doit être du JSON valide.
Propose une paire question/réponse, dans le format JSON spécifié, pour le texte suivant :
----------------
{text}
"""
The prompt follows the usual format
- role
- task description
- output format
- and actual task
The script generates a question/answer pair for a given chunk.
``` python
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain, SequentialChain

llm_model = "gpt-4-1106-preview"

prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(temperature=0.9, model=llm_model)

chain = LLMChain(llm=llm, prompt=prompt, output_key="response", verbose=False)
overall_chain = SequentialChain(
    chains=[chain],
    input_variables=["text"],
    output_variables=["response"],
    verbose=True,
)

# chunk holds the text of one chunk produced by the splitter
response = overall_chain({"text": chunk})
print(response["response"])
```
The output is satisfying.
Here’s a chunk (translated into English via Deepl.com):
In the first place, the moralist dramatist evokes the prodigality of the son, who of course reacts to his father's penny-pinching.
But in his rejection of his father's excesses, Cléante also indulges in excess.
Molière pokes fun at young people's propensity to follow expensive fashions and live in style.
Not without a touch of humor, Molière places in his father's mouth a reproach of precious style:
"you give furiously in the marquis".
It's true that wigs and ribbons are overpriced by the merchants who take advantage of the windfall.
And the generated Q&A is:
{
"question": "According to the text, what vice does Molière criticize in the son Cléante in addition to the father's avarice?",
"answer": "Molière criticizes Cléante's profligacy. He mocks his tendency to follow expensive fashions and lead an expensive lifestyle."
}
Good job GPT4 !
We still need to score these pairs.
Since we’re working on a French corpus, we have a decision to make: either use an English-centric embedding model such as OpenAI’s ada-002, or a French-flavored one such as CamemBERT or FlauBERT.
For the sake of simplicity of implementation, we will use OpenAI’s ada-002. A prior manual comparison of ada-002 vs CamemBERT on the retrieval phase for Molière’s plays did not show a significant advantage for the French-based model. For now, the important thing is the consistency of the process; we will evaluate the embedding model in a future experiment. In fact, being able to evaluate the embedding model is what triggered this work in the first place.
Let’s use the weaviate vector store and database to score the Q&A.
We will embed each baseline question and each baseline answer and store them in the vector database, along with the original document names as meta information.
When searching the vector db for the answer matching each question, weaviate returns not only the answer from the original pair but also a matching score.
Weaviate offers 3 scorings: score, distance and certainty, see this page. By default the distance is the cosine distance.
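Once the collections described below are populated, a near_text query returns that distance directly. Here is a minimal sketch with the v4 Python client; the exact import path of MetadataQuery has moved between client versions, and the query string is simply one of the baseline questions.

``` python
import weaviate
import weaviate.classes as wvc

client = weaviate.connect_to_local(
    port=8080, grpc_port=50051,
    headers={"X-OpenAI-Api-Key": "<your key>"},
)
answers = client.collections.get("Answer")

# closest answer to a baseline question, with the cosine distance
result = answers.query.near_text(
    query="Quel vice Molière critique-t-il chez Cléante ?",
    limit=1,
    return_metadata=wvc.query.MetadataQuery(distance=True),
)
best = result.objects[0]
print(best.properties["answer"], best.metadata.distance)
```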
The next step is to create the Question and Answer collections. In the Answer collection, the answer property is vectorized, whereas in the Question collection, the question property is vectorized.
Answer collection
Question collection
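A minimal sketch of how these two collections might be declared with the v4 client. The property names follow the columns described above, and the keyword names (skip_vectorization in particular) may differ slightly between client versions.

``` python
import weaviate
import weaviate.classes as wvc

client = weaviate.connect_to_local(port=8080, grpc_port=50051,
                                   headers={"X-OpenAI-Api-Key": "<your key>"})

for name, text_property in [("Question", "question"), ("Answer", "answer")]:
    client.collections.create(
        name=name,
        vectorizer_config=wvc.Configure.Vectorizer.text2vec_openai(),
        properties=[
            # only the question (resp. answer) text is vectorized
            wvc.Property(name=text_property, data_type=wvc.DataType.TEXT),
            # document name kept as meta information, not vectorized
            wvc.Property(name="document", data_type=wvc.DataType.TEXT,
                         skip_vectorization=True),
        ],
    )
```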
Note: the optimality of the baseline is unknown at this stage. Since the questions and answers are derived directly from the same chunk, the distance should be minimal. But since we’re dealing with LLMs, we may find a way to get better Q&A matches with a different setup. Also note that the optimal, lowest-distance answer to a question would be the question itself. pffff mind blown!
Since we took large enough chunks, each chunk may contain more information than what is addressed by the question. So the embeddings should be further away from the question than if the chunk only contained the expected information. So it’s possible that the score of (Q, A) given c is not optimal (minimal distance between Q and A).
I always feel that putting a problem into equations clears up things. But feel free to skip that paragraph and go to the implementation part.
Consider a corpus split into chunks by a text splitter \(C\), an embedding model \(E\), a generative LLM \(M\) and a prompt \(P\) used to generate the baseline question/answer pairs.
From the set of chunks \(\{c\}_C\), we derive, for each chunk \(c\): a question \(Q\) and its answer \(A\) generated by \(M\) with prompt \(P\), their embeddings \(E(Q)\) and \(E(A)\), and the distance \(d(E(Q), E(A))\).
The baseline is the set composed of the questions, the answers, their embeddings and the distances: \(\{ (c, Q, A, E(Q), E(A), d(E(Q), E(A))) \}\).
Consider now a RAG pipeline that we want to evaluate. The RAG pipeline has a text splitter \(C'\), embedding model \(E'\), LLM \(M'\) and prompt \(P'\) for the generative part.
A RAG pipeline is fully defined by the combination of these elements \(C', E', M', P'\).
The RAG process can then be written as the following sequence of steps.
Retrieval step: for a genuine user query \(Q\), compute its embedding \(E'(Q)\) and retrieve the chunk \(c'\) whose embedding is closest to \(E'(Q)\).
Generation step: build the prompt \(P'\) from the retrieved chunk \(c'\) and the query \(Q\), and let the model \(M'\) generate the answer \(A'\).
By applying the RAG process to our set of baseline questions \(\{Q\}\), we obtain the set of distances between the questions \(\{Q\}\) and the generated answers according to the embedding model \(E'\).
For each \(Q\) from the set of baseline question / answer pairs, the pipeline produces an answer \(A'\) and the distance \(d(E'(Q), E'(A'))\).
The experiment results consist of the set \(\{ (c', Q, A', E'(Q), E'(A'), d(E'(Q), E'(A'))) \}\).
We can therefore evaluate the performance of the RAG pipeline \(C', E', M', P'\) by comparing the baseline and the experiment distances :
\[S_b = \{ d(E(Q), E(A)) \}\] vs \[S_e = \{ d(E'(Q), E'(A')) \}\] where \(S_b\) and \(S_e\) are the sets of scores obtained for the baseline and for the experimental / evaluated RAG.
Now 3 things can happen:
\(S_b \approx S_e\): the experiment generates answers that are as close to the questions as the answers directly generated from the text. This can be interpreted as a good result.
\(S_b > S_e\): the experimental RAG pipeline finds answers to the original questions that are closer than the answers derived directly from the corpus. This also indicates a good result.
\(S_b < S_e\): the baseline scores are lower than the experiment RAG scores. The experiment answers are not as close to the questions as the baseline answers. More work is required to improve the performance of the RAG pipeline.
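In practice, comparing the two sets of distances can be as simple as comparing their means (or running a paired test). A small sketch, where baseline_distances and experiment_distances are the lists of d() values collected above:

``` python
import numpy as np

s_b = np.mean(baseline_distances)    # d(E(Q), E(A)) over the baseline pairs
s_e = np.mean(experiment_distances)  # d(E'(Q), E'(A')) over the same questions

if s_e <= s_b:
    print(f"RAG pipeline on par with or better than the baseline ({s_e:.3f} <= {s_b:.3f})")
else:
    print(f"RAG pipeline needs more work ({s_e:.3f} > {s_b:.3f})")
```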
Let’s take that for a spin
We already have a baseline set of Q&A. Now we need embeddings and distances.
We’re going to use weaviate as a vector store. Weaviate provides both the storage of texts and vectors and the vectorization itself through built-in vectorizer modules, as well as fast similarity search.
A collection is the equivalent of a table in a standard SQL database.
A collection is composed of properties (columns). You can specify which properties are vectorized or not. Properties that are vectorized are concatenated together before the vector is produced. See Configure semantic indexing for details.
We will create 2 collections, one for the answers from the baseline Q&A and one for the answers from the experiment.
We use a local installation of the datastore. See How to install Weaviate
Stay tuned for the rest of the implementation of the pipeline. This is a work in progress and some details need to be ironed out before I can publish the code.
Just to recap: the main idea behind this evaluation method is to create a baseline of Q&As in order to score a given RAG pipeline composed of 4 elements: chunkizer, vectorizer, generative model and prompt. This gives us a way to optimize the pipeline by changing these elements and monitoring the pipeline score.
However, all this rests on another assumption, a huge one in fact. The underlying assumption is that the distance between embeddings of the baseline questions and the different answers produced by the RAG pipelines measures the relevance of the answer with regard to the question.
This works in simple cases where the answer is short and very specific.
But when the answers are more complex, the distance will depend heavily on the words shared between the question and the answer.
For instance, take the question "Why is Paris a good tourist destination?" and the answer "Paris is a good tourist destination because of its many restaurants and museums". Most of the words in the question are also in the answer, and the resulting distance will be lower than for this answer:
- why is Paris a good tourist destination
- The city offers a great selection of restaurants and museums
The RAG method also assumes that the retrieved chunk is pertinent to the question. It is the responsibility of the retrieval phase to find the most relevant chunk of text with regard to the question.
But when the retrieved chunk brings the wrong information, the generated answer will not be correct.
Since retrieval is highly dependent on the wording of the question, nothing ensures that the retrieved chunk works as expected.
For instance, consider a debate on a hot topic where chunks can either reflect a position P or its contrary N.
Depending on the words in the question, the retrieved chunks will lean towards one side, P, or the other, N, and the generated answer will reflect the bias in the chunks.
This happened a few times in my experiment when the chunks were the summaries of the different acts of the play and the question was about the reasons behind the behavior of one of the characters in a given act.
“In act I, why does Harpagon quarrel with his son?”
As the plot evolved between acts, Harpagon kept quarreling with his son but for different reasons, so depending on which act summary was retrieved, the answer ended up different.
Finally, depending on its wording, a wrong answer sometimes ended up with a better score than a more relevant one.
On the positive side, the method makes it possible to estimate the impact of using the extra information as context within the answer-generating prompt. All other things being equal, the answer generated by the prompt with the chunk had a lower distance to the question than the answer generated without the chunk.
tl;dr: Here’s the solution: update the collection configuration with
collection.config.update(
    # Note, use Reconfigure here (not Configure)
    inverted_index_config=wvc.Reconfigure.inverted_index(
        stopwords_additions=["le", "la", "il", "elle"]
    )
)
and check with
client.collection.get()
I work on a French corpus composed of Moliere’s plays and use weaviate as a vector database to store the embeddings. Weaviate also does the vectorization of the text provided you specify a vectorizer.
An important element when embedding text is of course the presence of stopwords. This is particularly important when dealing with long chunks of text where stopwords tend to obfuscate the meaning, the signal, contained in that text.
Weaviate has recently released a V4 of its Python API, and the documentation is not yet finished (as of 12/2023).
In terms of stopwords the default set is predefined to the English language. There is no set available for French, Spanish or any other non-English language.
The way to define a specific list of stopwords is to set it explicitly with the stopwords_additions parameter when creating the collection with the Python V4 API. The Python V3 API documentation on collection / class creation is quite exhaustive.
So the other day I found myself stuck trying to add that list of French stopwords and not finding the related documentation.
I found the solution after a good night’s sleep.
Creating a collection in Weaviate goes as follows
collection = client.collections.create(
    name=<collection_name>,
    vectorizer_config=vectorizer,
    generative_config=wvc.Configure.Generative.openai(),
    properties=[
        <list of properties>
    ]
)
where the vectorizer can be OpenAI’s ada-002 model
import weaviate.classes as wvc
vectorizer = wvc.Configure.Vectorizer.text2vec_openai(vectorize_class_name = False)
I’m using a local install of weaviate so the client is instantiated with
import weaviate
client = weaviate.connect_to_local(port=8080, grpc_port=50051,
headers={ <specify your keys to OpenAI, Huggingface etc > }
)
To specify stopwords you need to use the wvc.Configure.inverted_index(**params) function with the right parameters.
All the values below are the default ones except for the stopwords_additions key.
params = {
    "bm25_b": 0.75,
    "bm25_k1": 1.2,
    "cleanup_interval_seconds": 60,
    "index_timestamps": False,
    "index_property_length": False,
    "index_null_state": False,
    "stopwords_preset": None,
    "stopwords_additions": list_stopwords,
    "stopwords_removals": None,
}
Then create your collection, setting inverted_index_config=wvc.Configure.inverted_index(**params):
collection = client.collections.create(
    name=<collection_name>,
    vectorizer_config=vectorizer,
    generative_config=wvc.Configure.Generative.openai(),
    inverted_index_config=wvc.Configure.inverted_index(**params),
    properties=[
        <list of properties>
    ]
)
You can also skip the inverted_index_config parameter when creating the collection and update the config later with
collection.config.update(
    # Note, use Reconfigure here (not Configure)
    inverted_index_config=wvc.Reconfigure.inverted_index(
        stopwords_additions=["le", "la", "il", "elle"]
    )
)
To check that the config has been updated, you can list and check the collection config with
client.collection.get()
In my case, with a small list of French stopwords this returns
_CollectionConfig(name='Test', ...,
inverted_index_config=_InvertedIndexConfig(
bm25=_BM25Config(b=0.75, k1=1.2),
...
stopwords=_StopwordsConfig(
preset=<StopwordsPreset.EN: 'en'>,
additions=['il', 'elle', 'je', 'tu', 'nous', 'vous', 'ils', 'elles'],
removals=None
)),
<some other info>
)
Notice that although I specified "stopwords_preset": None in the params, the stopwords preset has still been set to <StopwordsPreset.EN: 'en'>. I could not find a way to set the preset to None, although this is a valid value according to the documentation.
So it’s expected that both the predefined set of English stopwords and the added list of French stopwords will be removed from the text to embed.
This short post explains how to explore a large proprietary corpus using a RAG strategy and emphasizes the importance of the chunking step of the pipeline.
(no, not the Scott Joplin Ragtime, sorry jazz lovers)
RAG stands for Retrieval Augmented Generation. It is a technique used to question and explore large collections of documents. RAG is a simple way to leverage LLMs on a proprietary corpus without resorting to more expensive and complex tuning strategies.
The initial preparation step consists in splitting the documents from your corpus into smaller parts called chunks. Chunking breaks down large text into smaller segments to optimize content relevance.
Chunking is followed by the embedding phase: computing a vector representation, aka embedding, of each chunk. Each text chunk becomes a vector.
There are multiple ways to chunk a document and compute embeddings.
These embeddings are then stored in a vector database such as weaviate. A vector database has 2 main roles: 1) storing the text and related vectors 2) enabling a super fast matching between 2 embeddings.
In short: chunk the documents, embed the chunks, store the embeddings.
Now we can start querying your corpus.
As the name indicates, the RAG pipeline then consists in 2 steps: retrieval and generation.
Given your query or question, the retrieval step computes its embedding and finds the closest matching chunks in the vector database.
The resulting chunks are used as the information that can help answer your question using an LLM and a properly structured prompt.
The prompt has the following structure: role, context and query:
Acting as
{insert specific role: teacher, analyst, nerd, author, ...}
Use the following information:
{insert resulting text chunks}
to answer this question:
{insert your initial question}
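In Python, assembling that prompt is plain string formatting. A small sketch with illustrative variable names:

``` python
role = "a literature teacher"
matching_chunks = ["<chunk 1>", "<chunk 2>"]   # chunks returned by the retrieval step
question = "Who are the contributors to the debate?"

context = "\n\n".join(matching_chunks)
prompt = (
    f"Acting as {role}\n"
    f"Use the following information:\n{context}\n"
    f"to answer this question:\n{question}"
)
```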
This is the overall RAG strategy. Whether it works or not will depend on multiple factors.
Chunking can be done at the sentence level, at the paragraph level, or with a fixed size (see langChain TextSplitters for instance). Some overlap between chunks is usually used to link consecutive chunks. Although the other steps (embedding, LLM) are worthy of attention, finding a good chunking strategy is where the challenge lies in order to get relevant answers.
A good chunking strategy must capture both meaning and context. Short chunks will preserve meaning but lack context while long chunks will tend to smooth out nuances of each sentence.
“Embedding a sentence focuses on its specific meaning, while embedding a paragraph or document considers overall context and relationships between sentences, potentially resulting in a more comprehensive vector representation but with the caveat of potential noise or dilution in larger input sizes.” Thx, chatGPT!
So when defining a good chunking strategy, keep in mind the nature of the documents and, above all, the type of queries users will ask. The idea is to align the chunking strategy with the user queries to establish a closer correlation between the embedded query and the embedded chunks.
Chunking is not a challenge that can be solved with an AI model. A simple python script will do. It’s a simple task that boils down to splitting large text into smaller parts. But it is the foundation that will make your RAG system generate quality answers.
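For instance, a bare-bones paragraph chunker with a one-paragraph overlap fits in a few lines (a sketch, not a production splitter):

``` python
def chunk_text(text, max_chars=2000, overlap=1):
    """Group paragraphs into chunks of roughly max_chars characters,
    overlapping by `overlap` paragraphs."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for p in paragraphs:
        current.append(p)
        if sum(len(x) for x in current) >= max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:]   # keep the last paragraph(s) as overlap
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```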
tl;dr: the enduring appeal of Molière’s plays lies in their ability to serve as mirrors reflecting universal human flaws and shortcomings. The humor in Molière’s characters, often perceived as mockery, actually prompts self-reflection and self-awareness, offering a form of absolution for common human imperfections.
One aspect of teaching Molière’s plays in schools is based on the archetypal nature of his most caricatured characters. The archetype of a given subject relies on a strong, characterized, recognizable representative image.
According to textbooks, the humor in Molière’s plays comes from the absurdity of characters with exaggerated traits. They are old, irritable, authoritarian, but above all, obsessive, often gullible, and always ridiculous. Harpagon for his greed, Géronte for his intransigence, Argan for his hypochondria and unwavering faith in quack medicine. And many others.
The spectator or reader laughs at the expense of the character, the target of their judgment, and constructed for this purpose by the author. A somewhat ungenerous laugh, a schoolyard laugh where we mock the weaker, the dumber, the older, the more ridiculous with a certain amount of malice.
But this type of humor cannot explain the timelessness of Molière’s works. Ridiculing others is childish and bitter. It destroys empathy.
If we return to the notion of archetype as an essential and universal trait of the collective unconscious Jungian archetypes, it seems far more likely that we are actually laughing at ourselves, at our less admirable traits and buried faults. It’s a nobler, thankfully, self-deprecating laughter that shines a light on our internal psychology. The archetypes in Molière’s plays are a mirror. We revisit and reread and continue to teach Molière because we are the subjects of his plays.
But we fail to really admit it clearly afterward. Although not that dreadful, these behaviors don’t exactly show us in our best light.
Reading these Molière plays carefully in order to adapt them into modern French, I realize with amusement that these caricatured characters also exist within me. I find myself, unintentionally, playing these famous scenes in real life. The quarrel scene between Sganarelle and his wife Martine is a recent good example. I am also not a stranger to nitpicking over pennies from time to time.
Molière’s humor sometimes borders on the grotesque, with scenes of beatings or vaudeville situations. But what makes Molière’s plays such classics, centuries after their publication, is primarily, it seems to me, their role as beneficial mirrors. They offer us absolution for our very human lapses.
This is a GPT supported translation of De qui Moliere se moque-t-il ?
tl;dr: I used GPT4 to generate modernized versions of Molière’s works. The goal is to facilitate access to these works which are omnipresent in the French school curriculum while vernacular French is evolving rapidly. By simplifying the structure of sentences, reducing the length of lines, and modernizing the vocabulary, we obtain simplified versions while maintaining the integrity of the original works.
LLMs (Large Language Models) like chatGPT have disrupted learning and teaching in just a few months.
It’s a known fact that LLMs hallucinate. An LLM does not know the subject it handles. The model can write a recipe but has never tasted chicken.
For example, when you ask chatGPT to summarize “Les Fourberies de Scapin” (Scapin the Schemer), you get a text that is certainly inventive but fundamentally false, seemingly containing pieces from “Le Médecin Malgré Lui.” (the Doctor despite himself):
“Scapin stands in the middle of the courtyard, in front of a large wooden crate, pretending to be a ‘doctor’ or a ‘sorcerer’ capable of curing the lovers’ woes.”
This will surely get you a zero, or F, from the French teacher.
Therefore, the relevant question is not the accuracy of the produced text but its plausibility. The strength of LLMs lies in their ability to generate text, not in aligning facts. Among the many use cases, these models can facilitate the understanding of classical texts by offering a simplified version of potentially challenging-to-understand texts.
Molière’s plays are ubiquitous in the French school curriculum. Molière is hailed as the embodiment of French genius in terms of theater, comedy, and also as a social critic whose modernity can never be questioned. His characters, Harpagon, Scapin, Sganarelle, are known to all French. His best lines are ingrained in the French psyche.
“One must eat to live, not live to eat,” “But what was he doing in that galley,” and many others.
The texts of these 17th-century works have already been adapted into French at some point. The raw texts from the 1660s would be quite challenging to understand today. However, this version is aging rapidly. It has become difficult for young (and not-so-young) generations to grasp. Comments on social media abound about the difficulty of understanding the texts and therefore the story.
I’m in 9th grade, and my French teacher asked me to read this book, but I didn’t understand anything.
Hence the idea of using chatGPT to translate and adapt Molière’s plays into modern French.
Our goal is to make these works easier to read and understand for a population that is increasingly less accustomed to reading books, and whose everyday French is rapidly diverging from the dusty standards of the French Academy.
Daring to mention the modernization or simplification of Molière’s texts instantly provokes a visceral and indignant rejection. The central accusation being the dumbing down and its corollary, the inexorable decline in educational standards. These arguments reek of “it was better in the past,” the anti-screen brigade, the agony of French in the face of English, and other fallacies about the education of yesteryears.
Simplifying the text would supposedly hasten a decline in students’ standards.
However, correlation (assumed) does not imply causation.
The Bible has been translated, Shakespeare has been adapted into contemporary English, so why can’t Molière be as well? The texts are not sacred. The approach is democratic. And the goal is clear: to facilitate access to classical theater plays.
Let’s clarify right away. This is not about summarizing the play, nor excessively simplifying the texts, and certainly not about making them sound youthful with supposed youth language.
Our goal, therefore, will be to simplify the structure of the sentences, shorten the lines and modernize the vocabulary.
We will preserve the plot, the characters, the line-by-line structure of the dialogue and, above all, the meaning of the original text.
And, of course, we won’t hesitate to keep the original text when it holds no particular difficulty.
Let’s take an example. In Act 1, Scene 1 of L’Avare (The Miser), Valère opens the play with these words:
« Hé quoi ! charmante Élise, vous devenez mélancolique, après les obligeantes assurances que vous avez eu la bonté de me donner de votre foi ? Je vous vois soupirer, hélas ! au milieu de ma joie ! Est-ce du regret, dites-moi, de m’avoir fait heureux ? et vous repentez-vous de cet engagement où mes feux ont pu vous contraindre ? »
“What! Lovely Élise, you are becoming melancholic, after the obliging assurances you had the kindness to give me of your faith? I see you sigh, alas! in the midst of my joy! Is it regret, tell me, for having made me happy? And do you repent of this commitment where my passions may have compelled you?”
It’s beautiful, flowery, and delightfully romantic!
The modern version reads:
« Pourquoi cette tristesse, Élise, après m’avoir assuré de ton amour ? Je te vois soupirer, est-ce du regret de m’avoir rendu heureux ? Regrettes-tu notre engagement ? »
or in English:
“Why this sadness, Élise, after assuring me of your love? I see you sigh, is it regret for making me happy? Do you regret our commitment?”
It’s less beautiful but way more straightforward.
In this example, we touch upon one of the main challenges of the exercise. The beauty of the original text, its style, rhythm, and tensions, all fall into the realm of flavor and music. There is poetry in these lines, even though the text is in prose. The modern version is much more plain in comparison, but it offers conciseness and clarity. What is lost in beauty is gained in efficiency.
With GPTs and LLMs, the prompt is everything. Without a prompt, there is no salvation. The prompt will dictate the quality of the result: form, format, style, and, most importantly, the preservation of meaning. We have worked with two models: GPT 3.5 and GPT 4 via the openAI API and tested numerous configurations and prompts.
The classic process of optimizing a machine learning model involves defining a metric to maximize by selecting the best model’s meta-parameters. This model must also be robust, meaning it performs consistently in the face of slight variations in input data. This iterative process allows testing multiple configurations and achieving the best possible results based on context, approach, and available data.
Such a process would give us a systematic approach to finding the best prompt. However, it remains challenging to implement in our text transformation context (Automatic Text Simplification (ATS)). This is for two reasons.
Firstly, the inherently random nature of LLMs makes the results inconsistent. The nature or quality of generated text varies depending on API request parameters, the prompt, and the input text to be modified. Even when setting the model’s temperature to zero and using an identical prompt, we cannot control the model’s response to incoming original text.
Secondly, complexity measures of a text are not suitable for our context of simplifying 17th-century text corpora.
We have used the LyngX library, which offers several psycholinguistic complexity metrics (DNT, IDT, …). Unfortunately, there appears to be no correlation between simplified lines and complexity scores obtained with these methods.
At this stage, building an automation for prompt selection for automatic text simplification seems to require more effort than we can afford. Our primary goal remains to quickly have publicly available modernized versions of Moliere’s plays.
For that reason we opted for a manual selection of prompts, models, and query parameters. In the end, after many trials, our prompts follow the format:
For example:
Rewrite the text in modern French:
- Basic vocabulary;
- Clear and short sentences;
- Reduce the paragraph length;
{text}
or
Write this text in simple and concise French style:
text:
{text}
where {text} is replaced by a line of dialogue, an entire scene, or an excerpt consisting of a series of lines of dialogue.
Translating line by line often results in loss of meaning since the model lacks awareness of context, or produces a result in narrative form ("Géronte speaks to his son and says this") instead of the dialogue ("Géronte: <the line>").
On the contrary, submitting each scene in its entirety leads to a reduction in the number of lines in the translated version.
This is one of the peculiarities of Molière’s texts: the ping-pong dialogues, consisting of a rapid exchange between two characters who repeat very similar lines. For example, in Act II, Scene 4 of Le Médecin Malgré Lui:
GÉRONTE: Vous donner de l'argent, Monsieur.
SGANARELLE: Je n'en prendrai pas, Monsieur.
GÉRONTE: Monsieur...
SGANARELLE: Point du tout.
GÉRONTE: Un petit moment.
SGANARELLE: En aucune façon.
GÉRONTE: De grâce!
SGANARELLE: Vous vous moquez.
GÉRONTE: Voilà qui est fait.
SGANARELLE: Je n'en ferai rien.
GÉRONTE: Hé!
which reads as:
GÉRONTE: To give you money, Sir.
SGANARELLE: I won't take any, Sir.
GÉRONTE: Sir...
SGANARELLE: Not at all.
GÉRONTE: Just a moment.
SGANARELLE: In no way.
GÉRONTE: Please!
SGANARELLE: You're joking.
GÉRONTE: There you go.
SGANARELLE: I won't do it.
GÉRONTE: Hey!
As the content varies little from one line to the next, the model, which has just been instructed to simplify the text, will reduce the number of lines. This leads to the loss of the valuable one-to-one equivalence between the original and modern versions.
We eventually opted for a middle ground between translating line by line and translating entire scenes, using a sliding window of 5, 10, 15 lines with an overlap of 2, 5, or 7 lines between queries. This provides the model with enough context to avoid the problems mentioned earlier.
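The sliding window itself can be sketched in a few lines of Python; scene_lines and the window / overlap values are illustrative:

``` python
def sliding_windows(lines, window=10, overlap=5):
    """Yield overlapping blocks of lines, one query per block."""
    step = window - overlap
    for start in range(0, max(len(lines) - overlap, 1), step):
        yield lines[start:start + window]

# scene_lines: the list of lines of dialogue of one scene
for block in sliding_windows(scene_lines, window=10, overlap=5):
    query_text = "\n".join(block)
    # ... submit query_text to the model with the simplification prompt
```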
However, this entails reviewing the different versions obtained to select the best one for each line in terms of the meaning discussed earlier. There is a degree of subjectivity in this selection process, which then becomes a traditional review and editing task.
In the end, for the play “Les Fourberies de Scapin” (771 lines), we kept ⅓ (232 lines) in their original version, manually rewrote 47 lines (6%). The remaining two-thirds were equally contributed by GPT 3.5 (224) and GPT 4 (268).
We achieve a modern version of the texts that fulfills all the previously stated objectives: simplification and shortening of sentences, refreshing vocabulary, phrases, and styles, switching from formal to informal language (tu/vous), all while preserving the meaning and maintaining a 1-1 equivalence.
The available plays currently include:
The texts are displayed in bilingual mode, with the revised text side by side with the original, allowing for comparison and reading of the modern version without losing the original version’s rhythmic and humorous qualities.
[2023 update] The course has been revamped and updated with new content. It’s an intro to NLP course with a focus on Embeddings. We cover what is now old tech (Word2Vec) but which is still very relevant in the age of attention and transformers. If you’re new to NLP this is a good way to start and get to work with SpaCy, NLTK and other standard NLP libraries.
[2021]
Intro to NLP course on Openclassrooms.
This could not have happened without OC’s amazing team, with a special shoutout to Alexandra. :)
The course covers basic BOW to static embeddings, glove style, with NLTK, Spacy and Gensim.
I tried to make the course more interesting and engaging by working on classic texts and funky song lyrics. Among other things we study the white rabbit in Alice in Wonderland, aliens in War of the Worlds and love and swords in Shakespeare.
Enjoy
Alexis
You work on a large dataset, let’s say over 1 GB. You do an analysis. And you want to share it so that other people can work on, reproduce or tweak your results.
Here are a few personal tips to make things easier for the poor schmuck / schmuckette who has to read your code.
Host your data on S3, Google Storage, Azure, Dropbox etc., whatever fits your mood as long as it can provide a unique URI.
Sharing datasets in an email, or in google drive is flaky and confusing. Drive is not the right place to host datasets. Space is limited, and access control can be hazy.
By hosting the dataset on the cloud:
When you share your notebook, the data is downloaded using this unique URI instead of
<this is my local path, don't forget to change it to your own local path>.
However, managing access permissions on specific items in the cloud can be a real pain.
By the way, pd.read_csv natively reads gzipped csv files :). Just add the compression='gzip' parameter:

df = pd.read_csv('S3_bucket/sample.csv.gz', compression='gzip')
Your script may be efficient, bug free, superbly commented etc., but still end up working only on your platform. I’ve had the case recently of a friend, not particularly python savvy, trying to open a 1.9 Gb text file on a windows machine and being faced with abstruse unicode errors. He was stuck. However, the same script worked like a charm on my mac.
So hosting the notebook on Google Colab will go a long way to make it reproducible without undue efforts.
If the data is large and running the whole notebook takes forever, it’s always a good idea to implement 2 modes, a sandbox one and a production one, switched by a simple flag. Something as simple as:

MODE = "demo"  # or "prod"

if MODE == 'demo':
    # subsample the large initial dataset
    # e.g. df = df.sample(frac=0.01)
    pass
else:
    # basically do nothing
    pass

# then the rest of the code and results etc ...
This way the recipient of your analysis can run the whole script quickly and start playing with the parameters and results right away instead of having to wait for loops or apply lambdas to finish.
You can choose whichever mode as the default one depending on your audience.
Following the single responsibility principle is an excellent practice when working with jupyter notebooks.
The single responsibility principle states that every module, class, function should have responsibility over a single part of the functionality provided by the script. wikipedia single responsibility principle
Applied to notebooks, it means that each cell should do one thing only. This allows the user to insert other cells to explore the resulting objects and data. Very useful.
The ruby community is very strong on that single responsibility principle with excellent results in terms of bug reduction, readability and maintainability of the code.
The more structure the better.
The main drawback of Jupyter notebooks is the lost-state problem, where a cell depends on previous runs of other cells which may have already been modified. So making sure everything works as intended, from importing the libraries to the final results, is essential before sharing.
I find this optional, but that’s just my ingrained laziness. See this post by Jake VanderPlas for more on the subject: Installing Python Packages from a Jupyter Notebook.
Can’t emphasize this one enough. I often spend significant amounts of time looking for synonyms that will convey precisely the true nature of an important variable to a reader, myself included. The time gained by abbreviating any variable will be lost a thousand fold later on when trying to figure out what the variable stands for.
Comments should focus on explaining the choices made in terms of methods and parameters. Not simply rephrasing the code.
Google has a longer, more precise list of excellent best practices when working on Google Colab.
A good paper on Ten simple rules for writing and sharing computational analyses in Jupyter Notebooks
Please drop me a line on twitter @alexip if you’d like to add something or comment an item.
Cheers!
Design Effective Statistical Models to Understand Your Data
In this course I explore linear, logistic and polynomial regression with hands on exercises, real-world use-cases and non trivial datasets.
Regression is the mother of all statistical models. Simple, flexible and highly interpretable.
To shine a new light on such a venerable topic, I decided to bridge the gap between the classic statistical approach and the machine learning one.
Regression in the statistical sense aims at modeling the inner dynamics of a dataset. The method uses multiple statistical tests to validate the relevance and reliability of the observations and results.
On the other hand, the machine learning approach strives to build models that perform well on previously unseen data. We no longer care about p-values, null hypothesis or statistical tests but focus instead on the performance of the trained model on new data.
The great thing is that we can use the same simple modeling techniques, linear regression to illustrate both approaches. Bridging the gap between statistical modeling and machine learning.
Here’s the outline of the course:
I - Understand the Fundamentals of Statistical Modeling
II - Build Linear Regression Models
III - Build Generalized Linear Models
IV - Build Resilient Predictive Models
The course notebooks are available on github: Notebooks for the course Design Statistical Models on OpenClassrooms
I had the chance of teaching a 2-week session of predictive analytics in Morocco in the fall of 2018 with students of the Emines school of industrial management in their fourth and final year.
In 2018, the renowned École Polytechnique, Mohammed VI Polytechnic University and the Foundation of École Polytechnique launched a Chair in “Data Science and Industrial Processes” in Morocco which I inaugurated on the amazing UM6P campus of the Emines school of industrial management located in Benguerir Morocco. This campus was designed by the famous architect Ricardo Bofill in 2011. It is quite a sight to see and watch.
The desired outcome of this 2-week course was to give the students a deep and practical understanding of data science and machine-learning in terms of scope and tools with a focus on building a hands-on data-driven experience.
For the students to be able to work on relevant data science problems within just two weeks of teaching is a real challenge. Data science is a multi-faceted activity and several domains have to be taught in parallel: software engineering in order to build robust and reproducible scripts, data science techniques and workflows to generate reliable results, probabilistic theory underlying machine-learning algorithms to go beyond the simple copy-pasting usage of popular libraries, and most of all, an engineer can-do attitude required to tackle real world problems. Data science demands pragmatism, resilience, creativity from students and experienced data scientists alike.
The course was organized to leave a large place to interactions and Q&A between the students and the teacher and to limit the amount of time dedicated to slide-based lectures. Every time a new machine-learning concept or modeling algorithm was introduced, it was followed by lab work applied to datasets of increasing complexity and scope. The goal was to expose the students to real-world issues that data scientists often face in predictive analytics projects. The course slides, datasets and scripts are available on github.
In parallel, the students participated in the Iowa Housing Kaggle competition to practice their newly acquired skills in a challenging project. Their enthusiasm was impressive as exclamations of joy broke silent phases of deep focus whenever someone reached a new high score.
To have had the privilege of launching this first session of the DSIP chair with such accomplished and convivial students and engineers was a fantastic experience and a great honor. Thanks to the excellence and enthusiasm of UM6P Emines students, the learning outcomes have been fully reached.
I will have the pleasure to close this year’s data science course in March 2019 by teaching a week-long course this time focused on probabilistic programming and PyMC3.
This post is about leveraging on demand capabilities of costly virtual instances on the Google Cloud Engine using startup scripts.
Here’s the situation: You’re working on some large dataset, and you feel the irresistible urge to release the Deep Learning beast on your models with VMs armed to the teeth with GPUs.
Since your local Macbook steps into the twilight zone every time you launch Keras, you decide to spin up a dragster style, GPU-powered VM on the Google Platform, AWS or Azure. Once the VM is ready, which in truth may take several days if it’s your first encounter with the CUDA Toolkit, you ssh into the VM and start working on your data and your models.
After a few hours of work, you’re still working on your scripts, cleaning up the data, training models, evaluating, and on and on. Time passes on while the earth pursues its never ending spin in the interplanetary void. When you realize your brain has as much jitsu left as a greek yogurt, you decide to call it a day and give the whole thing a rest. And of course, you sometimes forget to stop the instance. Your cash reserves leak out cent after cent, dollar/euro/pound after dollar/euro/pound throughout the night.
All these hours do add up. And at the end of the month you realize GPUs are way more expensive than you ever imagined. But hey, Deep Learning is really fun. Can’t stop now. Just need a few more hours. After all nobody really understands how these neural networks work, do they? And you really need to practice to be able to call yourself a Deep Learning expert. Please, just a few more hours of GPUs. Just 30 minutes, … I swear, … come on!
So what’s a data scientist to do?
One solution is to go back to random forests and SVMs and give up on the whole deep learning thing. After all, as Vladimir Vapnik says, Deep Learning is just brute force training with a whole lot of data.
The other solution is to make the most of the on demand promise in cloud computing.
The whole promise of cloud computing is that you can spin up and release resources as needed. Way back in the early 2000s that mostly meant being able to add servers on the fly to support your traffic exploding when BoingBoing or Gizmodo suddenly put your startup on their front page. But for machine learning the same on demand concept is relevant when extra high computing power is needed. When working with Deep Learning, most of the mundane work of data cleaning and shaping can probably be carried out on your local machine or a low level VM. The only time GPU enabled VMs are truly needed is to train the Deep Learning models.
Which means that a resource-conservative workflow should look like this: 1) explore, clean and shape the data locally, 2) write and test the training script on a small sample of the data locally, 3) spin up the GPU-powered VM and run the training script on it, 4) retrieve the trained models and results, then shut down or delete the VM.
Here, Local can be replaced by a smaller, less powerful VM running on CPUs and not GPUs.
With this workflow, you only spend money on expensive cloud resources on steps 3) and 4), potentially saving you significant amounts of cash at the end of the day. If you can manage to shutdown or even delete the VM once the script has finished running then you won’t even run the risk of leaving it running all through the night! Brilliant!
So in order to limit our resource usage, we need to be able to create a VM on the fly, run a script on it and terminate the VM once the script is done.
And we should do that (create, run, terminate) every time we want to test a new version of the dataset, a new DL architecture or new parameters. We could even potentially run several trials in parallel.
Let’s go!
I assume here that you already have created a VM and installed everything needed to run your scripts. Things like the conda distribution, scikit-learn, keras, GPUs and so forth. See How to setup a VM for data science on GCP and Launch a GPU-backed Google Compute Engine instance for more details (also this one) to install the CUDA Toolkit and the cuDNN library.
For those in a hurry, here’s an example of the command line for creating a preemptible VM on Google Cloud Engine (Ubuntu 17.10 with 50gb disk space). Preemptible VMs are temporary but way way cheaper than non preemptible ones. Great for research work not so much for production APIs.
gcloud beta compute --project "<project_name>" instances create "<instance_name>" --zone "us-east1-c" \
--machine-type "n1-standard-1" --subnet "default" --no-restart-on-failure --maintenance-policy "TERMINATE" \
--preemptible --service-account "<your_service_account>" --image "ubuntu-1710-artful-v20180126" \
--image-project "ubuntu-os-cloud" --boot-disk-size "50" --no-boot-disk-auto-delete \
--boot-disk-type "pd-standard" --boot-disk-device-name "<disk_name>"
In the above and quite longish command, the non obvious but important flags are:

- --no-boot-disk-auto-delete: the disk will not be deleted when the instance is deleted
- --preemptible: makes the VM temporary and saves money
- --service-account "<your_service_account>": a service account is used by your VM to interact with other Google Cloud Platform APIs. The default service account is identifiable with the email [PROJECT_NUMBER]-compute@developer.gserviceaccount.com where the [PROJECT_NUMBER] can be found on your project dashboard.

So let’s assume that you have created your initial disk and deleted the associated VM. The disk is now free to be used to create a new VM.
The following command line creates a preemptible VM from that disk
gcloud compute instances create <instance name> --disk name=<disk name>,boot=yes --preemptible
Ok so we can create a VM on the fly based on that disk. Now we want to run a script on that VM. Let’s say a python script.
The simplest way is to use the SSH command with the --command flag:
gcloud compute ssh <instance name> \
--command '{Absolute path to }/python {absolute path to}<the script>.py'
and lo and behold, that command will display the output of the remote python script on your local terminal. Try for instance gcloud compute ssh <instance name> --command 'ls -al'.
A more sophisticated way is to automatically run a script (shell, python, R, whatever…) when the VM is created by using startup scripts
For instance, we could want to run the following shell script on creation
#! /bin/bash
sudo apt-get update
printf '%s %s Some log message \n' $(date +%Y-%m-%d) $(date +%H:%M:%S) >> '{absolute_path}/startup_script.log'
# add and activate the github keys
eval "$(ssh-agent -s)"
ssh-add {path to github key}
# log script start
cd {path to application folder}
# git update application
git pull origin master
# run script
{path to }/python {absolute path to}/<the script>.py
That script activates the github keys, updates the application from github, and runs the script. Pretty neat. Creating the VM and making sure the VM runs that script is just an extension of the above VM creation command line. But first you need to make the script available to the VM by uploading it to Google Storage with gsutil:
gsutil cp {local}/<shell script> gs://{bucket name}/
For more gsutil examples, check my post on https://alexisperrier.com/gcp/2018/01/01/google-storage-gsutil.html.
And now we can create the VM and have the script run on start. The following command line will attach the script to the VM as a startup script. Every time the VM is started the script will be executed.
gcloud compute instances create <instance name> --disk name=<disk name>,boot=yes --preemptible \
--scopes storage-ro \
--metadata startup-script-url=gs://{bucket name}/<shell script>
So we are now able to create a VM on the fly from an existing disk, attach a startup script hosted on Google Storage, and have that script run automatically when the VM starts.
We just need to find a way to terminate that VM once the script has finished running. This is where things get tricky mainly because it can be difficult to know with certainty when the script has finished running.
The simplest solution is to add the following shutdown line at the end of the startup script
shutdown -h now
This forces the VM to stop once the script is done. The VM still exists and is not deleted. In terms of pricing, this might just be enough as the VM is not billed when it’s idle. From the google doc: Instances that are in a TERMINATED state are not charged. The associated disks, IPs, and other resources are still billed but not the VM.
It’s also possible to not only shut down the VM but also delete it from within the VM. In other words, having the VM commit seppuku. I mean, especially if the script fails to run, the VM should take the blame for having failed its master (you) and rightly end itself; makes sense, in a way, maybe. The following code is from this thread. It can run from within the VM. It requires a bit of tuning of the service account scopes for it to have the proper permissions and actually work.
gcloud compute instances delete $myName --zone=$myZone --quiet
where the name of the VM comes from myName=$(hostname)
and the zone from
# Get the zone
zoneMetadata=$(curl "https://metadata.google.internal/computeMetadata/v1/instance/zone" -H "Metadata-Flavor:Google")
# Split on / and get the 4th element to get the actual zone name
IFS=$'/'
zoneMetadataSplit=($zoneMetadata)
myZone="${zoneMetadataSplit[3]}"
The only problem with that approach (stopping or deleting from the startup script) is that every time you start the VM, well, it will run the script and shut down, preventing you from ssh’ing into it to check the logs, make some modifications or inspect the results. Bummer!
The other solution would be to have your main model training script write a status update to a file or an external storage bucket or database, and from your local machine, regularly check that status before deciding to send the shutdown command.
In the end, although a bit hacky, I think this is the best solution. Your whole workflow now becomes: create the VM with its startup script, let the training script report its status to a storage bucket, poll that status from your local machine, and delete the VM once the script reports that it is done.
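One way to implement that local conductor is a small polling loop. The bucket path, status file convention and instance name below are hypothetical:

``` python
import subprocess
import time

STATUS_URI = "gs://my-bucket/training_status.txt"   # written by the training script
INSTANCE, ZONE = "my-gpu-vm", "us-east1-c"

while True:
    status = subprocess.run(
        ["gsutil", "cat", STATUS_URI], capture_output=True, text=True
    ).stdout.strip()
    if status == "DONE":
        # the training script reported completion: delete the VM
        subprocess.run(
            ["gcloud", "compute", "instances", "delete", INSTANCE,
             "--zone", ZONE, "--quiet"]
        )
        break
    time.sleep(300)   # check every 5 minutes
```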
Note: Instead of using a startup script, you could also simply run the python model code using the gcloud compute ssh --command '<python script>' command in conjunction with the local conductor shutdown script. But I feel the use of startup scripts basically dedicates the VM to that usage and that usage only. Similarly to writing good quality code, where a method or function should do only one thing at a time, my feeling is that a VM should be used for one goal and one goal only. After all you can have as many VMs as you like as long as they are idle. Disk prices are low and usually not a problem.
The whole point behind using cloud resources is to leverage their at-will / on-demand capabilities to reduce costs. Doing so requires using startup scripts and some external monitoring to shut down the VM once the task has been completed.
This is of course just one way of doing things.
Please let me know in the comments, how YOU manage costly on demand instances. And thanks for reading this post until the end.