RAG is a simple technique to explore a large corpus of documents that the LLM does not know about. But RAG makes a bold assumption: the answer to the query is present in the matching chunks of text. This is not always the case when answers are spread out across the entire text.
Another issue with RAG is the difficulty of evaluating its performance. Multiple elements of the pipeline have a significant impact on the relevance of the answers to our queries. As in the standard experiment loop in machine learning, we need to be able to evaluate a RAG pipeline with a single, simple score.
In this post, we add some steps to the RAG pipeline to tackle both the assumption that answers are contained within a narrow perimeter of the text and the lack of an evaluation method. We use Langchain and Weaviate.
RAG stands for Retrieval Augmented Generation, a technique used to question and explore large collections of documents by feeding relevant extracts of those documents to an LLM.
Given a large corpus composed of hundreds or thousands of text files, we want to explore the corpus in natural language. We have a question Q.
A standard RAG process goes as follows.
The embedding / preparation part: split the documents into chunks, compute an embedding for each chunk, and store chunks and embeddings in a vector database.
The retrieval part, given a query: embed the query and retrieve the chunks whose embeddings are closest to the query embedding.
And finally the generative part: build a prompt that contains the retrieved chunks as context plus the query, and ask the generative model to answer.
Note that the process requires two models: an embedding one and a generative one.
In short, we use a short extract of one of the documents in the corpus as context within the generative prompt to help the generative model answer the question.
For example, asked "Who is Harpagon's son?", the retrieval step returns a paragraph summarizing the play, and the generative model answers "Cléante" based on that paragraph.
We therefore make the assumption that the answer to the query is contained in the set of matching chunks given to the generative prompt as context. However, that is not always the case.
Call it the coherence assumption: the information needed to answer the query sits in one coherent, cohesive span of text.
Consider, for instance, a report on a public debate from some institution. The debate happened very recently and the generative model cannot know its content. The available report comes down to a 50-page PDF file, too long to be given as a whole to the generative prompt.
Now, consider the question: who are the contributors to the debate?
A simple query that can be solved with a simple python script. If the document is properly structured, finding the answer does not even require any NER (named entity recognition) task. A simple regex could work.
The RAG system will find a set of matching chunks or paragraphs from the pdf report. And assuming the prompt is efficient, the output answer will consist of the contributors mentioned in that set of chunks, obviously missing the contributors that happen to be mentioned in other parts of the report.
Now consider more abstract queries related to ideas, arguments or opinions that can be spread out over multiple chunks. The problem becomes more critical.
The power of machine learning comes from improving the model through a tight, fast try-and-score loop. The data scientist establishes a baseline, then experiments with different settings, models or data and scores each experiment. Iteration is central to the process. Having a single score to evaluate the performance of the model is key for this process of fast iterations and improvements to take place.
In the context of LLMs, with RAG or other applications, evaluating the relevance of the output answer is not so straightforward. A manual evaluation is time consuming and possibly biased.
To remedy this problem and to be able to implement concise experiment loops, we need to add an extra step to the RAG pipeline: generate a baseline set of question and answer pairs directly from the chunks with an LLM, embed these pairs and compute the distance between each question and its answer.
This gives us a testing set on which we can evaluate any RAG setup and experiment.
We can expand on this benchmarking step with:
The assumption made here is that the distance, the score, between the Q and A embeddings is a good proxy for the relevance of the RAG-generated answer to a question over that corpus. This is, after all, the main feature of embeddings: their ability to capture meaning.
Consider a corpus composed of texts on Molière, his life and some of his plays. The texts are in French and related to the play L'Avare, or The Miser in English.
The corpus is composed of 5 documents obtained from these urls
We used trafilatura to download the text version of these pages and edited each file so that paragraphs are separated by a double line return “\n\n”.
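As an illustration, the download step can be sketched as follows. This is a minimal sketch with a hypothetical url list; trafilatura exposes fetch_url and extract for exactly this purpose.

``` python
import trafilatura

# hypothetical list of pages on Molière and l'Avare
urls = ["https://example.org/moliere-l-avare-analyse"]

for i, url in enumerate(urls):
    downloaded = trafilatura.fetch_url(url)   # raw HTML
    text = trafilatura.extract(downloaded)    # main text content, boilerplate removed
    with open(f"corpus/doc_{i:02d}.txt", "w", encoding="utf-8") as f:
        f.write(text)
```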
Although chunking seems like a simple task, it comes in many flavors. Langchain implements several text splitting methods: tiktoken via OpenAI, NLTKTextSplitter, a GPT2TokenizerFast via Huggingface and a Recursive Text Splitter. https://api.python.langchain.com/en/stable/text_splitter/langchain.text_splitter.CharacterTextSplitter.html#langchain.text_splitter.CharacterTextSplitter.from_huggingface_tokenizer
The respective behavior of these text splitters, and in particular the impact of the chunk_size and chunk_overlap parameters, can be difficult to grasp.
Here’s a good Stack Overflow explanation on that topic.
And here is a streamlit based demo app that compares some of the different chunking methods.
In the end, we found that the separator parameter set by default to “\n\n” is what matters most in how the chunks are produced.
To establish the baseline Q&A, we need chunks with enough information for the LLM to be able to generate questions and associated answers.
We choose the tiktoken splitter with chunks of 600 tokens, overlapping by 100 tokens and we set the separator to the default value, a double line return “\n\n”.
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=600,
    chunk_overlap=100,
    separator="\n\n",
)
This gives us between 5 and 25 chunks for the documents.
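For reference, here is a minimal sketch of how these chunk counts can be obtained with the splitter defined above (the corpus folder name is hypothetical):

``` python
from pathlib import Path

for path in sorted(Path("corpus").glob("*.txt")):
    text = path.read_text(encoding="utf-8")
    # text_splitter is the CharacterTextSplitter configured above
    chunks = text_splitter.split_text(text)
    print(path.name, len(chunks))
```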
For each chunk, we ask the LLM to write a question and answer pair based on the text. We could directly use the langchain QAGenerationChain class, but the langchain prompts are in English and the chain can be unstable when trying to produce valid JSON. So we'll implement our own chain.
We take inspiration from the langchain prompts for QA generation and use the following French one instead:
template = """
Tu es un professeur de français au lycée.
Tu dois écrire une paire de question réponse pour une interrogation écrite afin de tester les connaissances de tes élèves.
Ta réponse doit suivre le format JSON suivant
```
{{
    "question": "$LA_QUESTION_ICI",
    "réponse": "$LA_REPONSE_ICI"
}}
```
Tout ce qui est entre les ``` doit être du JSON valide.
Propose une paire question/réponse, dans le format JSON spécifié, pour le texte suivant :
----------------
{text}
"""
The prompt follows the usual format
- role
- task description
- output format
- and actual task
The script generates a question/answer pair for a given chunk.
``` python
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain, SequentialChain

llm_model = "gpt-4-1106-preview"

prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(temperature=0.9, model=llm_model)

chain = LLMChain(llm=llm, prompt=prompt, output_key="response", verbose=False)
overall_chain = SequentialChain(
    chains=[chain],
    input_variables=["text"],
    output_variables=["response"],
    verbose=True,
)

# chunk holds the text of one chunk produced by the splitter
response = overall_chain({"text": chunk})
print(response["response"])
```
The output is satisfying.
Here’s a chunk (translated into English via Deepl.com):
In the first place, the moralist dramatist evokes the prodigality of the son, who of course reacts to his father's penny-pinching.
But in his rejection of his father's excesses, Cléante also indulges in excess.
Molière pokes fun at young people's propensity to follow expensive fashions and live in style.
Not without a touch of humor, Molière places in his father's mouth a reproach of precious style:
"you give furiously in the marquis".
It's true that wigs and ribbons are overpriced by the merchants who take advantage of the windfall.
And the generated Q&A is:
{
"question": "According to the text, what vice does Molière criticize in the son Cléante in addition to the father's avarice?",
"answer": "Molière criticizes Cléante's profligacy. He mocks his tendency to follow expensive fashions and lead an expensive lifestyle."
}
Good job GPT4 !
We still need to score these pairs.
Since we’re working on a French corpus, we have a decision to make: either use an English-centric embedding model such as OpenAI’s ada-002, or a French-flavored one such as CamemBERT or FlauBERT.
For the sake of simplicity of implementation, we will use OpenAI’s ada-002. A prior manual comparison of ada-002 vs CamemBERT on the retrieval phase for Molière’s plays did not show a significant advantage for the French-based model. For now, the important thing is the consistency of the process; we will evaluate the embedding model in a future experiment. In fact, being able to evaluate the embedding model is what triggered this work in the first place.
Let’s use the weaviate vector store and database to score the Q&A.
We will embed each baseline question and each baseline answer and store them in the vector database, along with the original document names as meta information.
When searching the vector db for the answer matching each question, weaviate returns not only the answer from the original pair but also a matching score.
Weaviate offers 3 scorings: score, distance and certainty, see this page. By default the distance is the cosine distance.
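Once the collections described below are populated, a near_text query returns that distance directly. Here is a minimal sketch with the v4 Python client; the exact import path of MetadataQuery has moved between client versions, and the query string is simply one of the baseline questions.

``` python
import weaviate
import weaviate.classes as wvc

client = weaviate.connect_to_local(
    port=8080, grpc_port=50051,
    headers={"X-OpenAI-Api-Key": "<your key>"},
)
answers = client.collections.get("Answer")

# closest answer to a baseline question, with the cosine distance
result = answers.query.near_text(
    query="Quel vice Molière critique-t-il chez Cléante ?",
    limit=1,
    return_metadata=wvc.query.MetadataQuery(distance=True),
)
best = result.objects[0]
print(best.properties["answer"], best.metadata.distance)
```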
The next step is to create the Question and Answer collections. In the Answer collection, the answer property is vectorized, whereas in the Question collection, the question property is vectorized.
Answer collection
Question collection
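A minimal sketch of how these two collections might be declared with the v4 client. The property names follow the columns described above, and the keyword names (skip_vectorization in particular) may differ slightly between client versions.

``` python
import weaviate
import weaviate.classes as wvc

client = weaviate.connect_to_local(port=8080, grpc_port=50051,
                                   headers={"X-OpenAI-Api-Key": "<your key>"})

for name, text_property in [("Question", "question"), ("Answer", "answer")]:
    client.collections.create(
        name=name,
        vectorizer_config=wvc.Configure.Vectorizer.text2vec_openai(),
        properties=[
            # only the question (resp. answer) text is vectorized
            wvc.Property(name=text_property, data_type=wvc.DataType.TEXT),
            # document name kept as meta information, not vectorized
            wvc.Property(name="document", data_type=wvc.DataType.TEXT,
                         skip_vectorization=True),
        ],
    )
```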
Note: the optimality of the baseline is unknown at this stage. Since the questions and answers are derived directly from the same chunk, the distance should be minimal. But since we’re dealing with LLMs, we may find a way to get better Q&A matches with a different setup. Also note that the optimal, lowest-distance answer to a question would be the question itself. pffff mind blown!
Since we took large enough chunks, each chunk may contain more information than what is addressed by the question. So the embeddings should be further away from the question than if the chunk only contained the expected information. So it’s possible that the score of (Q, A) given c is not optimal (minimal distance between Q and A).
I always feel that putting a problem into equations clears up things. But feel free to skip that paragraph and go to the implementation part.
Consider a corpus split into chunks by a text splitter \(C\), an embedding model \(E\), a generative LLM \(M\) and a prompt \(P\) used to generate the baseline question/answer pairs.
From the set of chunks \(\{c\}_C\), we derive, for each chunk \(c\): a question \(Q\) and its answer \(A\) generated by \(M\) with prompt \(P\), their embeddings \(E(Q)\) and \(E(A)\), and the distance \(d(E(Q), E(A))\).
The baseline is the set composed of the questions, the answers, their embeddings and the distances: \(\{ (c, Q, A, E(Q), E(A), d(E(Q), E(A))) \}\).
Consider now a RAG pipeline that we want to evaluate. The RAG pipeline has a text splitter \(C'\), embedding model \(E'\), LLM \(M'\) and prompt \(P'\) for the generative part.
A RAG pipeline is fully defined by the combination of these elements \(C', E', M', P'\).
The RAG process can then be written as the following sequence of steps.
Retrieval step: for a genuine user query \(Q\), compute its embedding \(E'(Q)\) and retrieve the chunk \(c'\) whose embedding is closest to \(E'(Q)\).
Generation step: build the prompt \(P'\) from the retrieved chunk \(c'\) and the query \(Q\), and let the model \(M'\) generate the answer \(A'\).
By applying the RAG process to our set of baseline questions \(\{Q\}\), we obtain the set of distances between the questions \(\{Q\}\) and the generated answers according to the embedding model \(E'\).
For each \(Q\) from the set of baseline question / answer pairs, the pipeline produces an answer \(A'\) and the distance \(d(E'(Q), E'(A'))\).
The experiment results consist of the set \(\{ (c', Q, A', E'(Q), E'(A'), d(E'(Q), E'(A'))) \}\).
We can therefore evaluate the performance of the RAG pipeline \(C', E', M', P'\) by comparing the baseline and the experiment distances :
\[S_b = \{ d(E(Q), E(A)) \}\] vs \[S_e = \{ d(E'(Q), E'(A')) \}\] where \(S_b\) and \(S_e\) are the sets of scores obtained for the baseline and for the experimental / evaluated RAG.
Now 3 things can happen:
\(S_b \approx S_e\): the experiment generates answers that are as close to the questions as the answers directly generated from the text. This can be interpreted as a good result.
\(S_b > S_e\): the experimental RAG pipeline finds answers to the original questions that are closer than the answers derived directly from the corpus. This also indicates a good result.
\(S_b < S_e\): the baseline scores are lower than the experiment RAG scores. The experiment answers are not as close to the questions as the baseline answers. More work is required to improve the performance of the RAG pipeline.
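In practice, comparing the two sets of distances can be as simple as comparing their means (or running a paired test). A small sketch, where baseline_distances and experiment_distances are the lists of d() values collected above:

``` python
import numpy as np

s_b = np.mean(baseline_distances)    # d(E(Q), E(A)) over the baseline pairs
s_e = np.mean(experiment_distances)  # d(E'(Q), E'(A')) over the same questions

if s_e <= s_b:
    print(f"RAG pipeline on par with or better than the baseline ({s_e:.3f} <= {s_b:.3f})")
else:
    print(f"RAG pipeline needs more work ({s_e:.3f} > {s_b:.3f})")
```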
Let’s take that for a spin
We already have a baseline set of Q&A. Now we need embeddings and distances.
We’re going to use weaviate as a vector store. Weaviate provides both the storage of texts and vectors and the vectorization itself through built-in vectorizer modules, as well as fast similarity search.
A collection is the equivalent of a table in a standard SQL database.
A collection is composed of properties (columns). You can specify which properties are vectorized or not. Properties that are vectorized are concatenated together before the vector is produced. See Configure semantic indexing for details.
We will create 2 collections, one for the answers from the baseline Q&A and one for the answers from the experiment.
We use a local installation of the datastore. See How to install Weaviate
Stay tuned for the rest of the implementation of the pipeline. This is a work in progress and some details need to be ironed out before I can publish the code.
Just to recap: the main idea behind this evaluation method is to create a baseline of Q&As in order to score a given RAG pipeline composed of 4 elements: chunkizer, vectorizer, generative model and prompt. This gives us a way to optimize the pipeline by changing these elements and monitoring the pipeline score.
However, all this rests on another assumption, a huge one in fact. The underlying assumption is that the distance between embeddings of the baseline questions and the different answers produced by the RAG pipelines measures the relevance of the answer with regard to the question.
This works in simple cases where the answer is short and very specific.
But when the answers are more complex, the distance will depend heavily on the words shared between the question and the answer.
For instance, take the question "Why is Paris a good tourist destination?" and the answer "Paris is a good tourist destination because of its many restaurants and museums". Most of the words in the question are also in the answer, and the resulting distance will be lower than for this answer:
- why is Paris a good tourist destination
- The city offers a great selection of restaurants and museums
The RAG method also assumes that the retrieved chunk is pertinent to the question. It is the responsibility of the retrieval phase to find the most relevant chunk of text with regard to the question.
But when the retrieved chunk brings the wrong information, the generated answer will not be correct.
Since retrieval is highly dependent on the wording of the question, nothing ensures that the retrieved chunk works as expected.
For instance, consider a debate on a hot topic where chunks can either reflect a position P or its contrary N.
Depending on the words in the question, the retrieved chunks will lean towards one side, P, or the other, N, and the generated answer will reflect the bias in the chunks.
This happened a few times in my experiment when the chunks were the summaries of the different acts of the play and the question was about the reasons behind the behavior of one of the characters in a given act.
“In act I, why does Harpagon quarrel with his son?”
As the plot evolved between acts, Harpagon kept quarreling with his son but for different reasons, so depending on which act summary was retrieved, the answer ended up different.
Finally, depending on its wording, a wrong answer sometimes ended up with a better score than a more relevant one.
On the positive side, the method makes it possible to estimate the impact of using the extra information as context within the answer-generating prompt. All other things being equal, the answer generated by the prompt with the chunk had a lower distance to the question than the answer generated without the chunk.
tl;dr: Here’s the solution: update the collection configuration with
collection.config.update(
    # Note, use Reconfigure here (not Configure)
    inverted_index_config=wvc.Reconfigure.inverted_index(
        stopwords_additions=["le", "la", "il", "elle"]
    )
)
and check with
client.collection.get()
I work on a French corpus composed of Moliere’s plays and use weaviate as a vector database to store the embeddings. Weaviate also does the vectorization of the text provided you specify a vectorizer.
An important element when embedding text is of course the presence of stopwords. This is particularly important when dealing with long chunks of text where stopwords tend to obfuscate the meaning, the signal, contained in that text.
Weaviate has recently released a V4 of its Python API, and the documentation is not yet finished (as of 12/2023).
In terms of stopwords the default set is predefined to the English language. There is no set available for French, Spanish or any other non-English language.
The way to define a specific list of stopwords is to set it explicitly with the stopwords_additions parameter when creating the collection with the Python V4 API. The Python V3 API documentation on collection / class creation is quite exhaustive.
So the other day I found myself stuck trying to add that list of French stopwords and not finding the related documentation.
I found the solution after a good night’s sleep.
Creating a collection in Weaviate goes as follows
collection = client.collections.create(
    name=<collection_name>,
    vectorizer_config=vectorizer,
    generative_config=wvc.Configure.Generative.openai(),
    properties=[
        <list of properties>
    ]
)
where the vectorizer can be OpenAI’s ada-002 model
import weaviate.classes as wvc
vectorizer = wvc.Configure.Vectorizer.text2vec_openai(vectorize_class_name = False)
I’m using a local install of weaviate so the client is instantiated with
import weaviate
client = weaviate.connect_to_local(port=8080, grpc_port=50051,
headers={ <specify your keys to OpenAI, Huggingface etc > }
)
To specify stopwords you need to use the wvc.Configure.inverted_index(**params) function with the right parameters.
All the values below are the default ones except for the stopwords_additions key.
params = {
    "bm25_b": 0.75,
    "bm25_k1": 1.2,
    "cleanup_interval_seconds": 60,
    "index_timestamps": False,
    "index_property_length": False,
    "index_null_state": False,
    "stopwords_preset": None,
    "stopwords_additions": list_stopwords,
    "stopwords_removals": None,
}
Then create your collection, setting inverted_index_config=wvc.Configure.inverted_index(**params):
collection = client.collections.create(
    name=<collection_name>,
    vectorizer_config=vectorizer,
    generative_config=wvc.Configure.Generative.openai(),
    inverted_index_config=wvc.Configure.inverted_index(**params),
    properties=[
        <list of properties>
    ]
)
You can also skip the inverted_index_config parameter when creating the collection and update the config later with
collection.config.update(
    # Note, use Reconfigure here (not Configure)
    inverted_index_config=wvc.Reconfigure.inverted_index(
        stopwords_additions=["le", "la", "il", "elle"]
    )
)
To check that the config has been updated, you can list and check the collection config with
client.collection.get()
In my case, with a small list of French stopwords this returns
_CollectionConfig(name='Test', ...,
inverted_index_config=_InvertedIndexConfig(
bm25=_BM25Config(b=0.75, k1=1.2),
...
stopwords=_StopwordsConfig(
preset=<StopwordsPreset.EN: 'en'>,
additions=['il', 'elle', 'je', 'tu', 'nous', 'vous', 'ils', 'elles'],
removals=None
)),
<some other info>
)
Notice that although I specified "stopwords_preset": None in the params, the stopwords preset has still been set to <StopwordsPreset.EN: 'en'>. I could not find a way to set the preset to None, although this is a valid value according to the documentation.
So it’s expected that both the predefined set of English stopwords and the added list of French stopwords will be removed from the text to embed.
This short post explains how to explore a large proprietary corpus using a RAG strategy and emphasizes the importance of the chunking step of the pipeline.
(no, not the Scott Joplin Ragtime, sorry jazz lovers)
RAG stands for Retrieval Augmented Generation. It is a technique used to question and explore large collections of documents. RAG is a simple way to leverage LLMs on a proprietary corpus without resorting to more expensive and complex tuning strategies.
The initial preparation step consists in splitting the documents from your corpus into smaller parts called chunks. Chunking breaks down large text into smaller segments to optimize content relevance.
Chunking is followed by the embedding phase: computing a vector representation, aka embedding, of each chunk. Each text chunk becomes a vector.
There are multiple ways to chunk a document and compute embeddings.
These embeddings are then stored in a vector database such as weaviate. A vector database has 2 main roles: 1) storing the text and related vectors 2) enabling a super fast matching between 2 embeddings.
In short: chunk the documents, embed the chunks, store the embeddings.
Now we can start querying your corpus.
As the name indicates, the RAG pipeline then consists in 2 steps: retrieval and generation.
Given your query or question, the retrieval step computes its embedding and finds the closest matching chunks in the vector database.
The resulting chunks are used as the information that can help answer your question using an LLM and a properly structured prompt.
The prompt has the following structure: role, context and query:
Acting as
{insert specific role: teacher, analyst, nerd, author, ...}
Use the following information:
{insert resulting text chunks}
to answer this question:
{insert your initial question}
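In Python, assembling that prompt is plain string formatting. A small sketch with illustrative variable names:

``` python
role = "a literature teacher"
matching_chunks = ["<chunk 1>", "<chunk 2>"]   # chunks returned by the retrieval step
question = "Who are the contributors to the debate?"

context = "\n\n".join(matching_chunks)
prompt = (
    f"Acting as {role}\n"
    f"Use the following information:\n{context}\n"
    f"to answer this question:\n{question}"
)
```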
This is the overall RAG strategy. Whether it works or not will depend on multiple factors.
Chunking can be done at the sentence level, at the paragraph level, or with a fixed size (see langChain TextSplitters for instance). Some overlap between chunks is usually used to link consecutive chunks. Although the other steps (embedding, LLM) are worthy of attention, finding a good chunking strategy is where the challenge lies in order to get relevant answers.
A good chunking strategy must capture both meaning and context. Short chunks will preserve meaning but lack context while long chunks will tend to smooth out nuances of each sentence.
“Embedding a sentence focuses on its specific meaning, while embedding a paragraph or document considers overall context and relationships between sentences, potentially resulting in a more comprehensive vector representation but with the caveat of potential noise or dilution in larger input sizes.” Thx, chatGPT!
So when defining a good chunking strategy, keep in mind the nature of the documents and, above all, the type of queries users will ask. The idea is to align the chunking strategy with the user queries to establish a closer correlation between the embedded query and the embedded chunks.
Chunking is not a challenge that can be solved with an AI model. A simple python script will do. It’s a simple task that boils down to splitting large text into smaller parts. But it is the foundation that will make your RAG system generate quality answers.
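For instance, a bare-bones paragraph chunker with a one-paragraph overlap fits in a few lines (a sketch, not a production splitter):

``` python
def chunk_text(text, max_chars=2000, overlap=1):
    """Group paragraphs into chunks of roughly max_chars characters,
    overlapping by `overlap` paragraphs."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for p in paragraphs:
        current.append(p)
        if sum(len(x) for x in current) >= max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:]   # keep the last paragraph(s) as overlap
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```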
tl;dr: the enduring appeal of Molière’s plays lies in their ability to serve as mirrors reflecting universal human flaws and shortcomings. The humor in Molière’s characters, often perceived as mockery, actually prompts self-reflection and self-awareness, offering a form of absolution for common human imperfections.
One aspect of teaching Molière’s plays in schools is based on the archetypal nature of his most caricatured characters. The archetype of a given subject relies on a strong, characterized, recognizable representative image.
According to textbooks, the humor in Molière’s plays comes from the absurdity of characters with exaggerated traits. They are old, irritable, authoritarian, but above all, obsessive, often gullible, and always ridiculous. Harpagon for his greed, Géronte for his intransigence, Argan for his hypochondria and unwavering faith in quack medicine. And many others.
The spectator or reader laughs at the expense of the character, the target of their judgment, and constructed for this purpose by the author. A somewhat ungenerous laugh, a schoolyard laugh where we mock the weaker, the dumber, the older, the more ridiculous with a certain amount of malice.
But this type of humor cannot explain the timelessness of Molière’s works. Ridiculing others is childish and bitter. It destroys empathy.
If we return to the notion of archetype as an essential and universal trait of the collective unconscious Jungian archetypes, it seems far more likely that we are actually laughing at ourselves, at our less admirable traits and buried faults. It’s a nobler, thankfully, self-deprecating laughter that shines a light on our internal psychology. The archetypes in Molière’s plays are a mirror. We revisit and reread and continue to teach Molière because we are the subjects of his plays.
But we fail to really admit it clearly afterward. Although not that dreadful, these behaviors don’t exactly show us in our best light.
Reading these Molière plays carefully in order to adapt them into modern French, I realize with amusement that these caricatured characters also exist within me. I find myself, unintentionally, playing these famous scenes in real life. The quarrel scene between Sganarelle and his wife Martine is a recent good example. I am also not a stranger to nitpicking over pennies from time to time.
Molière’s humor sometimes borders on the grotesque, with scenes of beatings or vaudeville situations. But what makes Molière’s plays such classics, centuries after their publication, is primarily, it seems to me, their role as beneficial mirrors. They offer us absolution for our very human lapses.
This is a GPT supported translation of De qui Moliere se moque-t-il ?
tl;dr: I used GPT4 to generate modernized versions of Molière’s works. The goal is to facilitate access to these works which are omnipresent in the French school curriculum while vernacular French is evolving rapidly. By simplifying the structure of sentences, reducing the length of lines, and modernizing the vocabulary, we obtain simplified versions while maintaining the integrity of the original works.
LLMs (Large Language Models) like chatGPT have disrupted learning and teaching in just a few months.
It’s a known fact that LLMs hallucinate. An LLM does not know the subject it handles. The model can write a recipe but has never tasted chicken.
For example, when you ask chatGPT to summarize “Les Fourberies de Scapin” (Scapin the Schemer), you get a text that is certainly inventive but fundamentally false, seemingly containing pieces from “Le Médecin Malgré Lui.” (the Doctor despite himself):
“Scapin stands in the middle of the courtyard, in front of a large wooden crate, pretending to be a ‘doctor’ or a ‘sorcerer’ capable of curing the lovers’ woes.”
This will surely get you a zero, or F, from the French teacher.
Therefore, the relevant question is not the accuracy of the produced text but its plausibility. The strength of LLMs lies in their ability to generate text, not in aligning facts. Among the many use cases, these models can facilitate the understanding of classical texts by offering a simplified version of potentially challenging-to-understand texts.
Molière’s plays are ubiquitous in the French school curriculum. Molière is hailed as the embodiment of French genius in terms of theater, comedy, and also as a social critic whose modernity can never be questioned. His characters, Harpagon, Scapin, Sganarelle, are known to all French. His best lines are ingrained in the French psyche.
“One must eat to live, not live to eat,” “But what was he doing in that galley,” and many others.
The texts of these 17th-century works have already been adapted into French at some point. The raw texts from the 1660s would be quite challenging to understand today. However, this version is aging rapidly. It has become difficult for young (and not-so-young) generations to grasp. Comments on social media abound about the difficulty of understanding the texts and therefore the story.
I’m in 9th grade, and my French teacher asked me to read this book, but I didn’t understand anything.
Hence the idea of using chatGPT to translate and adapt Molière’s plays into modern French.
Our goal is to make these works easier to read and understand for a population that is increasingly less accustomed to reading books, and whose everyday French is rapidly diverging from the dusty standards of the French Academy.
Daring to mention the modernization or simplification of Molière’s texts instantly provokes a visceral and indignant rejection. The central accusation being the dumbing down and its corollary, the inexorable decline in educational standards. These arguments reek of “it was better in the past,” the anti-screen brigade, the agony of French in the face of English, and other fallacies about the education of yesteryears.
Simplifying the text would supposedly hasten a decline in students’ standards.
However, correlation (assumed) does not imply causation.
The Bible has been translated, Shakespeare has been adapted into contemporary English, so why can’t Molière be as well? The texts are not sacred. The approach is democratic. And the goal is clear: to facilitate access to classical theater plays.
Let’s clarify right away. This is not about summarizing the play, nor excessively simplifying the texts, and certainly not about making them sound youthful with supposed youth language.
Our goal, therefore, will be to simplify the structure of the sentences, shorten the lines and modernize the vocabulary.
We will preserve the plot, the characters, the line-by-line structure of the dialogue and, above all, the meaning of the original text.
And, of course, we won’t hesitate to keep the original text when it holds no particular difficulty.
Let’s take an example. In Act 1, Scene 1 of L’Avare (The Miser), Valère opens the play with these words:
« Hé quoi ! charmante Élise, vous devenez mélancolique, après les obligeantes assurances que vous avez eu la bonté de me donner de votre foi ? Je vous vois soupirer, hélas ! au milieu de ma joie ! Est-ce du regret, dites-moi, de m’avoir fait heureux ? et vous repentez-vous de cet engagement où mes feux ont pu vous contraindre ? »
“What! Lovely Élise, you are becoming melancholic, after the obliging assurances you had the kindness to give me of your faith? I see you sigh, alas! in the midst of my joy! Is it regret, tell me, for having made me happy? And do you repent of this commitment where my passions may have compelled you?”
It’s beautiful, flowery, and delightfully romantic!
The modern version reads:
« Pourquoi cette tristesse, Élise, après m’avoir assuré de ton amour ? Je te vois soupirer, est-ce du regret de m’avoir rendu heureux ? Regrettes-tu notre engagement ? »
or in English:
“Why this sadness, Élise, after assuring me of your love? I see you sigh, is it regret for making me happy? Do you regret our commitment?”
It’s less beautiful but way more straightforward.
In this example, we touch upon one of the main challenges of the exercise. The beauty of the original text, its style, rhythm, and tensions, all fall into the realm of flavor and music. There is poetry in these lines, even though the text is in prose. The modern version is much more plain in comparison, but it offers conciseness and clarity. What is lost in beauty is gained in efficiency.
With GPTs and LLMs, the prompt is everything. Without a prompt, there is no salvation. The prompt will dictate the quality of the result: form, format, style, and, most importantly, the preservation of meaning. We have worked with two models: GPT 3.5 and GPT 4 via the openAI API and tested numerous configurations and prompts.
The classic process of optimizing a machine learning model involves defining a metric to maximize by selecting the best model’s meta-parameters. This model must also be robust, meaning it performs consistently in the face of slight variations in input data. This iterative process allows testing multiple configurations and achieving the best possible results based on context, approach, and available data.
Such a process would give us a systematic approach to finding the best prompt. However, it remains challenging to implement in our text transformation context (Automatic Text Simplification (ATS)). This is for two reasons.
Firstly, the inherently random nature of LLMs makes the results inconsistent. The nature or quality of generated text varies depending on API request parameters, the prompt, and the input text to be modified. Even when setting the model’s temperature to zero and using an identical prompt, we cannot control the model’s response to incoming original text.
Secondly, complexity measures of a text are not suitable for our context of simplifying 17th-century text corpora.
We have used the LyngX library, which offers several psycholinguistic complexity metrics (DNT, IDT, …). Unfortunately, there appears to be no correlation between simplified lines and complexity scores obtained with these methods.
At this stage, building an automation for prompt selection for automatic text simplification seems to require more effort than we can afford. Our primary goal remains to quickly have publicly available modernized versions of Moliere’s plays.
For that reason we opted for a manual selection of prompts, models, and query parameters. In the end, after many trials, our prompts follow the format:
For example:
Rewrite the text in modern French:
- Basic vocabulary;
- Clear and short sentences;
- Reduce the paragraph length;
{text}
or
Write this text in simple and concise French style:
text:
{text}
where {text} is replaced by a line of dialogue, an entire scene, or an excerpt consisting of a series of lines of dialogue.
Translating line by line often results in loss of meaning since the model lacks awareness of context, or produces a result in narrative form ("Géronte speaks to his son and says this") instead of the dialogue ("Géronte: <the line>").
On the contrary, submitting each scene in its entirety leads to a reduction in the number of lines in the translated version.
This is one of the peculiarities of Molière’s texts: the ping-pong dialogues, consisting of a rapid exchange between two characters who repeat very similar lines. For example, in Act II, Scene 4 of Le Médecin Malgré Lui:
GÉRONTE: Vous donner de l'argent, Monsieur.
SGANARELLE: Je n'en prendrai pas, Monsieur.
GÉRONTE: Monsieur...
SGANARELLE: Point du tout.
GÉRONTE: Un petit moment.
SGANARELLE: En aucune façon.
GÉRONTE: De grâce!
SGANARELLE: Vous vous moquez.
GÉRONTE: Voilà qui est fait.
SGANARELLE: Je n'en ferai rien.
GÉRONTE: Hé!
which reads as:
GÉRONTE: To give you money, Sir.
SGANARELLE: I won't take any, Sir.
GÉRONTE: Sir...
SGANARELLE: Not at all.
GÉRONTE: Just a moment.
SGANARELLE: In no way.
GÉRONTE: Please!
SGANARELLE: You're joking.
GÉRONTE: There you go.
SGANARELLE: I won't do it.
GÉRONTE: Hey!
As the content varies little from one line to the next, the model, which has just been instructed to simplify the text, will reduce the number of lines. This leads to the loss of the valuable one-to-one equivalence between the original and modern versions.
We eventually opted for a middle ground between translating line by line and translating entire scenes, using a sliding window of 5, 10, 15 lines with an overlap of 2, 5, or 7 lines between queries. This provides the model with enough context to avoid the problems mentioned earlier.
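The sliding window itself can be sketched in a few lines of Python; scene_lines and the window / overlap values are illustrative:

``` python
def sliding_windows(lines, window=10, overlap=5):
    """Yield overlapping blocks of lines, one query per block."""
    step = window - overlap
    for start in range(0, max(len(lines) - overlap, 1), step):
        yield lines[start:start + window]

# scene_lines: the list of lines of dialogue of one scene
for block in sliding_windows(scene_lines, window=10, overlap=5):
    query_text = "\n".join(block)
    # ... submit query_text to the model with the simplification prompt
```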
However, this entails reviewing the different versions obtained to select the best one for each line in terms of the meaning discussed earlier. There is a degree of subjectivity in this selection process, which then becomes a traditional review and editing task.
In the end, for the play “Les Fourberies de Scapin” (771 lines), we kept ⅓ (232 lines) in their original version, manually rewrote 47 lines (6%). The remaining two-thirds were equally contributed by GPT 3.5 (224) and GPT 4 (268).
We achieve a modern version of the texts that fulfills all the previously stated objectives: simplification and shortening of sentences, refreshing vocabulary, phrases, and styles, switching from formal to informal language (tu/vous), all while preserving the meaning and maintaining a 1-1 equivalence.
The available plays currently include:
The texts are displayed in bilingual mode, with the revised text side by side with the original, allowing for comparison and reading of the modern version without losing the original version’s rhythmic and humorous qualities.
[2023 update] The course has been revamped and updated with new content. It’s an intro to NLP course with a focus on Embeddings. We cover what is now old tech (Word2Vec) but which is still very relevant in the age of attention and transformers. If you’re new to NLP this is a good way to start and get to work with SpaCy, NLTK and other standard NLP libraries.
[2021]
Intro to NLP course on Openclassrooms.
This could not have happened without OC’s amazing team, with a special shoutout to Alexandra. :)
The course covers basic BOW to static embeddings, glove style, with NLTK, Spacy and Gensim.
I tried to make the course more interesting and engaging by working on classic texts and funky song lyrics. Among other things we study the white rabbit in Alice in Wonderland, aliens in War of the Worlds and love and swords in Shakespeare.
Enjoy
Alexis
You work on a large dataset, let’s say over 1 GB. You do an analysis. And you want to share it so that other people can work on, reproduce or tweak your results.
Here are a few personal tips to make things easier for the poor schmuck / schmuckette who has to read your code.
Host your data on S3, Google Storage, Azure, Dropbox etc., whatever fits your mood as long as it can provide a unique URI.
Sharing datasets in an email, or in google drive is flaky and confusing. Drive is not the right place to host datasets. Space is limited, and access control can be hazy.
By hosting the dataset on the cloud:
When you share your notebook, the data is downloaded using this unique URI instead of
<this is my local path, don't forget to change it to your own local path>.
However, managing access permissions on specific items in the cloud can be a real pain.
By the way, pd.read_csv natively reads gzipped csv files :). Just add the compression='gzip' parameter:

df = pd.read_csv('S3_bucket/sample.csv.gz', compression='gzip')
Your script may be efficient, bug free, superbly commented etc., but still end up working only on your platform. I’ve had the case recently of a friend, not particularly python savvy, trying to open a 1.9 Gb text file on a windows machine and being faced with abstruse unicode errors. He was stuck. However, the same script worked like a charm on my mac.
So hosting the notebook on Google Colab will go a long way to make it reproducible without undue efforts.
If the data is large and running the whole notebook takes forever, it’s always a good idea to implement 2 modes, a sandbox one and a production one, switched by a simple flag. Something as simple as:

MODE = "demo"  # or "prod"

if MODE == 'demo':
    # subsample the large initial dataset
    # e.g. df = df.sample(frac=0.01)
    pass
else:
    # basically do nothing
    pass

# then the rest of the code and results etc ...
This way the recipient of your analysis can run the whole script quickly and start playing with the parameters and results right away instead of having to wait for loops or apply lambdas to finish.
You can choose whichever mode as the default one depending on your audience.
Following the single responsibility principle is an excellent practice when working with jupyter notebooks.
The single responsibility principle states that every module, class, function should have responsibility over a single part of the functionality provided by the script. wikipedia single responsibility principle
Applied to notebooks, it means that each cell should do one thing only. This allows the user to insert other cells to explore the resulting objects and data. Very useful.
The ruby community is very strong on that single responsibility principle with excellent results in terms of bug reduction, readability and maintainability of the code.
The more structure the better.
The main drawback of Jupyter notebooks is the lost-state problem, where a cell depends on previous runs of other cells which may have already been modified. So making sure everything works as intended, from importing the libraries to the final results, is essential before sharing.
I find this optional, but that’s just my ingrained laziness. See this post by Jake VanderPlas for more on the subject: Installing Python Packages from a Jupyter Notebook.
Can’t emphasize this one enough. I often spend significant amounts of time looking for synonyms that will convey precisely the true nature of an important variable to a reader, myself included. The time gained by abbreviating any variable will be lost a thousand fold later on when trying to figure out what the variable stands for.
Comments should focus on explaining the choices made in terms of methods and parameters. Not simply rephrasing the code.
Google has a longer, more precise list of excellent best practices when working on Google Colab.
A good paper on Ten simple rules for writing and sharing computational analyses in Jupyter Notebooks
Please drop me a line on twitter @alexip if you’d like to add something or comment an item.
Cheers!
Design Effective Statistical Models to Understand Your Data
In this course I explore linear, logistic and polynomial regression with hands on exercises, real-world use-cases and non trivial datasets.
Regression is the mother of all statistical models. Simple, flexible and highly interpretable.
To shine a new light on such a venerable topic, I decided to bridge the gap between the classic statistical approach and the machine learning one.
Regression in the statistical sense aims at modeling the inner dynamics of a dataset. The method uses multiple statistical tests to validate the relevance and reliability of the observations and results.
On the other hand, the machine learning approach strives to build models that perform well on previously unseen data. We no longer care about p-values, null hypothesis or statistical tests but focus instead on the performance of the trained model on new data.
The great thing is that we can use the same simple modeling techniques, linear regression to illustrate both approaches. Bridging the gap between statistical modeling and machine learning.
Here’s the outline of the course:
I - Understand the Fundamentals of Statistical Modeling
II - Build Linear Regression Models
III - Build Generalized Linear Models
IV - Build Resilient Predictive Models
The course notebooks are available on github: Notebooks for the course Design Statistical Models on OpenClassrooms
I had the chance of teaching a 2-week session of predictive analytics in Morocco in the fall of 2018 with students of the Emines school of industrial management in their fourth and final year.
In 2018, the renowned École Polytechnique, Mohammed VI Polytechnic University and the Foundation of École Polytechnique launched a Chair in “Data Science and Industrial Processes” in Morocco which I inaugurated on the amazing UM6P campus of the Emines school of industrial management located in Benguerir Morocco. This campus was designed by the famous architect Ricardo Bofill in 2011. It is quite a sight to see and watch.
The desired outcome of this 2-week course was to give the students a deep and practical understanding of data science and machine-learning in terms of scope and tools with a focus on building a hands-on data-driven experience.
For the students to be able to work on relevant data science problems within just two weeks of teaching is a real challenge. Data science is a multi-faceted activity and several domains have to be taught in parallel: software engineering in order to build robust and reproducible scripts, data science techniques and workflows to generate reliable results, probabilistic theory underlying machine-learning algorithms to go beyond the simple copy-pasting usage of popular libraries, and most of all, an engineer can-do attitude required to tackle real world problems. Data science demands pragmatism, resilience, creativity from students and experienced data scientists alike.
The course was organized to leave a large place to interactions and Q&A between the students and the teacher and to limit the amount of time dedicated to slide-based lectures. Every time a new machine-learning concept or modeling algorithm was introduced, it was followed by lab work applied to datasets of increasing complexity and scope. The goal was to expose the students to real-world issues that data scientists often face in predictive analytics projects. The course slides, datasets and scripts are available on github.
In parallel, the students participated in the Iowa Housing Kaggle competition to practice their newly acquired skills in a challenging project. Their enthusiasm was impressive as exclamations of joy broke silent phases of deep focus whenever someone reached a new high score.
To have had the privilege of launching this first session of the DSIP chair with such accomplished and convivial students and engineers was a fantastic experience and a great honor. Thanks to the excellence and enthusiasm of UM6P Emines students, the learning outcomes have been fully reached.
I will have the pleasure to close this year’s data science course in March 2019 by teaching a week-long course this time focused on probabilistic programming and PyMC3.
This post is about leveraging on demand capabilities of costly virtual instances on the Google Cloud Engine using startup scripts.
Here’s the situation: You’re working on some large dataset, and you feel the irresistible urge to release the Deep Learning beast on your models with VMs armed to the teeth with GPUs.
Since your local Macbook steps into the twilight zone every time you launch Keras, you decide to spin up a dragster style, GPU-powered VM on the Google Platform, AWS or Azure. Once the VM is ready, which in truth may take several days if it’s your first encounter with the CUDA Toolkit, you ssh into the VM and start working on your data and your models.
After a few hours of work, you’re still working on your scripts, cleaning up the data, training models, evaluating, and on and on. Time passes on while the earth pursues its never ending spin in the interplanetary void. When you realize your brain has as much jitsu left as a greek yogurt, you decide to call it a day and give the whole thing a rest. And of course, you sometimes forget to stop the instance. Your cash reserves leak out cent after cent, dollar/euro/pound after dollar/euro/pound throughout the night.
All these hours do add up. And at the end of the month you realize GPUs are way more expensive than you ever imagined. But hey, Deep Learning is really fun. Can’t stop now. Just need a few more hours. After all nobody really understands how these neural networks work, do they? And you really need to practice to be able to call yourself a Deep Learning expert. Please, just a few more hours of GPUs. Just 30 minutes, … I swear, … come on!
So what’s a data scientist to do?
One solution is to go back to random forests and SVMs and give up on the whole deep learning thing. After all, as Vladimir Vapnik says, Deep Learning is just brute force training with a whole lot of data.
The other solution is to make the most of the on demand promise in cloud computing.
The whole promise of cloud computing is that you can spin up and release resources as needed. Way back in the early 2000s that mostly meant being able to add servers on the fly to support your traffic exploding when BoingBoing or Gizmodo suddenly put your startup on their front page. But for machine learning the same on demand concept is relevant when extra high computing power is needed. When working with Deep Learning, most of the mundane work of data cleaning and shaping can probably be carried out on your local machine or a low level VM. The only time GPU enabled VMs are truly needed is to train the Deep Learning models.
Which means that a resource-conservative workflow should look like this: 1) explore, clean and shape the data locally, 2) write and test the training script on a small sample of the data locally, 3) spin up the GPU-powered VM and run the training script on it, 4) retrieve the trained models and results, then shut down or delete the VM.
Here, Local can be replaced by a smaller, less powerful VM running on CPUs and not GPUs.
With this workflow, you only spend money on expensive cloud resources on steps 3) and 4), potentially saving you significant amounts of cash at the end of the day. If you can manage to shutdown or even delete the VM once the script has finished running then you won’t even run the risk of leaving it running all through the night! Brilliant!
So in order to limit our resource usage, we need to be able to create a VM on the fly, run a script on it and terminate the VM once the script is done.
And we should do that (create, run, terminate) every time we want to test a new version of the dataset, a new DL architecture or new parameters. We could even potentially run several trials in parallel.
Let’s go!
I assume here that you already have created a VM and installed everything needed to run your scripts. Things like the conda distribution, scikit-learn, keras, GPUs and so forth. See How to setup a VM for data science on GCP and Launch a GPU-backed Google Compute Engine instance for more details (also this one) to install the CUDA Toolkit and the cuDNN library.
For those in a hurry, here’s an example of the command line for creating a preemptible VM on Google Cloud Engine (Ubuntu 17.10 with 50gb disk space). Preemptible VMs are temporary but way way cheaper than non preemptible ones. Great for research work not so much for production APIs.
gcloud beta compute --project "<project_name>" instances create "<instance_name>" --zone "us-east1-c" \
--machine-type "n1-standard-1" --subnet "default" --no-restart-on-failure --maintenance-policy "TERMINATE" \
--preemptible --service-account "<your_service_account>" --image "ubuntu-1710-artful-v20180126" \
--image-project "ubuntu-os-cloud" --boot-disk-size "50" --no-boot-disk-auto-delete \
--boot-disk-type "pd-standard" --boot-disk-device-name "<disk_name>"
In the above and quite longish command, the non obvious but important flags are:

- --no-boot-disk-auto-delete: the disk will not be deleted when the instance is deleted
- --preemptible: makes the VM temporary and saves money
- --service-account "<your_service_account>": a service account is used by your VM to interact with other Google Cloud Platform APIs. The default service account is identifiable with the email [PROJECT_NUMBER]-compute@developer.gserviceaccount.com where the [PROJECT_NUMBER] can be found on your project dashboard.

So let’s assume that you have created your initial disk and deleted the associated VM. The disk is now free to be used to create a new VM.
The following command line creates a preemptible VM from that disk
gcloud compute instances create <instance name> --disk name=<disk name>,boot=yes --preemptible
Ok so we can create a VM on the fly based on that disk. Now we want to run a script on that VM. Let’s say a python script.
The simplest way is to use the SSH command with the --command flag:
gcloud compute ssh <instance name> \
--command '{Absolute path to }/python {absolute path to}<the script>.py'
and lo and behold, that command will display the output of the remote python script on your local terminal. Try for instance gcloud compute ssh <instance name> --command 'ls -al'.
A more sophisticated way is to automatically run a script (shell, python, R, whatever…) when the VM is created by using startup scripts
For instance, we could want to run the following shell script on creation
#! /bin/bash
sudo apt-get update
printf '%s %s Some log message \n' $(date +%Y-%m-%d) $(date +%H:%M:%S) >> '{absolute_path}/startup_script.log'
# add and activate the github keys
eval "$(ssh-agent -s)"
ssh-add {path to github key}
# log script start
cd {path to application folder}
# git update application
git pull origin master
# run script
{path to }/python {absolute path to}/<the script>.py
That script activates the github keys, updates the application from github, and runs the script. Pretty neat. Creating the VM and making sure the VM runs that script is just an extension of the above VM creation command line. But first you need to make the script available to the VM by uploading it to Google Storage with gsutil:
gsutil cp {local}/<shell script> gs://{bucket name}/
For more gsutil examples, check my post on https://alexisperrier.com/gcp/2018/01/01/google-storage-gsutil.html.
And now we can create the VM and have the script run on start. The following command line will attach the script to the VM as a startup script. Every time the VM is started the script will be executed.
gcloud compute instances create <instance name> --disk name=<disk name>,boot=yes --preemptible \
--scopes storage-ro \
--metadata startup-script-url=gs://{bucket name}/<shell script>
So we are now able to create a VM on the fly from an existing disk, attach a startup script hosted on Google Storage, and have that script run automatically when the VM starts.
We just need to find a way to terminate that VM once the script has finished running. This is where things get tricky mainly because it can be difficult to know with certainty when the script has finished running.
The simplest solution is to add the following shutdown line at the end of the startup script
shutdown -h now
This forces the VM to stop once the script is done. The VM still exists and is not deleted. In terms of pricing, this might just be enough as the VM is not billed when it’s idle. From the google doc: Instances that are in a TERMINATED state are not charged. The associated disks, IPs, and other resources are still billed but not the VM.
It’s also possible to not only shut down the VM but also delete it from within the VM. In other words, having the VM commit seppuku. I mean, especially if the script fails to run, the VM should take the blame for having failed its master (you) and rightly end itself; makes sense, in a way, maybe. The following code is from this thread. It can run from within the VM. It requires a bit of tuning of the service account scopes for it to have the proper permissions and actually work.
gcloud compute instances delete $myName --zone=$myZone --quiet
where the name of the VM comes from myName=$(hostname)
and the zone from
# Get the zone
zoneMetadata=$(curl "https://metadata.google.internal/computeMetadata/v1/instance/zone" -H "Metadata-Flavor:Google")
# Split on / and get the 4th element to get the actual zone name
IFS=$'/'
zoneMetadataSplit=($zoneMetadata)
myZone="${zoneMetadataSplit[3]}"
The only problem with that approach (stopping or deleting from the startup script) is that every time you start the VM, well, it will run the script and shut down, preventing you from ssh’ing into it to check the logs, make some modifications or inspect the results. Bummer!
The other solution would be to have your main model training script write a status update to a file or an external storage bucket or database, and from your local machine, regularly check that status before deciding to send the shutdown command.
In the end, although a bit hacky, I think this is the best solution. Your whole workflow now becomes: create the VM with its startup script, let the training script report its status to a storage bucket, poll that status from your local machine, and delete the VM once the script reports that it is done.
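One way to implement that local conductor is a small polling loop. The bucket path, status file convention and instance name below are hypothetical:

``` python
import subprocess
import time

STATUS_URI = "gs://my-bucket/training_status.txt"   # written by the training script
INSTANCE, ZONE = "my-gpu-vm", "us-east1-c"

while True:
    status = subprocess.run(
        ["gsutil", "cat", STATUS_URI], capture_output=True, text=True
    ).stdout.strip()
    if status == "DONE":
        # the training script reported completion: delete the VM
        subprocess.run(
            ["gcloud", "compute", "instances", "delete", INSTANCE,
             "--zone", ZONE, "--quiet"]
        )
        break
    time.sleep(300)   # check every 5 minutes
```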
Note: Instead of using a startup script, you could also simply run the python model code using the gcloud compute ssh --command '<python script>' command in conjunction with the local conductor shutdown script. But I feel the use of startup scripts basically dedicates the VM to that usage and that usage only. Similarly to writing good quality code, where a method or function should do only one thing at a time, my feeling is that a VM should be used for one goal and one goal only. After all you can have as many VMs as you like as long as they are idle. Disk prices are low and usually not a problem.
The whole point behind using cloud resources is to leverage their at-will / on-demand capabilities to reduce costs. Doing so requires using startup scripts and some external monitoring to shut down the VM once the task has been completed.
This is of course just one way of doing things.
Please let me know in the comments, how YOU manage costly on demand instances. And thanks for reading this post until the end.