Is ChatGPT trained only in English, as Mélenchon claims?

Jean-Luc Mélenchon questioned the way ChatGPT was trained, saying it was based only on English texts. The technical reality is different. Although the corpus consists mainly of English text, other languages were also used to develop the chatbot.

It was an innocuous tweet, posted on November 5, 2023 on Jean-Luc Mélenchon’s X account (formerly Twitter). One tweet among many, it transcribed several notable remarks by the leader of La France Insoumise, who was holding a “Do Better” conference in Strasbourg where various topics were discussed.

It was there that Mr. Mélenchon turned to technology news, discussing how ChatGPT works. ChatGPT is the conversational agent that everyone has been talking about for a year, developed by the American laboratory OpenAI. “The problem with ChatGPT is that this technology only thinks and learns in English,” declared Mr. Mélenchon.

The tweet has since disappeared, but it could still be found in Google’s cache on November 6. Better still, Jean-Luc Mélenchon’s remarks about the chatbot can be heard in the full video of the conference. At that point, he was speaking about the impact of digital technology on the behavior and thinking patterns of individuals.

“ChatGPT has a big problem”

Jean-Luc Mélenchon’s exact comment about ChatGPT is as follows: “ChatGPT has a big problem. It speaks English. It thinks, if I dare say so, in English. We teach it in English and then it translates.” The left-wing leader’s remarks were aimed at alerting the public to a tool that shapes thought according to an English-speaking view of the world.

“Every language has its own way of seeing reality. […] Vocabulary describes reality or it ceases to exist.” Hence the value of encouraging language models capable of working in different languages, so as to capture all the nuances of reality in terms of grammar, syntax and representation of the world.

To underline this point, Jean-Luc Mélenchon repeated the claim that the Sami or the Inuit have dozens of words to describe snow: “Twenty-five names to describe the state of the snow when you’re a Lapp,” he told the audience. It is a claim that also circulates for other words, but one that is nevertheless disputed: the Scots are said to do even “worse” with the Scots language.

A snowdrop. // Source: Pixabay (cropped photo)

Regardless, the topic raises the question of how dependent the language model underlying ChatGPT (GPT-3.5 for the free version, GPT-4 for the paid version) is on English. It is a problem that OpenAI does not ignore. The US lab mentions it across its pages, notably in the help section of its website.

“This model is geared towards Western perspectives and provides better results in English,” writes OpenAI on the one hand, and “ChatGPT is not free from prejudices and stereotypes” on the other. “Bias mitigation is an area of research for us, and we welcome feedback on how to improve it.”

“This model is geared towards Western perspectives and works best in English”


The laboratory acknowledges that “the models are optimized for use in English” but, it adds, “many of them are robust enough to produce good results in various languages”. This can be seen in French: the chatbot’s phrasing gives the impression of a natural exchange.

A model trained mainly in English, but not exclusively

It is accurate to say that the language models behind ChatGPT are trained on large volumes of text from the Internet, notably from sites like Reddit and Wikipedia. It is also fair to say that this corpus is mainly written in English, the most common language on the Internet. However, English text is not the only resource used for ChatGPT.

OpenAI explains this in a March 14 article on how GPT-4 is helping to preserve Iceland’s language. There, the laboratory confirms that “most of the model’s training data is in English and other major languages”. The details of these languages are not given.

According to From Dictionary to Internet, Development in Language, by linguist Michaël Abecassis, three languages dominate the Internet: English, Chinese, and Spanish. In smaller proportions, we also find Japanese, Portuguese, German, Arabic, French, Russian and Korean. In 2012, French accounted for about 3% of the Internet.

English is the most common language on the Internet. //Source: Mozilla

As a result, according to OpenAI, ChatGPT does not have “the same skills or understanding of smaller languages”. However, “the models improve over time”. GPT is in its fourth generation, or its fifth if we count GPT-3.5 separately, the model currently used in the free version of the chatbot.

A critical view that aims to be enlightened

Before the tweet disappeared, many Internet users tried to correct Jean-Luc Mélenchon’s comments or, at least, to spell out the idea he was probably trying to express: since ChatGPT is mainly trained in English, it primarily offers a certain view of the world, seen through a pair of English-speaking glasses.

This was the case for Internet users such as ToineSayan, knowledge, Ari Coutts, Ouranosmk, or a researcher at Inria (National Institute for Research in Digital Science and Technology). Etienne Klein, a physicist and philosopher of science, also weighed in with a tweet of his own, explaining that some of his works in French were used to train ChatGPT.

Beyond Jean-Luc Mélenchon’s inaccuracy about how ChatGPT was designed, which earned him plenty of reactions on social networks, the tweet gives an incomplete account of his remarks. On this subject, the leader of La France Insoumise invites the public to keep a critical eye on these tools and on digital technology in general.

“Let’s see how it works,” he told his audience, because “this enormous digital culture that we have access to is changing the human condition” and also “the way you use it will change your brain”. Clearly, “you should always have a critical relationship with it, but never a fearful one”. A critical view, certainly, but one that aims to be enlightened.

