This article is from the November 2023 issue of Sciences et Avenir – La Recherche n°921.
“I can recognize and generate text in hundreds of languages,” ChatGPT responds when asked about its language capabilities. But the artificial intelligence (AI) adds: “My ability and accuracy vary greatly from language to language.” In fact, researchers at the University of the Basque Country (Spain) have shown that large language models perform better when the questions put to them (“prompts”) are in English.
The researchers evaluated seven variants of XGLM and LLaMA, two technologies from the artificial intelligence lab of Meta (Facebook’s parent company), on datasets such as XCOPA (logical reasoning), PAWS-X (paraphrase recognition) or MGSM (mathematics). The prompts were written in each of the languages covered by each dataset. The answers, however, were generated in two ways: in one case the model answered the prompt directly in the original language; in the other, the prompt was first translated automatically into English by the model itself, which then answered in English.
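The two answering routes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' code: the names `Model`, `answer_directly`, `answer_via_self_translation`, `accuracy` and `toy_model`, as well as the wording of the translation instruction, are hypothetical and invented for this example. A real evaluation would call an actual language model instead of the toy stand-in.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for a language model: text in, text out.
Model = Callable[[str], str]


def answer_directly(model: Model, prompt: str) -> str:
    """Route 1: the model answers the prompt in its original language."""
    return model(prompt)


def answer_via_self_translation(model: Model, prompt: str) -> str:
    """Route 2: the same model first translates the prompt into English,
    then answers the English version."""
    # The instruction wording below is an assumption made for this sketch.
    translation_request = f"Translate to English: {prompt}\nEnglish:"
    english_prompt = model(translation_request).strip()
    return model(english_prompt)


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Share of predictions that exactly match the reference answers."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def toy_model(text: str) -> str:
    """Toy stub, not a real LLM: it can translate one Basque sentence
    and only knows the answer to its English version."""
    if text.startswith("Translate to English:"):
        return " How much is two plus two?"
    if text == "How much is two plus two?":
        return "4"
    return "?"


if __name__ == "__main__":
    basque_prompt = "Zenbat da bi gehi bi?"  # "How much is two plus two?"
    print("direct:", answer_directly(toy_model, basque_prompt))
    print("self-translated:", answer_via_self_translation(toy_model, basque_prompt))
```

Comparing the accuracy of the two routes over a whole dataset, language by language, is the kind of measurement the study reports. The toy model is deliberately rigged to fail on the direct route, so it says nothing about real model behavior.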
The comparison is unambiguous: “The models performed better in English on all tasks,” notes Julen Etxaniz, a language processing specialist and co-author of the study. “This even makes up for the translation errors introduced when the prompts are converted from the original language into English.” The main explanation, according to the researcher: these technologies are trained mostly on English-language texts. “For the most multilingual models (XGLM and Bloom), more than 30% of the training data is in English,” adds Julen Etxaniz.
The widely used English Wikipedia
OpenAI, the company that developed ChatGPT, uses the entire English-language Wikipedia, a large corpus of web pages called Common Crawl, databases of digitized books, and online forums. For the Bard chatbot (Google), 50% of the data comes from forums and 12.5% from the English-language Wikipedia. But the details are not known. “The rare information provided by the developers of this type of model is often limited to mentioning ‘data obtained from the Internet’,” notes Giada Pistilli of Hugging Face, a company specializing in artificial intelligence. “Given the dominance of the English language on the Internet, it is reasonable to conclude that the majority of their training data is in English.” Hence the better phrasing and greater relevance of the answers given in this language.
But training is not the only explanation: evaluation tests also play a significant role. “They make it possible to measure a model’s ability to respond effectively to different scenarios and performance requirements. However, they are essentially designed in English,” explains Giada Pistilli. On top of this, the models undergo reinforcement learning based on human ratings (to reduce errors, biases, shocking content, etc.). These operations, too, are often carried out in English. Describing this work as applied to GPT-3 in a 2022 article, OpenAI acknowledges that at least 96% of this rating data is in English.
Risk of error in other languages
And the effects are not limited to performance. “Because ChatGPT is mostly trained on English text taken from the Internet, we notice that its responses are more aligned with American culture than with others,” warns Yong Gao, co-author of a study on the subject at the University of Copenhagen (Denmark). “In concrete terms, asking the model about China in Chinese will give wrong answers.”
Giada Pistilli made a similar observation about GPT-3 two years ago, and underscores the danger that these tools will spread a highly Americanized view of the world. “The scientific and technological community is now aware of this problem,” she says, “and efforts, albeit timid, are being made to remedy it.”
LLM (Large Language Model)
GPT-4 (the technology behind ChatGPT), LaMDA (behind Bard, Google), LLaMA (Meta) and Bloom (an open source technology) are what are known as large language models, or LLMs. The term refers both to the huge amount of text data used for their machine learning and to the number of model parameters, i.e. the variables that are weighted and adjusted to improve the algorithm’s performance.
LLMs are based on a neural network architecture introduced in 2017 and originally designed for machine translation: transformers. They can process data whose meaning depends on the order in which it is arranged, typically the words in a sentence. But the algorithm does not have to work through this data in order and can carry out many operations in parallel, which makes it fast.
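The idea that a transformer handles ordered data without reading it in order can be illustrated with a minimal sketch of its core operation, self-attention. This is a heavily simplified illustration under assumptions, not how any production model is written: the function names are hypothetical, the learned projection matrices of a real transformer are omitted (queries, keys and values are simply the inputs themselves), and the tiny two-dimensional "word vectors" are made up. Word order is supplied by a sinusoidal position signal of the kind used in the original 2017 design.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values.
    Each output depends on all inputs, but no output depends on another
    output, so every position could be computed at the same time."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:  # iterations are independent of one another
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in vectors])
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return outputs


def positional_encoding(position, dim):
    """Sinusoidal position signal: sine on even indices, cosine on odd ones."""
    return [
        math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def add_positions(vectors):
    """Add each word's position signal to its vector so that order matters."""
    dim = len(vectors[0])
    return [
        [x + p for x, p in zip(vec, positional_encoding(pos, dim))]
        for pos, vec in enumerate(vectors)
    ]


if __name__ == "__main__":
    words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up word vectors
    print(self_attention(add_positions(words)))
```

Without the position signal, reversing the sentence merely reverses the outputs: attention on its own is blind to order. With the signal added, the same words in a different order give different results, which is how the model keeps track of word order while still computing all positions independently.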