Madagascar has two official languages: French, owing to the country’s former status as a colony, and Malagasy, an indigenous language spoken by the majority of its citizens. Growing up in Madagascar, Nathanaël Rakotonirina was required to speak French in school, but defaulted to Malagasy at home. These days, as a 26-year-old student researching neural nets, he is immersed in large language models, the generative artificial-intelligence technology underlying ChatGPT. If he wants to get the best performance out of these AI models, he has to converse in English.
Large language models, or LLMs, are not as adept in other languages, particularly overlooked ones such as Malagasy. “I think Malagasy was virtually not considered in any language model,” Mr. Rakotonirina said. GPT-4, which was released by OpenAI earlier this year, is an improvement, but the model’s Malagasy fluency still does not compare with its English.
As generative AI has swept the globe, most of the world’s 7,000 spoken languages appear to be an afterthought. The field of natural language processing, which refers to how computers understand language, has historically focused on about 20 of them. Low-resource languages, in contrast, are underrepresented, with fewer datasets for developers to work from.
As for LLMs, these models are built in part with data scraped from the internet, and nearly 64 per cent of websites are in English, despite the fact that only 26 per cent of internet users speak the language.
“These are English-first models. They don’t do well in other languages, and this reflects where they’ve been built around the world,” said Sara Hooker, the head of Cohere For AI, a non-profit research lab affiliated with Toronto-based artificial-intelligence company Cohere. The leading developers of LLMs to date – OpenAI, Google, Meta Platforms Inc., Anthropic and Cohere – are headquartered in North America.
Earlier this year, Ms. Hooker embarked on an effort to build an LLM called Aya that can converse in 101 languages, including Somali, Yoruba, Japanese, Malay, Telugu and Vietnamese. By early next year, Cohere For AI aims to release the model and the language dataset for anyone to use.
For Ms. Hooker, a former AI researcher at Google, the Aya project is not just about language representation, but a call to arms for open-source research. Artificial intelligence, like all sciences, has benefited from the free exchange of resources, methods and findings. But competitive pressures have pushed some AI companies to clam up. “Aya is a commitment to open source from the get-go, and to do science out in the open,” she said.
More than 1,000 people have been involved with Aya in some way, with a core team of about 80. Many contributors are independent researchers and computer-science students. Mr. Rakotonirina, who is pursuing a PhD at Pompeu Fabra University in Barcelona, joined Aya as a language ambassador for Malagasy earlier this year. “This is going to be a big jump,” he said, “and I think it’s going to exceed GPT-4 for these low-resource languages.”
In order to “learn,” the Aya model requires thousands and thousands of text examples in various languages. Existing multilingual datasets are generally poorly translated and not well suited to AI model training, Ms. Hooker said. To help improve them, volunteers can rate the quality of text examples and make edits, checking for coherence and grammar. Volunteers also contribute new examples in the form of a question and answer, and are encouraged to submit 4,000 unique pairs per language.
In his role, Mr. Rakotonirina has rounded up other Malagasy speakers, and his top 20 contributors are primarily former classmates from the Université d’Antananarivo. He also contributes Malagasy examples himself. Sometimes he writes whatever pops into his head, but he usually consults textbooks, exam questions or Wikipedia, hopping to the page about politics in Madagascar to extract a question and then providing a lengthy answer.
He’s hoping the completed dataset will be useful for developers, particularly for speech recognition applications. That’s important in a country where a large number of people are not considered literate. According to the World Bank, the literacy rate in Madagascar is 78.8 per cent for men and 75.8 per cent for women. “Most of these people could interact with the model just by talking,” Mr. Rakotonirina said, “but people cannot really build something if there are no datasets.”
Literacy gaps are also a motivator for Mouhamadane Mboup, a language ambassador for Wolof, which is spoken by about 12.4 million people. Many native speakers are in Senegal, where he lives. “It’s very important for our community to be part of AI, and not be behind in technology,” he said.
Mr. Mboup has rounded up 22 Wolof contributors whom he nudges through WhatsApp. “I tell them about the importance of the project, and all the opportunities we can gain,” he said. On weekends, he and his fellow Wolof ambassador hang out at a restaurant and spend an entire day writing Wolof text for the dataset.
Telugu language ambassador Surya Guthikonda, who lives in India, has found that some chatbots are decent in his native tongue, but still fall behind English capabilities. Plus, the models lack an understanding of the cultural nuances of Telugu. A poetry genre known as Sathakam was a large part of his education growing up, he said, but knowledge of the topic is noticeably sparse in LLMs. “In Aya, we made sure to include as many Sathakams as we could find,” he said.
Text-to-speech applications powered by AI are “gibberish” when it comes to Yoruba, said Olanrewaju Samuel, an Aya volunteer from Nigeria and a master’s student at the University of Toronto. Many words in Yoruba, a language spoken by more than 20 million people in Nigeria, Benin and Togo, can have multiple meanings. To take one example, “oko” can refer to husband, vehicle or hoe depending on the tone used by the speaker and the diacritical marks applied in written form. That level of detail is missing from many applications, Mr. Samuel said. “The problem is most of them don’t collaborate with native speakers,” he said, referring to other developers.
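To see why the diacritical marks matter to software, consider a rough sketch of what can happen when a text pipeline strips accents, a common clean-up step for English-centric data. The three spellings below are an assumption based on standard Yoruba orthography and do not come from Mr. Samuel or the Aya project. The function name `strip_marks` is likewise hypothetical.

```python
import unicodedata


def strip_marks(word: str) -> str:
    """Remove combining marks (tone accents, under-dots) from a word."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    # Unicode category "Mn" covers non-spacing marks such as accents and dots.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Assumed spellings (not from the article), written as escapes so the
# code points are unambiguous: U+1ECD is "o with dot below".
WORDS = {
    "\u1ecdk\u1ecd": "husband",        # no tone mark on the final vowel
    "\u1ecdk\u1ecd\u0300": "vehicle",  # grave accent on the final vowel
    "\u1ecdk\u1ecd\u0301": "hoe",      # acute accent on the final vowel
}

# Group the meanings by what is left after the marks are stripped.
collapsed = {}
for word, gloss in WORDS.items():
    collapsed.setdefault(strip_marks(word), []).append(gloss)

print(collapsed)  # {'oko': ['husband', 'vehicle', 'hoe']}
```

Once the marks are gone, three distinct words become the same string, and nothing downstream can tell them apart. That is only one possible way such detail gets lost, but it illustrates why input from native speakers matters when preparing text.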
The lack of representation for some languages also imperils the safety features developers have attempted to build in. OpenAI and its ilk try to ensure chatbots don’t provide harmful information or make racist or discriminatory comments (though language models still suffer from these issues to some extent). Ask ChatGPT in English how to build a bomb, and the application will likely reply, “Sorry, but I can’t assist with that.”
A study from Brown University published in October found these safety measures can be easily bypassed if the request is first translated into a low-resource language. GPT-4, the language model that powers ChatGPT, refused more than 99 per cent of the researchers’ attempts to elicit harmful responses in English, but complied about 79 per cent of the time when it came to other languages such as Zulu or Scots Gaelic, including requests related to financial scams, racial and gender discrimination, and eating disorders.
The authors called the results “alarming,” illustrating the “harms of unequal valuation and unfair treatment of languages” in AI. (The lead researcher also works on Aya as part of its responsible deployment team.)
Ms. Hooker at Cohere For AI hopes that Aya will serve as a wake-up call for industry and governments to consider language representation, and support open-source development. Her commitment to open source has made the entire project a somewhat gruelling endeavour. “It has been brutal,” she said, adding that she hasn’t been able to find ambassadors for some European languages. “Part of it is that I suspect Europeans consider their languages well resourced, when in fact they aren’t,” she said.
Rather than trying to co-ordinate multiple teams and volunteers around the world to build Aya, it would have been far easier to appoint a small handful of engineers. But Ms. Hooker believes in open science, and is concerned about the creeping secrecy and centralization of the AI industry. “It requires a conversation at an international level about how we put resources in place for breakthroughs to happen outside of these small pockets that I can name on less than 10 fingers,” she said. “Aya is as much a protest against that as it is a protest against the state of technological progress for other languages.”