Outline:
Speech technologies have touched many areas of our lives (mobile phones, smart speakers, cars, voice search, etc.), but they’ll feel foreign until we adapt them to the local communities using them daily.
How do we do that? By tackling accents and dialects across a country’s regions and adjusting the voice technology to be as seamless as talking to a friend. It’s a massive endeavor that requires real cultural sensitivity. A good start is to address the biases present throughout the whole pipeline, from the people creating the datasets to the model’s outputs, and to bring diverse teams into each of these stages to come up with the best solutions for users.
The round table on the last day of this year’s Deep Voice convention in Paris dove into these challenges with experts from academia and the private sector. Moderated by our CEO, Carl Robinson, the panel featured Mathieu Avanzi (Full Professor at the Université de Neuchâtel), Sanchit Gandhi (Machine Learning Research Engineer at Hugging Face), and Maxim Serebryakov (CEO at Sanas). Here’s what we learned.
Accent and dialect may seem synonymous, but an accent is actually part of a larger set of features known as a dialect. Mathieu makes the distinction between the two, referencing his own language, French.
A dialect comprises features including pronunciation, grammar, and lexicon, among others. Researchers take geographical features into account to determine the limits of a dialect, and these regions can extend beyond national borders (as with Canadian French). Every language tends to have a standard variety (e.g. Parisian French), usually the one spoken on the national TV news.
An accent is a way of pronouncing a language, and its study analyzes how vowels, consonants, prosody, and melody work together to determine its limits. Mathieu gives a clear example: when you speak a non-native language (e.g. a French person speaking English), you use an accent borrowed from your native language. The main components of an accent (phonetics, pitch, and duration) are easily perceived in this language switch.
Specialized software like Sanas’ detects and understands different accents, which, according to Maxim, is the easy part. The real challenge lies in generating an accent in real time with low latency; this conversion is what makes the endeavor complicated.
The sheer number of accents and dialects makes it very hard to look at the whole picture.
Sanchit tells us that researchers tend to focus on one part of the problem, whether it’s looking at biases in the datasets, the models, the applications, or even in the people creating these datasets. When tackling biases, the further up the chain you go (closer to the dataset-creation stage), the more impact you have on the output. Having a diverse group of people working at these stages helps detect biases and make the necessary changes before it’s too late.
The more data you can gather, the better the models you can train, says Maxim. But he acknowledges that it becomes a titanic endeavor with languages that have so many dialects and accents. For instance, Hindi has 615 million speakers, but that’s just the surface of the issue. Once you travel throughout the country, you find dozens of Hindi dialects and accents, and the differences can be very significant. A speech engine therefore can’t perform well on all speakers, even though they all speak the same language.
To mitigate bias and include a diverse community, tools have to take into account many features, and not just the most prominent ones like vowel pronunciations.
In this regard, Mathieu and his team created a comprehensive guide to the French language with their app “Le français de nos régions” (French from our regions). By asking people how they pronounced certain words in a game-like scenario, he collected 9,000 recordings of the same words, which helped him understand the borders between dialects across the French-speaking regions of the world. This is a first step toward detecting French accents and dialects automatically: in future interactions, a speech recognizer could determine these boundaries instead of relying on participants to report the information themselves (a simple illustration of the aggregation idea follows below).
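To make the idea concrete, here is a minimal, hypothetical sketch of how such crowdsourced answers could be aggregated: count which pronunciation variant dominates in each region, and the places where the dominant variant flips hint at a dialect boundary. The rows of data and the region names below are purely illustrative, not taken from the actual app.

```python
# Minimal sketch (illustrative data): aggregate crowdsourced pronunciation
# variants by region to see where one variant gives way to another.
from collections import Counter, defaultdict

# Each response: (region, word, pronunciation variant chosen by the participant).
responses = [
    ("Paris", "vingt", "/vɛ̃/"),
    ("Paris", "vingt", "/vɛ̃/"),
    ("Nancy", "vingt", "/vɛ̃t/"),
    ("Nancy", "vingt", "/vɛ̃t/"),
    ("Québec", "vingt", "/vɛ̃t/"),
]

# Tally variants per (region, word) pair.
counts = defaultdict(Counter)
for region, word, variant in responses:
    counts[(region, word)][variant] += 1

# The dominant variant per region hints at where a dialect boundary lies.
for (region, word), variants in counts.items():
    variant, n = variants.most_common(1)[0]
    total = sum(variants.values())
    print(f"{region}: '{word}' is most often {variant} ({n}/{total} responses)")
```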
For its part, the private sector has also done some remarkable work. For instance, Maxim showed a demo of a synthetic voice assistant for the medical industry developed by his company Sanas. In real time, the synthetic voice assisted Maxim in a conversation that sounded like an exchange between two real people. According to Maxim, the model runs with 150-millisecond latency and about 15% CPU usage, even on modest hardware such as an Intel Core processor from five or more years ago.
Maxim’s vision with Sanas is to give people a choice with their voices. A worker should not be forced to change their way of communicating to hold a job, and Sanas’ technology lets them control the way they communicate. The biases faced by people with different accents can be tackled effectively, and with the right tech we can improve their situation not just in the workplace but in everyday life.
There are two fundamental concepts to take into account: community and democratization.
At Hugging Face, Sanchit and his teammates engage closely with the community, whether by working in an open-source framework, sharing datasets and models on their website, or letting people experience the technology firsthand.
Their open-science approach allows them to democratize good machine learning and have the community participate in solving the problems that affect it directly. They open the doors for these models to be used by anyone (see the sketch below). With transparency and inclusion comes the democratization of speech recognition, which in turn increases diversity and reduces bias.
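To illustrate how low that barrier is, here is a minimal sketch, assuming the open-source transformers library is installed, of loading an openly shared speech recognition model from the Hugging Face Hub and transcribing a local recording. The specific checkpoint and file name are only examples, not the panelists’ recommendation.

```python
# Minimal sketch: load an openly shared speech recognition model from the
# Hugging Face Hub and transcribe a local audio file in a few lines.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # example: an openly available multilingual ASR checkpoint
)

# Transcribe a local recording (e.g. speech with a regional accent).
result = asr("my_recording.wav")
print(result["text"])
```

Swapping in a checkpoint fine-tuned on a specific language or accent is a one-line change, which is part of what makes community contributions so direct.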
Sanchit points out that when people experience the tech and see its shortcomings, they are motivated to help, curate, and find ways to contribute to solving the community’s problems. If Hugging Face doesn’t work on the problems the community is facing, people won’t see value in what they do. The company and the community are deeply intertwined; without that relationship, neither of them progresses.
At the end of the day, two heads are better than one.
Speech technologies are rapidly becoming part of our everyday lives, but they lack the local touch. At the second Deep Voice summit, a panel of experts addressed the need for more inclusion of local communities in voice assistant interactions by integrating the diversity of accents and dialects into this technology. This would also help tackle biases present throughout the whole process (from dataset creation to model predictions), which leads to a better user experience in the end. Ultimately, success lies in the details of these voices.
Watch the round-table video