Why Machine Learning Still Needs Humans for Language

Outperforming Humans

Machine Learning (ML) has begun to outperform humans at many tasks that seemingly require intelligence. The hype around ML regularly reaches the mass media: it can now read lips, recognize faces, and transcribe speech to text. Yet when it comes to dealing with the ambiguity, variety and richness of language, to understanding text or extracting knowledge from it, ML still needs human experts.

Knowledge is Stored as Text

The web is certainly our greatest knowledge source. However, it has been designed for consumption by humans, not machines. The web’s knowledge is mostly stored in text and spoken language, enriched with images and video. It is not a structured relational database storing numeric data in machine processable form.

Text is Multilingual

The web is also highly multilingual. Recent statistics show, perhaps surprisingly, that only 27% of the web’s content is in English, and only a further 21% is in the next five most used languages. That means more than half of its knowledge is expressed in a long tail of other languages.

Constraints of Machine Learning

ML faces some serious challenges. Even with today’s hardware, the demand for computing power can become astronomical when input and desired output are rather fuzzy (see the great NYT article, “The Great A.I. Awakening”).

ML is great for 80/20 problems, but it is dangerous in contexts that demand high accuracy: “Digital assistants on personal smartphones can get away with mistakes, but for some business applications the tolerance for error is close to zero”, emphasizes Nikita Ivanov of Datalingvo, a Silicon Valley startup.

ML performs well on n-to-1 questions. In facial recognition, for instance, there is only one correct answer to the question “which person do all these pixels show?” ML struggles, however, with n-to-many or gradual problems: there are many ways to translate a text correctly or to express a given piece of knowledge.

ML is only as good as its available training material. For many tasks, mountains of data are needed, and that data must be of supreme quality. For language-related tasks, these mountains of data are often required per language and per domain. Furthermore, it is also hard to decide when the machine has learned enough.

Monolingual ML Good Enough?

Some would suggest we should just process everything in English. ML also does an ‘OK’ job at machine translation (Google Translate, for example). So why not translate everything into English and then simply run our ML algorithms? This is a very dangerous approach, because errors compound. If the output of an 80%-accurate machine translation becomes the input to an 80%-accurate sentiment analysis, the accuracies multiply: 0.8 × 0.8 = 0.64, or 64% end to end. At that hit rate you are getting close to flipping a coin.
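The compounding effect above can be sketched in a few lines of Python. The stage accuracies are the illustrative 80% figures from the text, not measurements of any real system, and the independence of stage errors is a simplifying assumption.

```python
# Sketch: end-to-end accuracy of a pipeline when each stage's
# errors are independent -- per-stage accuracies simply multiply.

def pipeline_accuracy(stage_accuracies):
    """Multiply the accuracy of each stage in the pipeline."""
    result = 1.0
    for acc in stage_accuracies:
        result *= acc
    return result

# 80% machine translation feeding 80% sentiment analysis:
print(f"{pipeline_accuracy([0.80, 0.80]):.0%}")  # 64% -- close to a coin flip
```

Adding a third 80%-accurate stage drops the figure to about 51%, which is why long chains of imperfect language tools degrade so quickly.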

[Figure: Errors multiply]

Human Knowledge to Help

The world is innovating constantly. Every day new products and services are created. To talk about them, we continuously craft new words: the bumpon, the ribbon, a plug-in hybrid, TTIP. Only with the innovative force of language can we communicate new things.

A Struggle with Rare Words

By definition, new words are rare. They first appear in one language and then may slowly propagate into other domains or languages. There is no knowledge without these rare words, the terms. Look at a typical product catalog description with the terms highlighted. Now imagine this description without the terms: it would be nothing but a meaningless scaffold of filler words.

Knowledge Training Required

At university we acquire the specific language and terminology of the field we are studying. We become experts in that domain. Even so, when we later change jobs during our professional career, we still have to acquire the lingo of a new company: names of products, modules, and services, but also job roles and their titles, names for departments, processes, and so on. We become familiar with a specific corporate language by attending training and by reading policies, specifications, and functional descriptions. Machines need to be trained in the very same way, with that explicit knowledge and language.

[Figure: No knowledge without terms]

Multilingual Knowledge Systems Boost ML with Knowledge

There is a remedy: terminology databases, enterprise vocabularies, word lists, glossaries – organizations usually already own an inventory of “their” words. This invaluable data can be leveraged to boost ML with human knowledge by transforming these inventories into a Multilingual Knowledge System (MKS). An MKS captures not only all words in all registers and all languages, but also structures them into a knowledge graph (a ‘convertible’ IS-A ‘car’ IS-A ‘vehicle’; a ‘front fork’ IS-PART-OF a ‘frame’ IS-PART-OF a ‘bicycle’).
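The graph idea behind an MKS can be sketched as a small set of subject–relation–object triples. The relations below are the examples from the text; the triple representation and the tiny traversal helper are illustrative assumptions, not a real MKS API.

```python
# Minimal sketch of a knowledge graph as (subject, relation, object) triples,
# using the IS-A and IS-PART-OF examples from the text.

relations = [
    ("convertible", "IS-A", "car"),
    ("car", "IS-A", "vehicle"),
    ("front fork", "IS-PART-OF", "frame"),
    ("frame", "IS-PART-OF", "bicycle"),
]

def ancestors(term, relation, triples):
    """Follow one relation transitively, e.g. all IS-A hypernyms of a term."""
    found = []
    current = term
    while True:
        # Find the next node this term points to via the given relation.
        nxt = next((o for s, r, o in triples if s == current and r == relation), None)
        if nxt is None:
            return found
        found.append(nxt)
        current = nxt

print(ancestors("convertible", "IS-A", relations))       # ['car', 'vehicle']
print(ancestors("front fork", "IS-PART-OF", relations))  # ['frame', 'bicycle']
```

Even this toy structure shows the payoff: a machine that only ever saw the word ‘convertible’ in training data can still infer that it is a kind of vehicle.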

It is the human-curated Multilingual Knowledge System that enables Machine Learning and Artificial Intelligence solutions to work for specific domains with only small amounts of textual data, including for less-resourced languages.