Computational Linguistics: Crash Course Linguistics #15
Hi, I'm Taylor and welcome to Crash Course Linguistics!
Computers are pretty great, but they can only do stuff that humans tell them to do.
Counterintuitively, this means that the more automatic a human skill is,
the more difficult it is for us to teach to computers.
It's easy for us to teach a computer to calculate millions of digits of pi, or play chess.
But get a computer to recognize which image contains a traffic light?
Surprisingly difficult!
The same thing goes for language.
The parts that are difficult for humans, like learning lots of new words, are easy for computers.
And the parts that are easy for humans, like understanding across typos and accents,
or knowing if someone's sad or angry or joking, are really, really difficult for machines.
Plus, language isn't just one task to teach.
It's all the different things we've talked about throughout this series and more.
Programming computers to process human language is called natural language processing, or NLP, and it's a major part of the field of computational linguistics.
We rely on NLP for a whole range of tasks:
search engines, voice-activated home systems, spam filters, spell checkers, predictive text and more.
Today, we'll look at what NLP is and what makes language a difficult challenge for computers.
[THEME MUSIC]
Getting a computer to work with something as complex as language requires a lot of steps.
First, we need to give the computer text to work with.
We can input it directly or get the computer to transform speech sounds, handwriting,
or other physical text into digital text.
We do that with speech-to-text, handwriting recognition, or optical character recognition processes.
This step involves figuring out where the breaks between words and sentences go,
such as the difference between "a moist towelette" versus "a moist owlet,"
or whether a small speck is the dot of an i, a period, or a fleck of dirt.
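To see why those breaks are tricky, here's a toy sketch in Python. The tiny vocabulary and the brute-force search are made up for illustration; real speech-to-text and OCR systems use probabilistic models over enormous vocabularies.

```python
# Toy sketch: find every way to split a run of letters into known words.
# The vocabulary here is a made-up placeholder, not a real lexicon.

vocabulary = {"a", "moist", "towelette", "owlet", "owl"}

def segmentations(text):
    """Return every way to split `text` into words from the vocabulary."""
    if not text:
        return [[]]
    results = []
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        if prefix in vocabulary:
            for rest in segmentations(text[i:]):
                results.append([prefix] + rest)
    return results

# Very similar letter streams can come out as very different phrases:
print(segmentations("amoisttowelette"))  # [['a', 'moist', 'towelette']]
print(segmentations("amoistowlet"))      # [['a', 'moist', 'owlet']]
```

A real system would also weigh how likely each segmentation is, not just check whether the words exist.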
Once it has the digital text, we then need the computer to figure out
a) the meanings of the words, and
b) the relationship between them.
It might use context to disambiguate between similar-looking words like "bank" and "blank",
between different senses like a river bank and a financial bank, or between common nouns and proper nouns.
In this step, the machine figures out approximately what is being said.
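As a rough illustration of that disambiguation step, here's a toy Python sketch that picks a sense of "bank" by counting overlapping context words. The hand-written clue lists stand in for what a real system would learn from data.

```python
# Toy word sense disambiguation for "bank": pick the sense whose clue
# words overlap most with the sentence. Real systems learn these
# associations statistically instead of using hand-written lists.

sense_clues = {
    "bank (river)":     {"river", "water", "fishing", "muddy", "shore"},
    "bank (financial)": {"money", "account", "loan", "deposit", "teller"},
}

def guess_sense(sentence):
    """Return the sense with the biggest context-word overlap."""
    words = set(sentence.lower().split())
    return max(sense_clues, key=lambda sense: len(sense_clues[sense] & words))

print(guess_sense("she opened an account at the bank to deposit money"))
print(guess_sense("we went fishing on the muddy bank of the river"))
```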
The next step is to get it to do something useful with that information,
such as answer a question, translate it into another language, or find directions between two places.
Each of these tasks also requires a different system.
All of this data gets produced in some abstract form that the computer understands,
like a mathematical equation or some lines of code.
The last step is to re-encode that data into natural human language, which can involve text generation.
Depending on what the user wants, the computer might need to produce the answer as speech,
in which case it would use text-to-speech and speech synthesis.
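Putting it all together, a pipeline like the one we just described might be wired up roughly like this. Every function here is a made-up stub standing in for a whole subsystem; the sketch just shows how the steps hand off to each other.

```python
# Hypothetical outline of the full pipeline. Each function is a stub
# standing in for a whole subsystem (speech recognition, parsing, etc.).

def speech_to_text(audio):
    # Stand-in for a speech recognizer: pretend the audio is already transcribed.
    return audio

def parse(text):
    # Stand-in for working out the words and the relationships between them.
    return {"words": text.split()}

def run_task(meaning, task):
    # Stand-in for a downstream task like question answering or translation.
    return f"({task}) I understood {len(meaning['words'])} words."

def generate_text(result):
    # Stand-in for turning the computer's internal result back into language.
    return result

def text_to_speech(reply_text):
    # Stand-in for speech synthesis: return the text it would say aloud.
    return f"[spoken] {reply_text}"

def answer_spoken_question(audio):
    text = speech_to_text(audio)                          # step 1: get digital text
    meaning = parse(text)                                 # step 2: words and relations
    result = run_task(meaning, task="answer_question")    # step 3: do something useful
    return text_to_speech(generate_text(result))          # step 4: back out as language

print(answer_spoken_question("where is the nearest library"))
```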
That's a lot of steps!
The nice thing about splitting up natural language processing into different steps is
that we can reuse parts of it for other tasks.
For example, if we make one system that's good at text-to-speech for English,
it can read aloud answers to questions, translations into English, and directions to places.
We can also distinguish between what needs to be customized for each human language and
what can always stay in computer code.
That saves programmers, and computers, some time!
Tools that perform just one or two of these subtasks can also be useful by themselves.
Automatic captioners may just do the speech-to-text part, screen readers may just do text-to-speech,
and search or translation may start with text and skip processing speech entirely!
A similar set of steps could work for signed languages too,
although this technology is very under-developed compared to what's been created for a few big spoken languages.
They could be something like:
sign-to-text, parsing signs, processing the results for a computer to work with,
and rendering the output back into signs.
We could then also create systems that interoperate between signed and spoken languages.
For example, a computer could take input in English and translate it to ASL, or vice versa.
Just like with the thousands of spoken languages, though, each of the hundreds of signed languages would still need to be supported separately.
One thing that won't really help is gloves.
Let's head to the Thought Bubble to pop that bubble.
You might have seen hyperbolic headlines about "sign language translation gloves" in the news through the years.
They claim that these gloves can “translate” American Sign Language into English speech by recognizing the wearer's handshapes.
Unfortunately, these glove makers have made several fundamental misunderstandings about how signed languages work.
One is that the grammar of signed languages isn't expressed just in the shape of the hand.
Signed languages also include facial expressions and movements of the hands and arms in relation to the rest of the body.
Two is that signed languages use far more signs than the 26 letters of the manual alphabet,
which is all the gloves can detect.
Plus, signed languages tend to use the manual alphabet to borrow technical words from spoken languages,
not for core vocabulary.
That's like making a "translation" system for English that only recognizes the words that come from Greek!
Three is that translation should enable two-way communication between hearing and deaf people,
but gloves can only translate from signs to speech, never from speech to a format accessible for Deaf and Hard of Hearing people.
Which is ironic, because the technology to produce written captions of speech already exists!
Computational tools involving signed languages could one day exist, using other input sources that can actually access full signs,
but they're never going to be any good if Deaf people aren't consulted in creating them.
And many Deaf researchers have already pointed out that gloves are just never going to accomplish that.
Thanks, Thought Bubble!
So, let's say we've created a system that's pretty good at each of the steps involved in natural language processing,
at least for one or two languages.
Does the system "understand" language the way a human does?
To answer that, let's pretend we've trained a rabbit to press buttons A, B and C in order to get a treat.
We could relabel those buttons “I”, “want”, “food”, but that wouldn't mean that the rabbit understands English.
The rabbit would press the same buttons if they were labelled something entirely unrelated.
The same goes for a computer.
If we tell a computer a few basic instructions, it can give the appearance of understanding language.
But it might fall apart spectacularly when we ask it something more complicated.
That's part of what makes teaching a computer to do language so tricky.
Originally, people taught computers to do language tasks with long lists of more and more specific rules,
such as "make a word plural by adding s".
Wait, unless the word is "child", in which case add "-ren" instead, and so on for other exceptions.
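In code, that rule-plus-exceptions approach might look something like this sketch; the exception list is deliberately tiny, and a real rule-based system would need a great many more of them.

```python
# Sketch of the old rule-based approach: a general rule plus a growing
# list of hand-written exceptions.

IRREGULAR_PLURALS = {
    "child": "children",
    "person": "people",
    "mouse": "mice",
}

def pluralize(noun):
    """Apply 'add -s', unless an exception rule says otherwise."""
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(("s", "sh", "ch", "x", "z")):
        return noun + "es"   # yet another special case: bus -> buses
    return noun + "s"

print(pluralize("cat"), pluralize("child"), pluralize("bus"))
```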
More modern approaches to machine learning involve showing computers a whole bunch of data to train them on statistical patterns
and then testing how well they've figured out these patterns using a different set of data.
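The core of that train-then-test setup can be sketched in a few lines; the word pairs here are just placeholders for the much larger datasets real systems need.

```python
# Minimal sketch of the train/test split behind statistical approaches:
# learn patterns from one chunk of data, evaluate on held-out data.

import random

examples = [("cat", "cats"), ("dog", "dogs"), ("child", "children"),
            ("book", "books"), ("mouse", "mice"), ("tree", "trees")]

random.shuffle(examples)
split = int(0.8 * len(examples))
training_data = examples[:split]   # the computer learns patterns from these
test_data = examples[split:]       # ...and is evaluated on these unseen pairs

print(len(training_data), "training pairs,", len(test_data), "test pairs")
```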
A lot of recent leaps in natural language processing have come from a kind of statistical machine learning known as neural networks.
Neural nets are based on a very simplified model of how neurons work in the brain,
allowing them to figure out for themselves which factors are the most relevant in the training data.
But because they work out these factors for themselves,
it's hard for humans to know exactly what patterns they're picking up on.
Early in a neural net's training, it will make really silly, non-human-like errors,
like returning the text "eeeeeeeee" because it's worked out that "e" is the most common letter in English writing.
The machine will keep adjusting itself based on the training data, though,
and eventually it starts returning things that look more like words.
Well, almost.
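Here's a toy illustration of that "eeeeeeeee" stage: a model that has only learned single-letter frequencies will just repeat the most common letter. A real neural network is vastly more complicated, but it starts out from similarly crude statistics.

```python
# Toy "under-trained" model: if all it knows is which letter is most
# frequent, its safest guess for every position is that same letter.

from collections import Counter

training_text = "the machine will keep adjusting itself based on the training data"
letter_counts = Counter(c for c in training_text if c.isalpha())

most_common_letter = letter_counts.most_common(1)[0][0]
print(most_common_letter * 9)   # for this text: "eeeeeeeee"
```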
In any kind of machine learning, training data is really important,
and there are two kinds of data we can use.
The first is data with two corresponding parts that have been matched by humans,
such as text with audio, words with definitions, questions with answers, sentences with translations,
or images with captions.
Using parallel data like this is known as supervised learning, and it's great,
but it can be hard to find enough data that has both parts.
After all, some humans have to create all of these pairs.
The second kind of data has only one component, like a bunch of text or audio or video in one language.
Using this kind of non-parallel data is known as unsupervised learning.
It's much easier to find, but it's harder to use to train a computer,
since it has to learn from only half of the pair.
So researchers often use a mix of both:
a smaller amount of parallel data to get things started, and then a larger amount of non-parallel data.
This combination is called semi-supervised learning.
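As simple data structures, the difference between the two kinds of data might look like this; the sentences are made-up placeholders rather than a real corpus.

```python
# Parallel data: human-matched pairs, used for supervised learning.
parallel_data = [
    ("the cat sleeps", "le chat dort"),
    ("I want food",    "je veux de la nourriture"),
]

# Non-parallel data: just lots of text in one language,
# used for unsupervised learning.
non_parallel_data = [
    "the dog sleeps on the sofa",
    "cats and dogs both want food",
    "the sofa is very comfortable",
]

# Semi-supervised learning combines them: a small parallel set to get
# started, plus a much larger non-parallel set to learn general patterns.
print(len(parallel_data), "labelled pairs +", len(non_parallel_data), "unlabelled sentences")
```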
But none of this data just magically appears.
It gets created or gathered by humans, and humans have all sorts of biases.
Computer science researcher Harini Suresh created a framework to evaluate bias in machine learning.
We can use this framework to see how bias affects the language tools we've discussed in this episode.
First, historical bias is when a bias in the world gets reflected in the output the computer produces.
For example, Turkish doesn't make a gender distinction in any of its pronouns,
whereas English does in the third person singular, between he, she, it, and singular they.
So a translation system might pick a gender for pronouns when translating from Turkish to English,
making "he is a doctor" but "she is a nurse" for the same Turkish pronoun.
This might reflect an overall tendency in the world, but our computer is still producing a gender bias!
Next, representation bias is when some groups aren't as well represented as others in the training data.
For instance, while researchers estimate that at least 2000 languages are actively being used on social media,
only a few large languages are well-represented in language tech tools.
The rest are barely represented or left out, including all signed languages.
When the features and labels in the training data don't accurately reflect what we're looking for,
that's measurement bias.
The text that has been translated into the most languages is the Bible,
so it's often used as training data.
But the style of language in religious texts can be very different from day-to-day conversation,
and can produce strange results in Google Translate.
Aggregation bias is when data from groups with different characteristics gets lumped together, even though a single system isn't likely to work well for all of them at once.
If we smushed all the varieties of English into training data for an “English” speech-to-text program,
it could end up working better for Standardized English than, say, African American English.
Evaluation bias occurs when researchers measure a program's success based on something users won't find useful.
Researchers with an English-first mentality might focus on whether a predictive text program predicts the next word,
whereas a program that predicts the next morpheme would work better for languages with longer words and more morphemes.
When a system was originally created for reasonable purposes but then gets misused after its release,
that's deployment bias.
Style analysis tools can be used to determine whether a historical figure wrote an anonymous book,
but they can also be misused to identify anonymous whistleblowers.
Being aware of these sources of bias is the first step in figuring out how to correct for them.
Like the whole field of computational linguistics, addressing these biases is an active area of research.
We have a responsibility to use our increased understanding of language through linguistics
to deeply consider the effects we have on each other and on the world we live in.
This ethical consideration is especially important in computational linguistics,
because we interact with technology so much in our daily lives.
Next time, we'll talk about a much older kind of language technology which is so common that we might not even think of it as a technology:
writing systems.
Thanks for watching this episode of Crash Course Linguistics.
If you want to help keep all Crash Course free for everybody, forever,
you can join our community on Patreon.