Every language has one. The kind of hot thing that rolls off a native tongue all sweet, but presses into your own ear jagged, curling your hair and making your skin itch. Some kind of clitic that ends a party, a string of morphemes that’s chin music. An optative or an elative comes out and you wish you could get the hell out. Even meek language learners feel a savageness when the strangeness comes around.
How much is lost in translation when we try to process only in English? Perhaps 90% of academic and commercial Natural Language Processing has focused only on English. If you are trying to find broad topics this might not matter, but if you are trying to identify all the subtle (or not so subtle) metaphors, sentiment and emotion, translating into English will often strip away the very phenomena you are most interested in.
|Language|Subtitle translation|Literal meaning|Machine translation back into English|
|---|---|---|---|
|French|L’étoffe dont sont faits les rêves.|“The material of which are made the dreams”|“The stuff that dreams are made of”|
|Hungarian|Dolgok, amikről álmodunk.|“things, about which we dream”|“Things I dream of.”|
|Turkish|Rüyalarin yapildigi maddeden.|“dreams’ were-made material-of”|“Material dreams are made.”|

Translations for “The stuff that dreams are made of” in French, Hungarian and Turkish
In the examples above, how much of the full impact of “the stuff that dreams are made of” is lost in translation? Only the French machine translation turns it back into the correct English, but we suspect that this is because it knows the famous quote. Imagine the full range of expressions in English that would lose their punch when translated: “bare your heart”, “give up”, “beside yourself”. Now realize that every single one of the world’s languages has an equally rich set of expressions and idioms that cannot be adequately translated, by humans or machines.
This is why we need intelligent Natural Language Processing that works within each language, not just with translations: it is often the most emotionally charged expressions that cannot be translated.
For this post we’ll break down this one example, taken from the most famous line from The Maltese Falcon. We choose this among all possible idioms or expressions for reasons close to our hearts: a month or so ago we moved into our new offices five floors above where the author Dashiell Hammett worked as a private eye.
A police detective picks up the Maltese Falcon statue and notes how heavy it is. “What is it?” he asks Sam Spade.
The, uh, stuff that dreams are made of.
Let’s take the lid off and see the works. We’re going to use the translations that actually appear in subtitles, courtesy of OpenSubtitles via Jörg Tiedemann’s OPUS corpus. (None of them choose to translate the uh, which is a bit sad since it’s one of the stronger stylistic markers). The machine translations are from a well-known search engine.
We’ll start with a pretty easy one. French is a broadly spoken language and since it is related to other widely spoken languages like Spanish and Portuguese, odds are that it won’t be all that foreign to you.
L’étoffe dont sont faits les rêves.
This is something like “The material of which are made the dreams”.
The word étoffe in French means ‘material’. It’s a feminine noun, which you might guess from the final –e (though that’s not really a sure-fire indication). Normally, you could tell from the article, but because the word starts with a vowel, la is shortened to l’. (The idiom il manque d’étoffe means ‘he lacks personality’, btw.)
Gender systems are pretty common around the world, not just in Indo-European languages. For example, Bantu languages across Africa have lots of genders, often between 7 and 10. What gender means for language learners and computational linguists is that we have to pay attention to a noun’s classification in order to know how to do stuff with it (like pluralize it) and how to handle agreement with other words like adjectives and verbs. In general, the more genders a language has, the more word forms there are that correspond to what we might want to call “the same” word.
Let’s press on. The dont is a ‘relative pronoun’ that indicates possession, so it could be translated as ‘of which’ or ‘from which’.
The verbal ‘are made’ meaning is found in sont faits. The first of those words is the third-person plural present tense of ‘to be’. Faits is from the verb faire, ‘to do’ or ‘to make’. The two words agree in number: if we were talking about the stuff that a dream was made of, we’d have est fait. In language after language, the verbs ‘to be’ and ‘to do’ are painfully irregular. Well, painful for the language learner. If you’re a native English speaker, when was the last time you said I am’ed or he do’ed? Frequency helps you learn (and it helps the form escape the grinding power of regularization).
Finally, les rêves are ‘the dreams’ (the singular is le rêve). That’s pretty straightforward, so I won’t say anything more about it.
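To make the agreement point concrete, here is a minimal Python sketch of how the third-person passive changes with the subject. The function names are made up for illustration, and the rule (add –e for feminine, –s for plural) only covers regular participles like fait; it is an assumption-laden toy, not a French conjugator.

```python
# Third-person present forms of 'to be' (être), keyed by number.
ETRE_PRESENT = {"sg": "est", "pl": "sont"}

def participle(base, gender, number):
    """Make a regular past participle agree with its subject.

    Simplification: add -e for feminine and -s for plural. Irregular
    cases (for example participles that already end in -s) are ignored.
    """
    form = base
    if gender == "f":
        form += "e"
    if number == "pl":
        form += "s"
    return form

def passive(base, gender, number):
    """Build 'est fait' / 'sont faits' style passives (third person only)."""
    return f"{ETRE_PRESENT[number]} {participle(base, gender, number)}"

print(passive("fait", "m", "pl"))  # sont faits (les rêves: masculine plural)
print(passive("fait", "m", "sg"))  # est fait   (le rêve: masculine singular)
```

Even in this toy, one participle fans out into four surface forms once gender and number are both in play, which is the "more word forms for the same word" effect described above.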
In Hungarian, the line is a bit more like “things, about which we dream”.
Dolgok, amikről álmodunk.
The word for ‘thing’ in Hungarian is dolog. But if you want to pluralize it, you don’t get to just add a letter at the end. Instead, you have to flip some stuff around: dolgok. There are some fun linguistic processes at work here, so let me know in the comments if you’re interested.
Ami is the way you say ‘which’ and the k in the middle is like the k at the end of dolgok, a marker of plurality. Now, about the ending: Hungarian has a nearly-limitless supply of affixes. You add ről to indicate ‘off’ or ‘about’. Check out this link to go have your mind boggled by the major noun cases: http://www.hungarianreference.com/Nouns/. (A “case” is basically one way that a language might keep track of which words are related to which other words in what kinds of ways—for example, a nominative case marker roughly means something is the subject of the sentence and an accusative roughly means something is the object of a sentence.)
Most of the case suffixes have two forms. That’s because Hungarian has what’s called “vowel harmony”. Vowel harmony is the phonetic equivalent of “don’t wear stripes with leopard prints”. It means that you need to make the vowel in a suffix match with the noun’s last vowel. But by “match”, I don’t mean “be identical to”. Open up your mouth and say a bunch of vowels a few times—you’ll notice that some of them happen in the front of your mouth and some of them in the back. That’s what matters in Hungarian. Other languages harmonize other things, sometimes at quite some distance (meaning that there are other consonants and vowels that may intervene in between the two things that depend upon each other).
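Here is a rough Python sketch of the front/back idea, using the ‘about’ suffix, which comes in a back form (–ról) and a front form (–ről). It follows the simplified description above and looks only at the last vowel of the word. Real Hungarian harmony is more involved than that, so treat the function name and the rule as illustrative assumptions.

```python
import unicodedata

# Hungarian vowels grouped by where they are made in the mouth.
BACK_VOWELS = set("aáoóuú")
FRONT_VOWELS = set("eéiíöőüű")

def add_about_suffix(word):
    """Attach -ról or -ről depending on the word's last vowel.

    Simplification: only the final vowel is checked, as in the prose
    description. Real Hungarian has exceptions this does not handle.
    """
    # Normalize so accented letters are single code points.
    word = unicodedata.normalize("NFC", word)
    for ch in reversed(word.lower()):
        if ch in BACK_VOWELS:
            return word + "ról"
        if ch in FRONT_VOWELS:
            return word + "ről"
    raise ValueError(f"no vowel found in {word!r}")

print(add_about_suffix("amik"))    # amikről
print(add_about_suffix("dolgok"))  # dolgokról
```

The first call reproduces the form from the subtitle, and the second shows the same suffix switching to its back-vowel variant.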
The verb álmodik is ‘to have a dream’, but you have to conjugate it. The form álmodunk is for ‘we dream’, except that Hungarians like to mess with your mind, so there are actually two different ways to say ‘we dream’. The –unk ending indicates that there’s no definite object that the verb is about. If you wanted to say that we dream some particular dream, you’d need the –juk ending instead.
In Turkish, the line is something like “dreams’ were-made material-of”.
Rüyalarin yapildigi maddeden.
For this, I’ll break it down word by word:
- Rüya is ‘dream’
- –lar is the plural
- –in means that the dream owns something (‘defined genitive case’)
- yap is a root of the base verb (yapmak), ‘to make, to do’
- –il is the passive
- –di is the past tense
- –gi…oh, gi. I’m going to talk about gi in a moment.
- madde is ‘material, substance’
- –den is ‘of’ (though it is also sometimes ‘to move away from, by, via’)
Okay, you know how you hear people using impact as a verb (it used to just be a noun)? Languages have all sorts of ways to change parts of speech. Sometimes you just take a word and leave it as-is (like impact), but other processes work, too (noun-ify is a verb from a noun, noun-y is an adjective from a noun, nouniness is a noun from an adjective from a noun).
In Turkish, the –gi turns a verb into an adjective. In this case, that lets it get tied to a noun. You can’t just use –gi willy-nilly, though. You can only use it with some conjugations. (Fwiw, if you drop the noun that the adjectivized verb is modifying, then you can use it as a noun instead and keep on appending affixes.)
Turkish is also a great example of why ‘keyword’ based Natural Language Processing is not sufficient in many languages, as most of the action is happening within the words, but we’ll leave more about suffixes and prefixes for another post.
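As a small illustration of that keyword problem, here is a Python sketch that looks for the root rüya in the subtitle line. The roots and suffixes are copied straight from the word-by-word breakdown, the gloss labels are informal, and the greedy segmenter is a hypothetical toy, nothing like a real Turkish morphological analyzer.

```python
# Toy lexicon taken from the breakdown of the subtitle line.
ROOTS = {"rüya": "dream", "yap": "make/do", "madde": "material"}
SUFFIXES = {
    "lar": "plural",
    "in": "genitive",
    "il": "passive",
    "di": "past",
    "gi": "verb-to-adjective",
    "den": "of/from",
}

def segment(word):
    """Return [root, suffix, ...], or None if the toy lexicon can't parse it."""
    word = word.lower()
    for root in ROOTS:
        if not word.startswith(root):
            continue
        morphemes, rest = [root], word[len(root):]
        while rest:
            # Greedily peel off the first known suffix that fits.
            match = next((s for s in SUFFIXES if rest.startswith(s)), None)
            if match is None:
                break
            morphemes.append(match)
            rest = rest[len(match):]
        if not rest:
            return morphemes
    return None

tokens = "Rüyalarin yapildigi maddeden".lower().split()

# Exact keyword matching misses the root entirely.
print("rüya" in tokens)  # False

# Looking inside each word recovers it.
for token in tokens:
    print(token, segment(token))
```

A whole-word search for ‘dream’ finds nothing here, while even this crude look inside the words recovers the root along with its plural and genitive markers.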
One of the reasons this Turkish translation is good is that it evokes the standard Turkish translation of Shakespeare. Part of what you might hear in Sam Spade’s line is from The Tempest: “Leave not a rack behind. We are such stuff / As dreams are made on; and our little life / Is rounded with a sleep”. In Turkish, the middle part is “ruyalarin yapildigi maddeden yapilmayiz biz…”, so the subtitle gets to evoke it for Turkish speakers, too.
Now that you’ve been vexed on your tongue and troubled in your brain, we’ll sign off. Go still your beating mind.
– Tyler Schnoebelen (@TSchnoebelen)
ps–Thanks very much to Bence Farkas and Ali Alpay for their help!
pps–The line we’ve worked on here is probably one of the most famous in film noir…but it actually doesn’t appear in Dashiell Hammett’s story.