Isaac Newton vs. millions of Japanese teens

Posted 07.02.2014

Less than 1% of languages have spontaneously developed a visual form. Most languages have never been written, and even English became a written language only relatively recently. Far fewer languages still have developed their own writing system: almost all written languages have adapted or borrowed their writing system from other languages.

There is clearly something inherently verbal about language that has made visual representations so rare. Despite the absence of language visualizations historically, it’s a hot topic now. That’s because people need to find ways to represent human communication at never-before-seen scales. The cutting edge was on display at the Workshop on Interactive Language Learning, Visualization, and Interfaces at the Annual Meeting of the Association for Computational Linguistics in Baltimore, last week.

We are proud to have sponsored the workshop by awarding travel grants (see announcement for more details).

ACL 2014. Language visualization in Baltimore has not been this popular since The Wire

Isaac Newton vs. millions of Japanese teens

Visualizing language is complicated. See our post on Petulant Punctuation, where even the humble period “.” at the end of a communication can drastically change its tone. It is also very beautiful and diverse. See our post on 7 Awesome Non-English Punctuation Symbols You Need. And punctuation is just the smallest step beyond visual representation of verbal speech.

When we look at other symbols in language, it gets messy very quickly. As I have said before, I believe that Unicode is the most sophisticated visualization framework, ever. The characters render across phones, tablets and computers, globally. The characters are as intelligible on a tiny phone display as they are on the giant screens in stadiums.

It is easy to forget that the Unicode standard did not just appear on our devices by magic. It is the result of work by a small number of very smart people. If you look at http://unicode-table.com/, every ordering of characters and mapping to binary forms was a conscious decision. For a short time, I was a liaison to the Unicode Consortium, which is the organization that decides what gets to officially become a character in digital communications. My influence on the current standard is near zero, and I was mainly a passive observer on email chains where many decisions were made. However, I was witness to Japanese ‘emoji’ being added to the standard. This taught me how complicated and tense the process could be.

Emoji are picture characters that are particularly popular in Japan, especially among teenagers using phones. In the last few years they have become more widespread. Some examples that should display in your browser are: ⚔⚘☃☂.
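To make the ‘mapping to binary forms’ concrete, here is a minimal Python sketch that looks up the code point, the official Unicode name, and the UTF-8 bytes for the four characters above. The helper name `describe` is my own, for illustration only; it relies on nothing beyond Python’s standard `unicodedata` module.

```python
import unicodedata

def describe(char):
    """Return (code point, Unicode name, UTF-8 bytes as hex) for one character."""
    code_point = f"U+{ord(char):04X}"
    name = unicodedata.name(char, "UNKNOWN")
    utf8_hex = " ".join(f"{byte:02x}" for byte in char.encode("utf-8"))
    return code_point, name, utf8_hex

# The four example characters from the paragraph above.
for char in "⚔⚘☃☂":
    print(*describe(char))
```

Each of these characters takes three bytes in UTF-8, and each name and number was assigned by the same committee process described above.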

As Wikipedia puts it, “[the addition of Emoji] went through a long series of commenting by members of the Unicode Consortium and national standardization bodies of various countries”.

This is true, but “a long series of commenting” fails to capture the intensity of the actual debate. I took a few hundred emails sent on the Unicode mailing list about emoji, and compared them to emails on the list from the same period that were not about emoji. So, to get very meta, I created a language visualization of the discussions around emoji’s inclusion in the Unicode standard. Here are the words that were most indicative of the conversations about emoji vs. other topics (using pointwise mutual information):

The keywords used by the Unicode Consortium when debating the inclusion of emoji in the Unicode standard.

“Specious”, “stupid”, “abuse”. It’s not surprising that the word ‘emoji’ was popular in the discussions, but clearly there was blood on the walls in the debate about whether cute icons should be considered language.
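For readers who want to try a comparison like this, here is a minimal sketch of ranking words by pointwise mutual information (PMI) with one set of documents versus another. It is not the code behind the visualization: the function name `pmi_keywords`, the `min_count` cutoff, the whitespace tokenizer and the toy emails are all assumptions made for illustration.

```python
import math
from collections import Counter

def pmi_keywords(target_docs, other_docs, min_count=2):
    """Rank words by pointwise mutual information with the target documents.

    PMI(word, target) = log2( P(word | target) / P(word) ),
    with probabilities estimated from raw token counts.
    """
    target_counts = Counter(w for doc in target_docs for w in doc.lower().split())
    other_counts = Counter(w for doc in other_docs for w in doc.lower().split())
    target_total = sum(target_counts.values())
    grand_total = target_total + sum(other_counts.values())

    scores = {}
    for word, count in target_counts.items():
        if count < min_count:
            continue  # PMI estimates are unreliable for very rare words
        p_word_given_target = count / target_total
        p_word = (count + other_counts[word]) / grand_total
        scores[word] = math.log2(p_word_given_target / p_word)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy stand-ins for the two sets of emails (invented, not real list traffic).
emoji_emails = ["the emoji proposal is specious", "emoji are not text"]
other_emails = ["the proposal for the new script", "text encoding of the script"]

for word, score in pmi_keywords(emoji_emails, other_emails):
    print(f"{word}\t{score:.2f}")
```

A word scores high when it is much more frequent in the target emails than in the list as a whole, which is why terms that appear in both kinds of discussion drop out.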

Isaac Newton’s notebooks were being encoded at the same time. I attended the meeting where the Unicode standards for Newton’s symbols were debated, fittingly under the biggest apple possible: Apple’s headquarters. There was one particular symbol that occurred only once: a handwritten scrawl that probably related to an element and/or valence, but the meaning was lost in time. The final standard and symbols can be seen at The Chymistry of Isaac Newton.

Compare the debate around emoji, used by millions of people daily, to the debate around a symbol of unknown meaning that occurred exactly once in human history. And that debate was…

Sir Isaac Newton

… crickets. There was no argument at all. The only debate was whether this one mystery symbol was truly unique or a variant representation of another symbol. For some reason, it held higher status than the millions of messages sent each day by teenagers, at least as far as the standard was concerned. It is difficult to pin down exactly why people have such strong, and conflicting, views about how best to represent language.

Because it’s hard

In Marti Hearst’s presentation at the workshop, she offered this simple explanation for why visualizations for language have not been standardized and why interfaces for search engines have barely changed in the last decade:

“Because it is hard”

The debate around emoji among the leading experts, and the lack of one around Newton’s scribbling, shows how true this is.

Workshopping what language looks like

The Workshop on Interactive Language Learning, Visualization, and Interfaces tackled this complicated topic from a number of directions.

It was interesting to see both incredible variety and imagination alongside a difficulty in rising above the simple ‘word cloud’ for topics. This was one of the fieriest debates, and it was repeated through the day: broadly, some people offered variations on word cloud technologies, while others argued that word clouds are not an effective tool for conveying information (the ‘3D pie chart’ of text visualization, as some would claim).

The graphics below taken from the papers show there is much more to this field:

Interactive Learning of Spatial Knowledge for Text to 3D Scene Generation
Angel Chang, Manolis Savva, Chris Manning

GLANCE Visualizes Lexical Phenomena for Language Learning
MeiHua Chen, Shih-Ting Huang, Ting-Hui Kao, Hsun-wen Chiu, Tzu-Hsi Yen

Dynamic Wordclouds and Vennclouds for Exploratory Data Analysis
Glen Coppersmith, Erin Kelly

SPIED: Pattern Based Information Extraction and Diagnostics
Sonal Gupta, Chris Manning

Interactive Exploration of Asynchronous Conversations: Applying a User-Centered Approach to Design a Visual Text Analytic System
Enamul Hoque, Giuseppe Carenini, Shafiq Joty

Design of an Active Learning System With Human Correction for Content Analysis
Nancy McCracken, Jasy Suet Yan Liew, Kevin Crowston

MiTextExplorer: Linked brushing and mutual information for exploratory text data analysis
Brendan O’Connor

LDAVis: A Method for Visualizing and Interpreting Topics
Carson Sievert, Kenneth Shirley

Concurrent Visualization of Relationships Between Words and Topics in Topic Models
Alison Smith, Jason Chuang, Yuening Hu, Jordan Boyd-Graber

Hiérarchie: Visualization for Hierarchical Topic Models
Alison Smith, Timothy Hawes, Meredith Myers

MUCK: A Toolkit for Extracting and Visualizing Semantic Dimensions of Large Text Collections
Rebecca Weiss

Active Learning With Constrained Topic Model
Yi Yang, Shimei Pan, Doug Downey

The invited speakers were:
Chris Culy (Universität Tübingen)
Marti Hearst (University of California, Berkeley)
Jimmy Lin (University of Maryland, College Park)
Noah Smith (Carnegie Mellon University)
Krist Wongsuphasawat (Twitter)

And the workshop was organized by:
Jason Chuang (UW)
Spence Green (Stanford)
Marti Hearst (UC Berkeley)
Jeffrey Heer (UW)
Philipp Koehn (Edinburgh)

Whether it’s a quick graph for understanding linguistic trends, or symbolic representations of language for the ages, we look forward to more from these researchers and similar workshops in the future!

Rob Munro
@WWRob

Congratulations to Travel Grant Recipients

Posted 06.21.2014

Idibon is sponsoring the Workshop on Interactive Language Learning, Visualization, and Interfaces on June 27, 2014, at the Annual Meeting of the Association for Computational Linguistics.

We are excited to award travel grants to two students with papers accepted to the workshop:

Enamul Hoque, University of British Columbia

Yi Yang, Northwestern University

And to an invited speaker:

Chris Culy of Universität Tübingen

Chris Culy will be speaking about Learning with MOTHs, drawing on a wide variety of work in mathematical linguistics, fieldwork in Mali, theoretical morphosyntax, machine translation and summarization. The student travel award recipients are presenting the following co-authored papers:

Interactive Exploration of Asynchronous Conversations: Applying a User-Centered Approach to Design a Visual Text Analytic System. Enamul Hoque, Giuseppe Carenini and Shafiq Joty
Active Learning With Constrained Topic Model. Yi Yang, Shimei Pan and Doug Downey

We congratulate them and look forward to seeing all three present at the workshop!

Rob Munro
@WWRob

Europe’s 1000 Words for Parliament

Posted 05.27.2014

Understanding the diversity and variation of language is key to understanding languages globally. English has many variations for the word ‘parliament’, from the simple plural, ‘parliaments’, to the word conveying the processes/properties, ‘parliamentary’, and the more complex word stating an absence of advocacy, ‘unparliamentarianism’.

If we extract all the variations of the word ‘parliament’ from the transcripts of the European Parliament we see about 40 different variations on the word in English:

The two longest words, ‘deparliamentarization’ and ‘Europarliamentarianism’, are topical right now, with anti-European Union populists gaining seats in the European Parliamentary Elections. Or in other words, the European Parliament’s newest parliamentarians are calling for deparliamentarization of the parliament due to the excessive Europarliamentarianism of their parliamentary colleagues.

Personally, I find this kind of variation (known as ‘morphological variation’) more interesting than looking for unrelated words with similar meanings (as in ‘50 words for snow’). Different languages can apply affixes to words to subtly (or not so subtly) alter a word’s meaning in very different ways, with each language allowing a distinct set of tools for talking about the same subject.
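Collecting these variations from a transcript can be sketched in a few lines. This is an illustrative approximation, not the script used for the visualizations: the function name `word_variations` and the sample text are invented, and it assumes that every variation contains the literal string ‘parliament’ (other languages would need their own stem, and compounds or irregular forms would be missed).

```python
import re
from collections import Counter

def word_variations(text, stem="parliament"):
    """Count every distinct word form that contains the given stem."""
    # \w* on both sides lets prefixes and suffixes attach to the stem.
    pattern = re.compile(rf"\w*{re.escape(stem)}\w*", re.IGNORECASE)
    return Counter(match.group(0) for match in pattern.finditer(text))

# Invented sample text; the real input would be the Europarl transcripts.
sample = (
    "The Parliament met. Several parliamentarians asked whether other "
    "parliaments follow the same parliamentary procedure as this Parliament."
)

for form, count in word_variations(sample).most_common():
    print(form, count)
```

Matching is case-insensitive, but the forms are counted as they were written, so ‘Parliament’ and ‘parliament’ would be kept apart.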

It is interesting to look at this variation across languages. Before you rejoice in the richness of English, you should know:

English is one of our least complex languages.

Or to be more precise, it has far fewer affixes than almost any other language. In fact, English is often called morphologically impoverished.

When we look to French, we see a few more variations:

And if we look to the Spanish transcript from the European Parliament we see double the complexity of English:

The French and Spanish transcripts get some of the additional variation from the gender system common to Romance languages. For example, Spanish has both the feminine ‘parlamentarias’ and masculine ‘parlamentarios’. These double again in combination with prefixes, for example, ‘europarlamentarias’, ‘europarlamentarios’.

The pattern holds outside of Latin script, with substantial variation in Greek, too:

More complicated still, here’s Lithuanian:

Lithuanian is typical, globally, in its complexity. It doesn’t have too many more affixes than Greek, Spanish or French. It is just that a few more potential affixes greatly add to the number of possible combinations.
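The arithmetic behind that last point is multiplication: if a word has several affix slots, the number of possible forms is the product of the options in each slot. Here is a small sketch using made-up slots (the slot contents and the name `count_forms` are hypothetical, not taken from any real grammar):

```python
from itertools import product

def count_forms(slots):
    """Count surface forms when one option is chosen from each affix slot."""
    return sum(1 for _ in product(*slots))

# Hypothetical affix slots, invented for illustration; "" means "no affix".
fewer_slots = [["", "euro"], ["singular", "plural"]]
# Add a single extra slot with seven placeholder case endings.
more_slots = fewer_slots + [[f"case{n}" for n in range(1, 8)]]

print(count_forms(fewer_slots))  # 2 * 2 = 4
print(count_forms(more_slots))   # 2 * 2 * 7 = 28
```

One additional slot with seven options turns 4 possible forms into 28, even though only one new kind of affix was introduced.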

Among the famously complex languages we have Hungarian (Magyar):

And we finish with Finnish (take a deep breath):

There are more than 600 different spellings for ‘parliament’ in Finnish. Unlike the Romance languages, it does not have grammatical gender, but it does have a large number of affixes for representing locations. In English, we would use prepositions like ‘in parliament’ or ‘from parliament’. Finnish adds a suffix on the word itself, like ‘parlamentissa’ or ‘parlamentilta’. Finnish also has suffixes that mark whether a noun is the Subject or Object. English largely relies on word order to express and understand Subjects and Objects, whereas Finnish, with its suffixes, is freed up to rearrange where the Subject and Object occur in a sentence, often making the Object come earlier in a sentence when it is more salient. These examples barely scratch the surface, and even linguists have not yet unraveled all the complexities of Finnish. For a starting point, see the Wikipedia page on Finnish noun cases.

Why keywords are not enough

To pull it all together, here’s a simplified view of English, Lithuanian and Finnish:

Variations on 'parliament' in English, Lithuanian and Finnish

This variation has important implications for how we interact with technology (remember that Lithuanian is typical, globally). When you type a keyword like ‘parliament’ into a search engine, you expect it will also give you results for ‘parliaments’ and probably a few other variations. Even if it did not, it wouldn’t really matter: in the transcripts used here, ‘parliament’ accounts for about 90% of all uses. You probably don’t want results for ‘deparliamentarization’ when you search for ‘parliament’, in any case.

For languages other than English, it’s not so simple. The most common forms still account for about 50% of all uses, but that means that a keyword search is going to miss a lot. For most of the world’s languages, the leading search engines do not encode any of the variation. So the search results are much poorer for other languages.
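The ‘90%’ and ‘50%’ figures are simply the share of all uses taken by the most common form. A sketch of that calculation follows; the counts are invented to mirror those rough proportions and are not taken from the Europarl data, and `top_form_coverage` is a name chosen for illustration.

```python
def top_form_coverage(form_counts):
    """Return the most frequent form and its share of all occurrences."""
    total = sum(form_counts.values())
    top_form, top_count = max(form_counts.items(), key=lambda item: item[1])
    return top_form, top_count / total

# Invented counts, chosen only to mirror the rough 90% / 50% figures above.
english = {"parliament": 900, "parliamentary": 60, "parliaments": 40}
finnish = {"parlamentti": 500, "parlamentin": 300,
           "parlamentissa": 120, "parlamentilta": 80}

for name, counts in [("English", english), ("Finnish", finnish)]:
    form, share = top_form_coverage(counts)
    print(f"{name}: '{form}' covers {share:.0%} of uses")
```

A search that only matches the top form would see nine in ten English uses in this toy data, but only one in two of the Finnish ones.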

This is true for language technologies more broadly. The majority of language technologies are built first for English, with people often thinking ‘90% is good enough’ for English, not realizing that they are addressing less than 50% of the problem world-wide. This is true for everything from search engines to spam filters, sentiment analysis, and question-answering systems.

If you are building information systems for global users, think carefully about your assumptions. You can code yourself into a corner if you assume that ‘whitespace’ is all you need to extract and index words. Whether it is for health, education or access to market information, understanding the world’s communications requires an understanding of the languages’ structures, right down to the boundaries of each word.
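To see how the whitespace assumption fails, consider a naive index that splits on spaces and matches tokens exactly. The sketch below uses the two Finnish forms quoted earlier; the function names are mine, and the prefix match is only a crude stand-in for real morphological analysis, not a recommendation.

```python
def exact_hits(tokens, keyword):
    """Keyword search as exact matching on whitespace-separated tokens."""
    return [token for token in tokens if token == keyword]

def stem_hits(tokens, stem):
    """Crude alternative: match any token that starts with a shared stem."""
    return [token for token in tokens if token.startswith(stem)]

# The two Finnish forms quoted earlier, joined by a space (not a real sentence).
tokens = "parlamentissa parlamentilta".split()

# "parlamentti" is the dictionary form; neither token matches it exactly.
print(exact_hits(tokens, "parlamentti"))
# The shorter stem "parlament" is an assumption that happens to cover both.
print(stem_hits(tokens, "parlament"))
```

The exact match finds nothing, while the stem match finds both forms. A hand-picked prefix like this does not generalize, which is the point: the word boundaries and the structure inside each word both have to be modeled.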

Rob Munro
@WWRob

Notes:
1. The data was taken from the ‘Europarl’ corpus. See http://www.statmt.org/europarl/, with thanks to Philipp Koehn.
2. There are a few words that are probably typos or borrowings, like ‘CommissionParliament’ and ‘Parlamento’ in the English visualization. In order to be quick and because I don’t speak every language above (sorry) I didn’t remove them.
3. The word clouds are case-insensitive, but the tree maps are not. That’s because case really is sometimes important, as when you refer to some specific proper-noun Parliament versus a generic “we need a new parliament”.
4. To the best of my knowledge, none of the parliamentarians calling for deparliamentarization have been asked to spell ‘deparliamentarization’.
5. The treemaps are generated using https://developers.google.com/chart/interactive/docs/gallery/treemap, which was a joy to use.
6. I sized each item in the visualizations by the log of its count in the corpus, to smooth out the more disparate counts.
7. I chose the languages above as a diverse subset of those available from the proceedings of the European Parliament, but partly also because I already had these ones handy for a quick analysis.
8. [Edit - added thanks to Jean, in comments] Some of the variation comes from compounds, not affixes. For languages like German, compounding is very common, which was why I omitted German here. Compounding can also make keyword-based language technologies less than ideal.
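As a rough illustration of note 6, here is how log scaling compresses very different counts into comparable sizes. The counts are invented, the name `log_sizes` is hypothetical, and adding 1 before taking the log is my own assumption so that a word seen once does not end up with size zero.

```python
import math

def log_sizes(counts):
    """Map raw counts to log-scaled sizes so that rare forms stay visible."""
    # Adding 1 is an assumption: it keeps a count of 1 from mapping to size 0.
    return {form: math.log(count + 1) for form, count in counts.items()}

# Invented counts for illustration.
counts = {"parliament": 10000, "parliaments": 100, "deparliamentarization": 1}

for form, size in log_sizes(counts).items():
    print(f"{form}\t{size:.2f}")
```

The raw counts differ by a factor of 10,000 between the most and least common word, while the log-scaled sizes differ by a factor of about 13.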
