Less than 1% of languages have spontaneously developed a visual form. Most languages have never been written, and even English became a written language only relatively recently. Far fewer languages have developed their own writing system: almost all of them adapted or borrowed one from other languages.
There is clearly something inherently verbal about language that has made visual representations so rare. Despite the absence of language visualizations historically, it’s a hot topic now. That’s because people need to find ways to represent human communication at never-before-seen scales. The cutting edge was on display at the Workshop on Interactive Language Learning, Visualization, and Interfaces at the Annual Meeting of the Association for Computational Linguistics in Baltimore, last week.
We are proud to have sponsored the workshop by awarding travel grants (see announcement for more details).
Isaac Newton vs. millions of Japanese teens
Visualizing language is complicated. See our post on Petulant Punctuation, where even the humble period “.” at the end of a communication can drastically change its tone. Written language is also very beautiful and diverse. See our post on 7 Awesome Non-English Punctuation Symbols You Need. And punctuation is just the smallest step beyond visual representation of verbal speech.
When we look at other symbols in language, it gets messy very quickly. As I have said before, I believe that Unicode is the most sophisticated visualization framework, ever. The characters render across phones, tablets and computers, globally. The characters are as intelligible on a tiny phone display as they are on the giant screens in stadiums.
It is easy to forget that the Unicode standard did not just appear on our devices by magic. It is the result of the work of a small number of very smart people. If you look at http://unicode-table.com/, every ordering of characters and mapping to binary forms was a conscious decision. For a short time, I was a liaison to the Unicode Consortium, which is the organization that decides on the standard for what gets to officially become a character in digital communications. My influence on the current standard is near zero, and I was mainly a passive observer on email chains where many decisions were made. However, I was a witness when Japanese ‘emoji’ were being added to the standard. This taught me how complicated and tense the process could be.
Emoji are picture characters that are particularly popular in Japan, especially among teenagers using phones. In the last few years they have become more widespread. Some examples that should display in your browser are: ⚔⚘☃☂.
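To make the "mapping to binary forms" concrete, here is a minimal sketch that looks up the four example characters above with Python's standard `unicodedata` module. The `describe` helper and the choice of UTF-8 as the encoding to display are illustrative assumptions, not anything prescribed by the post.

```python
import unicodedata

# The four example characters quoted in the post.
examples = "⚔⚘☃☂"


def describe(ch):
    """Return the code point, official name, and UTF-8 bytes of one character.

    Hypothetical helper for illustration only.
    """
    return {
        "char": ch,
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        # UTF-8 is just one of several encodings of the same code point.
        "utf8": ch.encode("utf-8").hex(" "),
    }


rows = [describe(ch) for ch in examples]
for row in rows:
    print(row["char"], row["code_point"], row["name"], row["utf8"])
```

Each character has one assigned number (its code point) and one agreed name, and the bytes that travel between devices are derived from that number.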
As Wikipedia puts it, “[the addition of Emoji] went through a long series of commenting by members of the Unicode Consortium and national standardization bodies of various countries”.
This is true, but “a long series of commenting” fails to capture the intensity of the actual debate. I took a few hundred emails about emoji sent to the Unicode mailing list, and compared them to emails on the list from the same period that were not about emoji. So, to get very meta, I created a language visualization of the discussions around emoji’s inclusion in the Unicode standard. Here are the words that were most indicative of the conversations about emoji vs. other topics (using pointwise mutual information):
“Specious”, “stupid”, “abuse”. It’s not surprising that the word ‘emoji’ was popular in the discussions, but clearly there was blood on the walls in the debate about whether cute icons should be considered language.
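For readers unfamiliar with pointwise mutual information (PMI), the sketch below shows one common way to rank words by how strongly they are associated with one set of documents versus another. It is not the code or data behind the analysis above: the whitespace tokenizer, the `min_count` cutoff, the function name `pmi_ranking`, and the toy sentences are all assumptions made up for illustration.

```python
import math
from collections import Counter


def pmi_ranking(target_docs, other_docs, min_count=2):
    """Rank words by PMI with the target collection.

    PMI(word, target) = log2( P(word, target) / (P(word) * P(target)) ),
    with probabilities estimated from raw token counts.
    """
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    target = Counter(w for d in target_docs for w in d.lower().split())
    other = Counter(w for d in other_docs for w in d.lower().split())

    n_target = sum(target.values())
    n_total = n_target + sum(other.values())
    p_class = n_target / n_total

    scores = {}
    for word, count in target.items():
        if count < min_count:
            continue  # PMI is unreliable for very rare words
        p_joint = count / n_total
        p_word = (count + other[word]) / n_total
        scores[word] = math.log2(p_joint / (p_word * p_class))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented toy messages, NOT the real mailing-list emails.
emoji_emails = [
    "emoji proposal is stupid",
    "emoji abuse of the standard",
    "the emoji proposal",
]
other_emails = [
    "the proposal for the standard",
    "the standard is stable",
]

ranked = pmi_ranking(emoji_emails, other_emails)
for word, score in ranked:
    print(f"{word}\t{score:.3f}")
```

Words that appear almost only in the target collection score highest, while words that are common everywhere (such as "the") score near or below zero, which is why PMI surfaces the vocabulary that is distinctive to one conversation.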
Isaac Newton’s notebooks were being encoded at the same time. I attended the meeting where the Unicode standards for Newton’s symbols were debated, fittingly under the biggest apple possible: Apple’s headquarters. There was one particular symbol that occurred only once: a handwritten scrawl that probably related to an element and/or valence, but whose meaning was lost in time. The final standard and symbols can be seen at The Chymistry of Isaac Newton.
Compare the debate around emoji, used by millions of people daily, to the debate around a symbol of unknown meaning that occurred exactly once in human history. And that debate was…
… crickets. There was no argument at all. The only debate was whether this one mystery symbol was truly unique or a variant representation of another symbol. For some reason, it held higher status than millions of messages sent each day by teenagers, at least in terms of a standard, but it was difficult to pin down exactly why people have such strong, and conflicting, views about how to best represent language.
Because it’s hard
In Marti Hearst’s presentation at the workshop, she offered this simple explanation for why visualizations for language have not been standardized and why interfaces for search engines have barely changed in the last decade:
- “Because it is hard”
The debate around emoji, and the lack thereof around Newton’s scribblings, among the leading experts shows how true this is.
Workshopping what language looks like
The Workshop on Interactive Language Learning, Visualization, and Interfaces tackled this complicated topic from a number of directions.
It was interesting to see both incredible variety and imagination alongside a difficulty in rising above the simple ‘word cloud’ for topics. This was one of the fieriest debates, and it was repeated through the day: broadly, some people offered variations on word cloud technologies, and others argued that word clouds are not an effective tool for conveying information (the ‘3D pie chart’ of text visualization, as some would claim).
The graphics below taken from the papers show there is much more to this field:
Interactive Learning of Spatial Knowledge for Text to 3D Scene Generation
Angel Chang, Manolis Savva, Chris Manning
GLANCE Visualizes Lexical Phenomena for Language Learning
MeiHua Chen, Shih-Ting Huang, Ting-Hui Kao, Hsun-wen Chiu, Tzu-Hsi Yen
Dynamic Wordclouds and Vennclouds for Exploratory Data Analysis
Glen Coppersmith, Erin Kelly
SPIED: Pattern Based Information Extraction and Diagnostics
Sonal Gupta, Chris Manning
Interactive Exploration of Asynchronous Conversations: Applying a User-Centered Approach to Design a Visual Text Analytic System
Enamul Hoque, Giuseppe Carenini, Shafiq Joty
Design of an Active Learning System With Human Correction for Content Analysis
Nancy McCracken, Jasy Suet Yan Liew, Kevin Crowston
LDAVis: A Method for Visualizing and Interpreting Topics
Carson Sievert, Kenneth Shirley
Concurrent Visualization of Relationships Between Words and Topics in Topic Models
Alison Smith, Jason Chuang, Yuening Hu, Jordan Boyd-Graber
Hiérarchie: Visualization for Hierarchical Topic Models
Alison Smith, Timothy Hawes, Meredith Myers
Active Learning With Constrained Topic Model
Yi Yang, Shimei Pan, Doug Downey
The invited speakers were:
Chris Culy (Universität Tübingen)
Marti Hearst (University of California, Berkeley)
Jimmy Lin (University of Maryland, College Park)
Noah Smith (Carnegie Mellon University)
Krist Wongsuphasawat (Twitter)
And the workshop was organized by:
Jason Chuang (UW)
Spence Green (Stanford)
Marti Hearst (UC Berkeley)
Jeffrey Heer (UW)
Philipp Koehn (Edinburgh)
Whether it’s a quick graph for understanding linguistic trends, or symbolic representations of language for the ages, we look forward to more from these researchers and similar workshops in the future!