The language of food

Posted 09.16.2014

Food expressions are one of the most global forms of language. We call the French food croissant a “croissant” in English as well as in French, and don’t translate it to ‘crescent’, as we would for any other use of the French word croissant.

The original meanings of food words are often little known. For example, barbeque (my favorite food) comes from the Taíno language of Hispaniola, the island containing the Dominican Republic and Haiti. The original word, barabicu, roughly translates as ‘sacred fire pit’.

Often the food itself changes, too, even when the expressions do not. For example, ketchup was once a fish-based Fujianese sauce that was made thousands of miles from the nearest tomato.

We are happy to report that the complexities and etymologies of food expressions can be explored at length in a new book by Idibon’s advisor, Dan Jurafsky:

Dan will tell you exactly how ketchup went from fish to tomatoes, and shares many other interesting observations about food expressions, such as why you should worry when a restaurant menu insists that its dishes are ‘fresh food’.

See for more about the language of food, for dates/locations of where Dan will talk about it.

You can also read more about the book in the New York Times:

Or hear about it on NPR:

You should also listen to this earlier NPR report on Dan’s work with colleagues on the language of Yelp reviews. You get to hear Audie Cornish say, “Shame on you, cupcake”.

We wish Dan the best of success! We’ll be reading our copy at Idibon over lunch.

Rob Munro

The grammar of emoji

Posted 07.29.2014

    Until recently, you probably only had about 722 emoji to enliven your text messages. You no doubt felt that this wasn’t adequate to express all your complex ideas and emotional needs (in graphical icon form). Fear not, because 250 more emoji have been added in “Unicode 7.0”, which means your smartphone should have these vital images (like Man-in-business-suit-levitating) ready for you to use very soon.

    This update to Unicode has led to a flurry of curiosity about emoji. Idibon was quoted in Katy Steinmetz’s article on the emoji boom in this month’s Time magazine:

    You can also check out this work I did as part of a Time exclusive on “The Grammar of Emoji” (full access to this one is free):


    Unicode is the standard for written language that makes it possible for all our phones and computers to interpret characters in the same way. It applies to emoji, and also the Latin, Korean, and Tibetan scripts. Unicode covers punctuation and space characters, too – find some examples in my post on punctuation marks from other languages that are well worth adopting: 7 Awesome Non-English Punctuation Marks You Need to Use.
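As a rough illustration of how Unicode treats letters, punctuation, and spaces in the same way, the sketch below uses Python's standard `unicodedata` module to look up the code point, general category, and official name of a few characters. The sample characters are chosen here purely for illustration (they are an assumption, not taken from the posts mentioned above).

```python
import unicodedata

def describe(ch):
    """Return (code point, general category, official name) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# Illustrative sample (an assumption, not from the post): letters from the
# Latin, Korean, and Tibetan scripts, one punctuation mark, and one space.
samples = ["A", "\ud55c", "\u0f40", "\u00bf", "\u00a0"]

for ch in samples:
    # Only the ASCII description is printed, so this runs on any terminal.
    print(*describe(ch))
```

The category codes come straight from the standard: codes starting with `L` are letters, `P` punctuation, and `Z` separators such as spaces.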

    To get a look at the fight over emoji and their place in how we visually represent language, see Rob’s recent post with a view from the inside of Unicode’s construction: Isaac Newton vs. Millions of Japanese Teens.

    And if you’re in a languaging mood, definitely go read @Anagramatron. Colin Rothfels, who is behind that, has been collecting millions of English tweets and provided the data and a lot of the inspiration for this emoji work. Thanks, Colin! And thanks, Ben Zimmer, for connecting us!

    We also thank Time for their coverage of Idibon’s work; Unicode for their continuing efforts to allow people to connect; and the world’s texters for continually innovating new grammars from every piece of language available!

    - Tyler Schnoebelen (@TSchnoebelen)

      Isaac Newton vs. millions of Japanese teens

      Posted 07.02.2014

      Fewer than 1% of languages have spontaneously developed a visual form. Most languages have never been written, and even English became a written language only relatively recently. Far fewer languages have developed their own writing system – almost all written languages have adapted or borrowed their writing system from other languages.

      There is clearly something inherently verbal about language that has made visual representations so rare. Despite the historical scarcity of language visualizations, they are a hot topic now. That’s because people need to find ways to represent human communication at never-before-seen scales. The cutting edge was on display at the Workshop on Interactive Language Learning, Visualization, and Interfaces at the Annual Meeting of the Association for Computational Linguistics in Baltimore last week.

      We are proud to have sponsored the workshop by awarding travel grants (see announcement for more details).

      ACL 2014. Language visualization in Baltimore has not been this popular since The Wire

      Isaac Newton vs. millions of Japanese teens

      Visualizing language is complicated. See our post on Petulant Punctuation, where even the humble period “.” at the end of a communication can drastically change its tone. It is also very beautiful and diverse. See our post on 7 Awesome Non-English Punctuation Symbols You Need. And punctuation is just the smallest step beyond visual representation of verbal speech.

      When we look at other symbols in language, it gets messy very quickly. As I have said before, I believe that Unicode is the most sophisticated visualization framework, ever. The characters render across phones, tablets and computers, globally. The characters are as intelligible on a tiny phone display as they are on the giant screens in stadiums.

      It is easy to forget that the Unicode standard did not just appear on our devices by magic. It is the result of work by a small number of very smart people. If you look at it, every ordering of characters and mapping to binary forms was a conscious decision. For a short time, I was a liaison to the Unicode Consortium, which is the organization that decides on the standard for what gets to officially become a character in digital communications. My influence on the current standard is near zero, and I was mainly a passive observer on email chains where many decisions were made. However, I was a witness when Japanese ‘emoji’ were being added to the standard. This taught me how complicated and tense the process could be.

      Emoji are picture characters that are particularly popular in Japan, especially among teenagers using phones. In the last few years they have become more widespread. Some examples that should display in your browser are: ⚔⚘☃☂.
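To make the "mapping to binary forms" concrete, here is a minimal sketch that looks up the four symbols above with Python's standard `unicodedata` module and shows the bytes each one becomes in UTF-8. UTF-8 is only one of several encodings of Unicode code points; it is used here as an assumption because it is the most common one on the web.

```python
import unicodedata

def encode_info(ch):
    """Return (code point, official name, UTF-8 bytes as hex) for one character."""
    utf8_hex = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), utf8_hex)

# The four symbols shown above, written as escapes so the source stays ASCII.
for ch in "\u2694\u2698\u2603\u2602":
    # Only the ASCII description is printed, so this runs on any terminal.
    print(*encode_info(ch))
```

Each symbol has a fixed number and a fixed name in the standard, which is what lets a phone and a stadium screen agree on what to draw.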

      As Wikipedia puts it, “[the addition of Emoji] went through a long series of commenting by members of the Unicode Consortium and national standardization bodies of various countries”.

      This is true, but “a long series of commenting” fails to capture the intensity of the actual debate. I took a few hundred emails sent on the Unicode mailing list about emoji, and compared them to emails on the list from the same period that were not about emoji. So, to get very meta, I created a language visualization of the discussions around emoji’s inclusion in the Unicode standard. Here are the words that were most indicative of the conversations about emoji vs. other topics (using pointwise mutual information):

      The keywords used by the Unicode Consortium when debating the inclusion of emoji in the Unicode standard.

      “Specious”, “stupid”, “abuse”. It’s not surprising that the word ‘emoji’ was popular in the discussions, but clearly there was blood on the walls in the debate about whether cute icons should be considered language.
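For readers curious about the scoring, pointwise mutual information compares how often a word actually appears in one set of messages with how often it would appear there if words were spread evenly across both sets: PMI(w, c) = log2(P(w, c) / (P(w) P(c))). The sketch below is a minimal illustration under stated assumptions, not the code behind the figure: the whitespace tokenizer, the `min_count` cutoff, and the toy emails are all hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def pmi_scores(target_docs, other_docs, min_count=2):
    """PMI between each word and the target set: log2(P(w, c) / (P(w) * P(c)))."""
    target = Counter(w for doc in target_docs for w in tokenize(doc))
    other = Counter(w for doc in other_docs for w in tokenize(doc))
    n_target = sum(target.values())
    n_total = n_target + sum(other.values())
    p_class = n_target / n_total  # P(c): share of all tokens in the target set
    scores = {}
    for word, in_target in target.items():
        total = in_target + other[word]
        if total < min_count:  # very rare words get unreliable, inflated scores
            continue
        p_joint = in_target / n_total  # P(w, c)
        p_word = total / n_total       # P(w)
        scores[word] = math.log2(p_joint / (p_word * p_class))
    return scores

# Hypothetical toy messages, invented for illustration only.
emoji_emails = [
    "emoji are a specious proposal",
    "this emoji abuse is stupid",
    "emoji belong in the standard",
]
other_emails = [
    "the proposal adds a new script",
    "the standard needs this script",
    "a new block in the standard",
]

scores = pmi_scores(emoji_emails, other_emails)
for word in sorted(scores, key=scores.get, reverse=True)[:3]:
    print(word, round(scores[word], 3))
```

A positive score means the word is over-represented in the target set, and a negative score means it is under-represented. Words that occur only in the target set all tie at the maximum score, which is one reason a frequency cutoff matters in practice.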

      Isaac Newton’s notebooks were being encoded at the same time. I attended the meeting where the Unicode standards for Newton’s symbols were debated, fittingly under the biggest apple possible: Apple’s headquarters. There was one particular symbol that occurred only once: a handwritten scrawl that probably related to an element and/or valence, but whose meaning has been lost in time. The final standard and symbols can be seen at The Chymistry of Isaac Newton.

      Compare the debate around emoji, used by millions of people daily, to the debate around a symbol of unknown meaning that occurred exactly once in human history. And that debate was…

      Sir Isaac Newton

      … crickets. There was no argument at all. The only debate was whether this one mystery symbol was truly unique or a variant representation of another symbol. For some reason, it held higher status than millions of messages sent each day by teenagers, at least in terms of a standard. It was difficult to pin down exactly why people have such strong, and conflicting, views about how best to represent language.

      Because it’s hard

      In Marti Hearst’s presentation at the workshop, she offered this simple explanation for why visualizations for language have not been standardized and why interfaces for search engines have barely changed in the last decade:

      “Because it is hard”

      The debate among leading experts around emoji, and the lack of one around Newton’s scribbling, shows how true this is.

      Workshopping what language looks like

      The Workshop on Interactive Language Learning, Visualization, and Interfaces tackled this complicated topic from a number of directions.

      There was incredible variety and imagination on display, alongside a difficulty in rising above the simple ‘word cloud’ for topics. This was one of the fieriest debates, repeated throughout the day: broadly, some people offered variations on word cloud technologies, while others argued that word clouds are not an effective tool for conveying information – the ‘3D pie chart’ of text visualization, as some would claim.

      The graphics below, taken from the papers, show there is much more to this field:

      Interactive Learning of Spatial Knowledge for Text to 3D Scene Generation
      Angel Chang, Manolis Savva, Chris Manning

      GLANCE Visualizes Lexical Phenomena for Language Learning
      MeiHua Chen, Shih-Ting Huang, Ting-Hui Kao, Hsun-wen Chiu, Tzu-Hsi Yen

      Dynamic Wordclouds and Vennclouds for Exploratory Data Analysis
      Glen Coppersmith, Erin Kelly

      SPIED: Pattern Based Information Extraction and Diagnostics
      Sonal Gupta, Chris Manning

      Interactive Exploration of Asynchronous Conversations: Applying a User-Centered Approach to Design a Visual Text Analytic System
      Enamul Hoque, Giuseppe Carenini, Shafiq Joty

      Design of an Active Learning System With Human Correction for Content Analysis
      Nancy McCracken, Jasy Suet Yan Liew, Kevin Crowston

      MiTextExplorer: Linked brushing and mutual information for exploratory text data analysis
      Brendan O’Connor

      LDAVis: A Method for Visualizing and Interpreting Topics
      Carson Sievert, Kenneth Shirley

      Concurrent Visualization of Relationships Between Words and Topics in Topic Models
      Alison Smith, Jason Chuang, Yuening Hu, Jordan Boyd-Graber

      Hiérarchie: Visualization for Hierarchical Topic Models
      Alison Smith, Timothy Hawes, Meredith Myers

      MUCK: A Toolkit for Extracting and Visualizing Semantic Dimensions of Large Text Collections
      Rebecca Weiss

      Active Learning With Constrained Topic Model
      Yi Yang, Shimei Pan, Doug Downey

      The invited speakers were:
      Chris Culy (Universität Tübingen)
      Marti Hearst (University of California, Berkeley)
      Jimmy Lin (University of Maryland, College Park)
      Noah Smith (Carnegie Mellon University)
      Krist Wongsuphasawat (Twitter)

      And the workshop was organized by:
      Jason Chuang (UW)
      Spence Green (Stanford)
      Marti Hearst (UC Berkeley)
      Jeffrey Heer (UW)
      Philipp Koehn (Edinburgh)

      Whether it’s a quick graph for understanding linguistic trends, or symbolic representations of language for the ages, we look forward to more from these researchers and similar workshops in the future!

      Rob Munro
