For language processing tasks, it’s easy to imagine why you would want to conflate synonyms. For example, if you wanted to classify a news article as ‘sports related’, you would likely be more accurate if you knew that shooting hoops and playing bball mean more or less the same thing.
However, there are few pure synonyms in a language and the differences can matter. Whether or not you care about the differences depends greatly on the context. How do you invite your friends to play vs. watch a game? Does the area you live in have a particular preference? This is a post about the information to be gained by looking at variations that, at first glance, seem to mean the same thing. What would we lose by lumping them all together?
As part of text analysis of social media streams, we were parsing sports terms from about 10 million tweets. In the stream, 3,740 individuals used at least one of the terms basketball, bball, or hoops. Together the three terms were mentioned 10,944 times, and basketball was by far the most popular (78.6%, compared to 12.1% for bball and 9.35% for hoops).
As a side note, different senses of words complicate the matter. For example, we filtered out around 250 non-basketball-related uses of hoops. This was a big chunk: about one-fifth of all hoops mentions. As you might guess, most of the non-basketball occurrences were about ‘jumping thru’ hoops and ‘hula’ hoops. This is important for synonym conflation, because you can imagine errors propagating unnecessarily if the unambiguous ‘jumping through hoops’ were conflated with basketball terms. You need NLP systems that are sensitive to these differences or, better yet, can discover them automatically. Sad news, btw: no one reported any bureaucracies making them jump through hula hoops.
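As a rough illustration of that kind of sense filtering, here is a minimal sketch. The post does not say how the non-basketball hoops were actually filtered, so the regular expressions and the function name `is_basketball_hoops` are assumptions made only for this example; a real system would need far broader coverage.

```python
import re

# Hypothetical patterns for two non-basketball senses of "hoops".
# These are illustrative assumptions, not the filter used for the dataset.
NON_BASKETBALL_HOOPS = re.compile(
    r"\bhula[\s-]?hoops?\b"
    r"|\bjump(?:s|ed|ing)?\s+(?:thru|through)\b[^.!?]*\bhoops?\b",
    re.IGNORECASE,
)
HOOPS = re.compile(r"\bhoops\b", re.IGNORECASE)


def is_basketball_hoops(tweet: str) -> bool:
    """True if the tweet mentions hoops and no non-basketball pattern matches."""
    if not HOOPS.search(tweet):
        return False
    return NON_BASKETBALL_HOOPS.search(tweet) is None


tweets = [
    "shooting hoops after class",
    "so tired of jumping thru hoops at the dmv",
    "hula hoops are fun",
]
kept = [t for t in tweets if is_basketball_hoops(t)]
```

Hand-written patterns like these only catch the senses you already know about, which is why discovering the distinctions automatically is the more attractive option.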
Figure 1: Different co-occurrence patterns (except for “watch*”)
In Figure 1, you can see the ratio of observed occurrences to what would be expected if things were at chance. There are 686 tweets with at least one basketball term and lol. Since 12.1% of all the basketball tweets have bball in them, if everything were random, we’d expect 686*12.1%=83 tweets with both lol and bball. Instead we observe 135, which is about 1.6 times what we’d expect by chance (roughly 63% more).
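The observed/expected calculation can be sketched in a few lines, using only the counts quoted in the paragraph above. The function and parameter names are made up for this illustration.

```python
def observed_over_expected(observed: int, cooccurring_total: int, term_share: float) -> float:
    """Ratio of the observed co-occurrence count to the count expected by chance.

    cooccurring_total: tweets containing the context word (e.g. lol) plus any
                       of the three basketball terms.
    term_share:        share of all basketball-term mentions that use the
                       term of interest.
    """
    expected = cooccurring_total * term_share
    return observed / expected


# Figures from the text: 686 tweets with lol, 12.1% share for bball, 135 observed.
expected = 686 * 0.121
ratio = observed_over_expected(observed=135, cooccurring_total=686, term_share=0.121)
print(f"expected={expected:.1f}, ratio={ratio:.2f}")  # expected=83.0, ratio=1.63
```

A ratio of 1.0 means the pair co-occurs exactly as often as chance predicts; values above or below 1.0 indicate attraction or avoidance.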
The further an observed/expected value is from 1.0, the more something is going on. For example, people add links to tweets a lot more when they are talking about hoops than when they are talking about bball. And as a friend of mine guessed, hoops is most commonly used about college hoops. For nba+bball, there are only half as many occurrences as we’d expect. Meanwhile, watch/watches/watching/watched is an example of something that people are equally likely to use with all three terms.
Some other findings:
- People seem to talk more about what they are doing/feeling with basketball and bball than with hoops.
- Bball goes especially with getting/going/coming “out” physically or checking something “out” virtually. (At the time of the dataset, this was just a motion-type of coming out, not a metaphorical closet coming out. Your turn: which basketball terms are used most with the Jason Collins story?)
- If we pull the target words out of tweets and calculate their lengths we see that basketball tweets are shorter (84.9 characters on average compared to 96.9 for hoops and 96.2 for bball). Note that people who talk about the #knicks only had 72.8 characters on average—you can imagine that this has to do with fast typing thumbs speeding along during a game and not being all that wordy. (There are other stories available if you dislike the #knicks, but I’m not telling them today.)
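The length comparison in the last bullet could be computed along these lines. This is a sketch under assumptions: the post does not say whether the leftover whitespace was collapsed after removing the target word, and the names `length_without_targets` and `mean_length` are hypothetical.

```python
import re
from statistics import mean

TARGET = re.compile(r"\b(?:basketball|bball|hoops)\b", re.IGNORECASE)


def length_without_targets(tweet: str) -> int:
    """Character length of a tweet after removing the three target words."""
    stripped = TARGET.sub("", tweet)
    # Assumption: collapse the whitespace left behind by the removed word.
    return len(" ".join(stripped.split()))


def mean_length(tweets: list) -> float:
    """Average target-free length over a non-empty list of tweets."""
    return mean([length_without_targets(t) for t in tweets])


example = mean_length(["playing bball tonight", "hoops"])
```

Removing the target word first keeps the comparison fair, since basketball is itself five characters longer than bball or hoops.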
In addition to lumping words, we might wonder about lumping people. Rather than tell you how, say, men and women are using these terms, let’s instead cluster people based on words they use in common (for more about gender, social theory and computational methods, see this paper).
People who use a lot of African American Vernacular English terms across all of their tweets (e.g., finna) tend to like bball and dislike hoops. But whoa-there-on-monolithic-statements, there’s more than one AAVE-heavy cluster and the different clusters use the terms at different rates. What’s more, a cluster of tech people who especially talk about api’s, ios, ui’s, portal’s, plugin’s, and developers also like bball and dislike hoops. And people who talk about startups and brewing like bball (but lack the antipathy for hoops; they use it at chance).
People who talk about #socialmedia, linkedin, #photo, seo, webinar’s, infographic’s, and klout really like to use hoops and really avoid bball. The same pattern holds for a separate cluster of users who talk about #americanidol, hipsters, #oscars, and #goldenglobes. If you were talking to either of these groups, go for hoops.
To get a sense of how much variation there is in our three terms, let’s see what happens when we limit ourselves to people who use at least one of the terms in five or more tweets. There are 586 such people.
- 193 use only basketball
- 4 use only hoops
- 2 use only bball
- 167 use basketball and bball
- 135 use basketball and hoops
- 85 people use all three
In other words, these terms are in some sort of variation for about 66% of people.
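The 66% figure follows directly from the breakdown above. The sketch below recomputes it from the reported counts; `tally_patterns` is a hypothetical helper showing how such a breakdown could be derived from per-user term lists.

```python
from collections import Counter

# Counts reported in the list above, keyed by the set of terms each person uses.
pattern_counts = {
    frozenset({"basketball"}): 193,
    frozenset({"hoops"}): 4,
    frozenset({"bball"}): 2,
    frozenset({"basketball", "bball"}): 167,
    frozenset({"basketball", "hoops"}): 135,
    frozenset({"basketball", "bball", "hoops"}): 85,
}

total = sum(pattern_counts.values())
# A person shows variation if they use more than one of the three terms.
varying = sum(n for terms, n in pattern_counts.items() if len(terms) > 1)
share = varying / total
print(total, varying, f"{share:.1%}")  # 586 387 66.0%


def tally_patterns(terms_by_user: dict) -> Counter:
    """Count users per distinct set of terms (hypothetical input format)."""
    return Counter(frozenset(terms) for terms in terms_by_user.values())
```

Note that no one in this group uses only bball and hoops together; every person who varies also uses basketball.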
Because different people talk differently, and the same people talk differently at different times, one of the key tasks for natural language understanding and text analysis is to find insights in the differences.
– Tyler Schnoebelen (@TSchnoebelen)