What does the sentence “The Sicilian gelato was extremely rich” have in common with “The Italian ice-cream was very velvety”?
To a human, the two sentences (incidentally taken from two different reviews of the same restaurant) share a key theme: this restaurant has a gelato dish diners are raving about. But consider this from the perspective of a machine: apart from the words “the” and “was” (which are ubiquitous across reviews and considered stop-words), the sentences have no words in common. How can we teach a machine that these two sentences have similar themes?
Enter the “Word Mover’s Distance” (WMD), introduced in a recent paper by Kusner et al. (2015). The paper describes a new approach to clustering documents by defining the distance between them in terms of the vector embeddings of the words that make up the documents, à la Word2Vec (Mikolov et al. 2013a,b).
The Word2Vec model generates word embeddings in a vector space that often preserves semantic relationships between words. For example, the vector vec(Japan) – vec(sushi) can be seen as representing a country-to-iconic-food relationship, and vec(sushi) – vec(Japan) + vec(Germany) lands close to vec(bratwurst), identifying bratwurst as the likely iconic food of Germany (Mikolov et al. 2013a,b). For an insightful review of the Word2Vec method I refer you to the excellent blog post by our colleagues at StitchFix. The word embedding is learned in an entirely unsupervised manner, and can be precomputed for use in the WMD step.
Word2Vec also places word vectors that are semantically similar close to each other in the embedding space. For instance, if we started with vec(Italian) and looked for the vectors nearest to it, we would find vec(Sicilian) close by. Similarly, the vectors for gelato and ice-cream are going to be close to each other. So going back to the two sentences above, using Word2Vec distances between words, we could teach a machine that the two sentences are semantically related. That is exactly what the WMD algorithm sets out to do. Schematically, this is how the machine learns how the two sentences about gelato are connected:
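To make the “nearest vectors” idea concrete, here is a minimal sketch using tiny, hand-invented 3-dimensional “embeddings” and cosine similarity. A real Word2Vec model learns dense vectors (typically hundreds of dimensions) from a large corpus; the words and numbers below are purely illustrative assumptions:

```python
import math

# Toy, hand-crafted "embeddings" purely for illustration. A trained
# Word2Vec model would learn these vectors from a large review corpus.
vectors = {
    "italian":  [0.9, 0.8, 0.1],
    "sicilian": [0.8, 0.9, 0.2],
    "gelato":   [0.1, 0.9, 0.9],
    "icecream": [0.2, 0.8, 0.9],
    "service":  [0.9, 0.1, 0.1],
}

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word):
    # Rank every other word by cosine similarity to `word`.
    others = [w for w in vectors if w != word]
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))

print(most_similar("italian"))  # -> sicilian
print(most_similar("gelato"))   # -> icecream
```

In this toy space, “sicilian” is the nearest neighbor of “italian” and “icecream” of “gelato”, mirroring the semantic closeness a real model exhibits.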
This figure is inspired by Fig. 1 of Kusner et al. (2015). It shows schematically how each word in the first document lies with respect to the closest words in the other document within the word2vec embedding space. The distance between the two documents can be computed as the minimum cumulative distance that words from the first document need to travel to exactly match the second document.
Incidentally, this optimization problem maps on to a well studied problem in transportation theory whose solution is the so-called Earth Mover’s Distance. Kusner et al. (2015) essentially adapts the EMD algorithm to work with words in documents — hence the name Word Mover’s Distance. The paper shows that the WMD metric outperforms seven state-of-the-art baseline metrics in terms of k-nearest neighbor document classification errors.
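As a sketch of what “minimum cumulative travel distance” means, here is a brute-force version of the idea for the simplified case of two equal-length documents with uniform word weights (in that case the optimal transport plan is a one-to-one matching, so we can enumerate all matchings). The 2-D vectors are invented for illustration; the real algorithm solves the general transportation problem over learned embeddings:

```python
import math
from itertools import permutations

# Toy 2-D "embeddings", invented for illustration only.
vec = {
    "sicilian": (0.0, 0.0), "gelato":   (1.0, 0.0), "rich":    (2.0, 0.0),
    "italian":  (0.1, 0.3), "icecream": (1.1, 0.3), "velvety": (2.1, 0.3),
}

def wmd_brute_force(doc1, doc2):
    """Minimum cumulative distance to move doc1's words onto doc2's words.

    Assumes uniform word weights and equal-length documents, so the
    optimal transport plan is a one-to-one matching we can brute-force.
    """
    assert len(doc1) == len(doc2)
    n = len(doc1)
    return min(
        sum(math.dist(vec[w], vec[doc2[i]]) for w, i in zip(doc1, perm)) / n
        for perm in permutations(range(n))
    )

d = wmd_brute_force(["sicilian", "gelato", "rich"],
                    ["italian", "icecream", "velvety"])
print(round(d, 3))  # small: each word has a close counterpart nearby
```

In this toy layout the optimal plan matches sicilian→italian, gelato→icecream, and rich→velvety, so the two “sentences” end up very close, exactly the behavior we want for the gelato reviews.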
Application to Restaurant Reviews
At OpenTable, the reviews left by diners are usually rich, and touch upon many aspects of the dining experience: food, service, ambiance, and more. Prospective diners often scan through the reviews looking for all mentions of one aspect. For example, if a restaurant has a slightly lower-than-average rating on service, diners will comb through the reviews trying to find the common thread of why the service was rated low. If a restaurant is getting rave reviews for, say, cocktails, a potential diner would like to scan only the sentences that mention cocktails.
To explore to what extent WMD can be applied to such a use case, we ran Word2Vec on about 5 million of our reviews, then picked one restaurant and generated the WMD distance matrix for all its reviews. We then picked a sentence that expresses a certain concept and found all sentences closest to it. Below is an example of this approach at work, looking at reviews for a restaurant which might have noise issues.
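We cannot share the trained model or review data here, but the querying step can be sketched with a cheap one-directional relaxation of WMD described in Kusner et al. (2015): each query word simply travels to its nearest word in the candidate sentence, which gives a lower bound on the full transport cost. The embeddings below are invented stand-ins for vectors learned from review text:

```python
import math

# Invented 2-D embeddings standing in for vectors learned from reviews.
vec = {
    "loud": (0.0, 0.0), "noisy": (0.1, 0.1),
    "speak": (1.0, 0.0), "conversation": (1.1, 0.1),
    "food": (5.0, 5.0), "great": (5.5, 4.5),
}

def rwmd(doc1, doc2):
    """Relaxed WMD: each word in doc1 travels to its nearest word in doc2.

    This is a cheap lower bound on the full transport problem, useful for
    quickly ranking candidate sentences against a query sentence.
    """
    return sum(
        min(math.dist(vec[w1], vec[w2]) for w2 in doc2) for w1 in doc1
    ) / len(doc1)

query = ["loud", "speak"]
sentences = [
    ["noisy", "conversation"],
    ["great", "food"],
]
ranked = sorted(sentences, key=lambda s: rwmd(query, s))
print(ranked[0])  # the "noisy conversation" sentence ranks closest
```

Even though the top-ranked sentence shares no words with the query, its words sit right next to the query's words in the embedding space, which is exactly the behavior shown in the examples below.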
The first sentence is the query sentence, while the others are its closest neighbors according to the WMD metric. Notice that although some sentences contain the word “loud” explicitly, many do not. For example, the closest sentence uses “noisy” and “conversation”, which are semantically very close to “loud” and “speak” respectively. Further down, words like “communication”, “talking”, and “difficult” pull in those sentences via semantic similarity. Pretty remarkable!
Here is another example, where people are talking about the restaurant not living up to the hype:
It is interesting to see how, by associating other words with the concept of hype, the algorithm learns the various ways people express the same opinion.
And here are a bunch of reviews that navigate the concept of drinks, from beers to wines to cocktails to even Pinot Noir!
We’re looking forward to putting the Word Mover’s Distance to work!