“Superb!” — often, a one-word review like this one encapsulates the singularly superlative experience that one of our diners had at an OpenTable restaurant.

While we do see a few of these extremely terse declarations of satisfaction pop up here and there, the typical OpenTable review is more verbose. Our reviewing customers often hit the 2000-character upper limit to pour out their hearts. They are passionate about every aspect of fine dining, and leave detailed, nuanced, and constructive accounts of their dining experience.

The level of care we see in these reviews makes them unquestionably one of the most important sources of insight into the ecosystem of restaurants and diners. Potential diners will often go through hundreds of reviews to help them decide where to dine next. A restaurateur, on the other hand, will always keep a sharp eye out for reviews to gauge how their business is doing, and what, if anything, needs more work.

Reviews can also be a great way to familiarize oneself with the dining scene in a new neighborhood, city, or country. Reviews can inform the diner about aspects of a restaurant that are not obvious from its description — is it a local gem or a tourist trap? Does this restaurant have a view? If so, a view of what, and is the view particularly stunning during sunset? Is the service friendly?

## Mining Reviews

Mining reviews for insights is an obvious thing to do, but it is not necessarily easy. As is usually the case with unstructured data, there is a lot of information buried in a lot of noise.

Once you have read through a few reviews, you start getting the sense that there are only a handful of broad categories that people write about. These range from food- and drink-related comments to sentences devoted to ambiance, service, value for money, special occasions, and so on. Within a category like, say, ambiance, you come across distinct themes such as live music, decor, and views. One review may contain only one or two of these themes (say, seafood and views) while another may contain several.

It is therefore only natural that one of the first things one would like to extract from a corpus of reviews is the set of themes that occur across them. In more technical parlance, what we have been calling themes are known as topics, and the technique of learning these topics from a corpus of documents is called Topic Modeling.

Suppose all of our reviews are generated from a fixed vocabulary of, say, 100,000 words, and we learn 200 topics from this corpus. Each topic is a distribution over this vocabulary. What makes one topic different from another are the weights with which each word occurs in it.

In the image above, we display a sample of six topics learned from our review corpus. We show the top 25 words from each topic, with the size of each word scaled proportionally to the importance of that word in the topic. It takes no effort to see how tightly knit each topic is, and what it is about. Just by looking at them, we know that the first one is about steakhouse food, the second about wine, the third about live music, and the next three about desserts, the bar scene, and views.

When topic modeling is performed, these topics basically just fall out. That this can be achieved is remarkable, given that we did not have to label or annotate the reviews beforehand, or tell the algorithm that we are working in the space of restaurant reviews. We simply throw all the reviews into the mix, and out come these topics.

A byproduct of topic modeling is the set of weights with which each topic is associated with each review. For example, consider the following three reviews:

1. “They had an extensive wine list to choose from, and we each ordered a glass of the 1989 Opus One to pair with our NY strip steaks. We sat near the live jazz band.”
2. “The view of the sunset over the ocean was spectacular, while we sat there savoring the dark chocolate pudding meticulously paired with the wine by our very knowledgeable sommelier.”
3. “The restaurant was crowded so we sat at the bar. The bartender whipped up some amazing cocktails for us. There was blues playing in the background.”

It is easy to see that Review 1 mainly draws from the wine, steakhouse, and live music topics, while the other topics like desserts or views have zero weight in this review. Review 2, on the other hand, is about the view topic, a bit about the desserts topic, and again the wine topic. Review 3 draws mostly from the bar scene and live music topics.

The intuition here is that documents, in our case reviews, are composed of multiple topics. The share of topics in each review is different. Each word in each review comes from one of the topics in that review's topic distribution.
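This generative intuition can be sketched in a few lines of Python. The two topics, the five-word vocabulary, and the 70/30 topic split below are all made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy topics over a tiny five-word vocabulary.
vocab = ["wine", "steak", "sunset", "view", "band"]
topics = np.array([
    [0.5, 0.4, 0.0, 0.0, 0.1],   # a "food and drink" topic
    [0.0, 0.0, 0.5, 0.4, 0.1],   # a "views" topic
])

# A hypothetical review that draws 70% from topic 0 and 30% from topic 1.
review_topic_share = np.array([0.7, 0.3])

# Generate each word: first pick a topic from the review's topic
# distribution, then pick a word from that topic's word distribution.
words = []
for _ in range(10):
    z = rng.choice(2, p=review_topic_share)
    words.append(rng.choice(vocab, p=topics[z]))
print(words)
```

Running this produces a bag of mostly food-and-drink words with a sprinkling of view words, which is exactly the mixing behavior described above.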

Next, we discuss how in practice we learn topics from a review corpus.

## From Reviews to Topics using Matrix Factorization

A popular approach to topic modeling is Latent Dirichlet Allocation (LDA). A very approachable and comprehensive review of LDA can be found in this article by David Blei. Here, I am going to use an alternative method, based on Non-negative Matrix Factorization (NMF).

### Bag-of-words

To see how reviews can be put in matrix form, consider again the three reviews above. A usual first step is to remove stop words — terms that are too common, such as “a”, “and”, “of”, “to”, “that”, “was”, etc. Now consider all the tokens left in these three reviews — we have 39 of them. So we can express the reviews as a 3 by 39 matrix, where the entries are the counts, or term frequencies (tf), of each token in a review. The matrix looks like the following:

### TF-IDF

Note that while a word like bartender is unique to one review, the word sat appears in all three, and should have less weight in the matrix as it is less distinctive. To achieve this, one usually multiplies these term frequencies by an inverse document frequency (idf), defined as $\log\left(\frac{n}{1+m(t)}\right)$, where $n$ is the number of documents in the corpus and $m(t)$ is the number of documents in which the token $t$ occurs. If a token occurs in all documents, the ratio inside the logarithm is close to unity, which makes the logarithm close to zero.

Here is what the matrix looks like after tf-idf. Note that the word “sat” now has much lower importance relative to other words.
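To make the effect concrete, here is the idf formula above evaluated for our toy corpus of three reviews. (Scikit-learn actually uses a smoothed variant, $\log\left(\frac{1+n}{1+m(t)}\right) + 1$, which never goes negative; the slightly negative value below for "sat" is an artifact of our tiny $n$, and for a large corpus it would be near zero as described.)

```python
import numpy as np

n = 3  # documents in our toy corpus

def idf(m_t, n=n):
    """idf as defined above: log(n / (1 + m(t)))."""
    return np.log(n / (1 + m_t))

# "sat" occurs in all three reviews, "bartender" in only one:
print(round(idf(3), 2))  # → -0.29 (near zero; a token in every document is uninformative)
print(round(idf(1), 2))  # → 0.41  (a rarer token gets boosted)
```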

### NMF

In practice, the document-term matrix $\bf{D}$ can be quite big: $n$ documents tall and $v$ tokens wide, where $n$ can be several million and $v$ several hundred thousand. One more step usually performed to precondition the matrix is to normalize each row so that the squares of its elements add up to unity (other normalizations are also possible).
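This row normalization can be sketched as follows; `TfidfVectorizer` applies exactly this L2 normalization by default (its `norm="l2"` setting):

```python
import numpy as np

# L2-normalize each row of a small tf-idf-like matrix so that the
# squares of the elements in each row sum to one.
D = np.array([[3.0, 4.0, 0.0],
              [0.0, 1.0, 1.0]])
D_norm = D / np.linalg.norm(D, axis=1, keepdims=True)

print(D_norm[0])                   # first row becomes [0.6, 0.8, 0.0]
print((D_norm ** 2).sum(axis=1))   # every row now sums to 1 in squares
```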

Matrix Factorization (MF) takes such a matrix $\bf{D}$ of dimension $[n \times v]$ and approximates it as a product of two low-rank matrices: an $[n \times k]$ matrix $\bf{W}$, and a $[k \times v]$ matrix $\bf{T}$, where $k$ is a small number, typically in the tens to a few hundreds. This is shown schematically below:

NMF is a variant of MF where we start with a matrix $\bf D$ with non-negative entries, like our document-term matrix, and also constrain the elements of $\bf W$ and $\bf T$ to be non-negative.

Everything being non-negative lets us interpret the factorization in an additive sense, and interpret each row of the $\bf T$ matrix as a  topic. This is how it works:

Let’s take the first row of $\bf D$. That is essentially our first review, expressed as a vector of length $v$. Remembering how matrix multiplication works, what the above relation tells us is that we can reconstruct this review approximately by linearly combining the $k$ rows of $\bf T$ with weights taken from the first row of $\bf W$ – the first element of that row multiplying the first row of $\bf T$, the second element multiplying the second row of $\bf T$, and so on.

Each row of $\bf T$ is a distribution over the $v$ terms in the vocabulary, and is easily interpreted as one of the topics described in the earlier section. What this factorization says is that each of the $n$ reviews (rows in $\bf D$) can be built up as a different linear combination of the $k$ topics (rows in $\bf T$).

So there we have it: $\bf W$ expresses the share of topics in each review, while each row of $\bf T$ represents a topic.
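A minimal sketch of this factorization on a small random non-negative matrix (the dimensions here are made up for illustration; the `NMF` class is the same scikit-learn class used in the code section below):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
D = rng.random((6, 8))              # a tiny non-negative "document-term" matrix

nmf = NMF(n_components=3, init="nndsvd", max_iter=500)
W = nmf.fit_transform(D)            # [6 x 3]: topic shares per document
T = nmf.components_                 # [3 x 8]: topics as distributions over terms

# The first document is (approximately) a weighted sum of the topic rows:
reconstruction = W[0] @ T           # = W[0,0]*T[0] + W[0,1]*T[1] + W[0,2]*T[2]
print(np.allclose(reconstruction, (W @ T)[0]))   # True
```

Because every entry of $\bf W$ and $\bf T$ is constrained to be non-negative, the reconstruction is purely additive, which is what makes the per-document weights interpretable as topic shares.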

### Code

Here is some Python code to perform these steps:

```python
import numpy as np
import sklearn.feature_extraction.text as text
from sklearn import decomposition

# This step performs the tokenization, tf-idf weighting,
# stop word removal, and L2 row normalization.
# It assumes docs is a Python list with reviews as its elements.
cv = text.TfidfVectorizer(stop_words='english')
doc_term_matrix = cv.fit_transform(docs)

# The tokens can be extracted as:
vocab = cv.get_feature_names_out()

# Next we perform the NMF with 20 topics.
num_topics = 20

# doctopic is the W matrix; decomp.components_ holds T.
decomp = decomposition.NMF(n_components=num_topics,
                           init='nndsvd')
doctopic = decomp.fit_transform(doc_term_matrix)

# Now we loop through each row of the T matrix, i.e. each topic,
# and collect the top 25 words from each topic.
n_top_words = 25
topic_words = []
for topic in decomp.components_:
    idx = np.argsort(topic)[::-1][:n_top_words]
    topic_words.append([vocab[i] for i in idx])
```