Generating fake Yelp reviews with natural language analysis and deep learning.
By: Asaph Kupferman and Jesse Xu
The past 10 years have seen a dramatic rise in the prevalence of digital platforms and in the influence they exert on our lives. For the most part, this has been hugely beneficial; we can stay socially connected with our middle school friends via Facebook, browse cat videos posted by someone in South Korea, or even read thousands of people’s written impressions of a product before deciding whether to purchase it.
However, the nature of these platforms allows for extreme anonymity; someone can easily create a false digital footprint on any of them. Profiles can be created with random pictures found on the internet, and reviews can be completely fabricated. The inherent separation between content creator and consumer created by the internet allows potentially nefarious actors to easily distort reality. It’s fairly well-documented that Instagram influencers (or even businesses trying to build a following on the platform) will “buy” followers for their page who do not actually exist outside of a “bot farm” somewhere across the globe. Similarly, it’s known that Amazon sellers will try to engage in similar practices to boost their product’s reviews/ratings. This is sometimes done by incentivizing customers (via direct cash payments) to leave glowing reviews, but positive reviews are also often generated by bots. During the 2016 United States Presidential Election, Russia famously ran a massive social media campaign through Facebook that reached nearly 126 million people. This continued throughout the 2020 United States election.
We were inspired after seeing the Yelp Dataset, a massive accumulation of millions of reviews and businesses. In the current age of social media “fakeness,” we wondered: could we also reliably produce “fake” restaurant reviews that might sway a potential customer?
Even this simple search of “Best Brunch in Philadelphia” inundates me with thousands of convincing, heartfelt reviews (both extremely positive and extremely negative) motivated by experiences individuals have had at some of these Philadelphia brunch establishments.
Given the impressive advances in Natural Language Processing in recent years (see: GPT-3), and the overwhelming amount of existing resources invested in manufactured authenticity on the internet, we thought it’d be interesting to explore “faking” Yelp reviews.
Luckily, Yelp did a good job organizing their dataset and provided ample documentation for understanding its structure. The dataset comprises 6 JSON files, each corresponding to a different piece of information. For our analysis, we were only concerned with “business.json” and “review.json.” These files contain, respectively, information on the businesses included in the dataset (name, address, business category, various attributes, etc.) and the reviews stored on Yelp with their relevant attributes (user, business, etc.). We uploaded the files to an Amazon S3 bucket (Amazon Web Services’ Simple Storage Service) and then used Apache Spark on an Amazon EMR cluster (Elastic MapReduce, a service for distributing large computations across multiple Amazon Elastic Compute Cloud instances, which are essentially virtual machines) to load the .json files into Spark DataFrames. This was necessary because of the sheer size of these files, which run to millions of records (with the review text column holding full paragraphs!); attempting any operations in Google Colab quickly exhausted the available RAM and disconnected the runtime. Thus, we had to run everything in Spark.
The dataset consisted of 209,393 businesses listed in business.json. However, we needed to filter for only restaurants, which we did by only keeping businesses that included one of a master list of tags we compiled. After filtering, we were left with 26,789 restaurants, and the dataset looked like this:
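To make the tag filter concrete, here is a minimal pure-Python sketch of the idea (we ran the real thing in Spark). The tag set below is a hypothetical subset, since our full master list isn't reproduced here, and the sketch assumes each record stores its categories as a comma-separated string.

```python
# Hypothetical subset of a master tag list; the full list is not shown in this article.
RESTAURANT_TAGS = {"Restaurants", "Pizza", "Mexican", "American (New)", "Fast Food"}

def is_restaurant(business, tags=RESTAURANT_TAGS):
    """Return True if at least one of the business's categories is in the tag set."""
    raw = business.get("categories") or ""  # assumption: categories may be missing or null
    categories = {c.strip() for c in raw.split(",") if c.strip()}
    return bool(categories & tags)

# Toy records shaped like lines of business.json (all values are made up).
businesses = [
    {"business_id": "b1", "name": "Taco Spot", "categories": "Mexican, Restaurants"},
    {"business_id": "b2", "name": "Quick Lube", "categories": "Automotive, Oil Change Stations"},
    {"business_id": "b3", "name": "No Tags Inc", "categories": None},
]

restaurants = [b for b in businesses if is_restaurant(b)]
print([b["name"] for b in restaurants])
```

The same predicate could be expressed as a DataFrame filter in Spark; the set intersection is the whole trick.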
Moving on to review.json, importing the dataset into Spark using a pre-defined schema yielded the following table:
This dataset is massive, containing 8,021,122 reviews of Yelp businesses (each one being multiple sentences, so at least 25 million sentences and 200+ million words!). One nice thing about the dataset is that each review has a tag identifying the business it refers to. Therefore, we can use a simple Inner Join to combine the two datasets and drop any reviews not related to a business we’re interested in. Now we can begin our Exploratory Data Analysis (EDA).
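For readers who haven't used joins before, the inner join boils down to the following sketch. It is plain Python rather than Spark, the records are made up, and the `business_` prefix on the copied fields is our own illustrative choice to avoid name clashes.

```python
def inner_join(reviews, restaurants, key="business_id"):
    """Keep only reviews whose business appears in `restaurants`, attaching business fields."""
    by_id = {b[key]: b for b in restaurants}
    joined = []
    for review in reviews:
        business = by_id.get(review[key])
        if business is None:
            continue  # review belongs to a business we filtered out
        merged = dict(review)
        # Prefix business fields so they cannot overwrite review fields.
        merged.update({f"business_{k}": v for k, v in business.items() if k != key})
        joined.append(merged)
    return joined

# Toy data (made up for illustration).
toy_restaurants = [{"business_id": "b1", "name": "Taco Spot", "city": "Phoenix"}]
toy_reviews = [
    {"review_id": "r1", "business_id": "b1", "stars": 5, "text": "Great tacos"},
    {"review_id": "r2", "business_id": "b9", "stars": 1, "text": "Oil change took forever"},
    {"review_id": "r3", "business_id": "b1", "stars": 2, "text": "Cold food"},
]

joined = inner_join(toy_reviews, toy_restaurants)
print(len(joined), joined[0]["business_name"])
```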
Exploring The Data
The Businesses Dataset
First, let’s explore the business category. Where are these businesses located? The Yelp Dataset mentions that it contains information for businesses in “10 Metropolitan Areas,” but doesn’t actually list what these metropolitan areas are! Briefly glancing at the image of the businesses dataset above shows that each record contains the “city” and “state” associated with each business. Right off the bat, Montreal and Las Vegas seem to align with expectations. However, I’ve never heard of the “Scottsdale, AZ” metropolitan area. A simple Google Maps search shows Scottsdale:
So Scottsdale, AZ is actually a city contained within the Phoenix, AZ metropolitan area! This makes a lot more sense. However, how do we systematically find the 10 prevalent groups that each business belongs to?
In this case, there are many ways to skin our cat. The most straightforward machine-learning approach would be to apply a clustering algorithm to the dataset (based on the latitude/longitude coordinates, or the Zip Codes) and see what comes up. However, before doing that, I thought it’d be worth investigating these restaurants by geographic location in a bit more detail:
Interesting, let’s see what the breakdown of restaurants is by state:
It looks like the restaurants in the dataset are primarily situated in Ontario, Arizona, Nevada, Quebec, Ohio, North Carolina, Pennsylvania, Alberta, Wisconsin, Illinois, and South Carolina, with the other records appearing to be incorrectly listed. However, why do we have 11 states with a substantial number of listed businesses, if we only have 10 metropolitan areas represented in the dataset?
In order to tease out the large metropolitan areas, I first looked up the biggest cities in each of these states. For example: Phoenix in Arizona; Toronto and Ottawa in Ontario; Chicago, Aurora, and Urbana-Champaign in Illinois; Philadelphia and Pittsburgh in Pennsylvania; and so on. Then, for each state, I found all of the unique cities listed, and sorted them by number of reviews:
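The per-state ranking can be sketched roughly like this. It assumes each business record carries `state`, `city`, and `review_count` fields; the normalization of city spelling is our own addition for the sketch, and the numbers are made up.

```python
from collections import defaultdict

def cities_by_review_count(businesses, state):
    """Total review counts per city within one state, largest first."""
    totals = defaultdict(int)
    for b in businesses:
        if b["state"] == state:
            # Normalize casing/whitespace so "phoenix" and "Phoenix " count together.
            totals[b["city"].strip().title()] += b["review_count"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Toy records (made-up numbers, not from the dataset).
toy_businesses = [
    {"state": "AZ", "city": "Phoenix", "review_count": 120},
    {"state": "AZ", "city": "Scottsdale", "review_count": 80},
    {"state": "AZ", "city": "phoenix ", "review_count": 30},
    {"state": "NV", "city": "Las Vegas", "review_count": 500},
]

ranking = cities_by_review_count(toy_businesses, "AZ")
print(ranking)
```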
I repeated this process systematically for each state represented, usually yielding one likely candidate for each state. However, something interesting happened when investigating South Carolina:
There are only 9 cities listed in South Carolina! However, looking into these cities yielded an interesting result:
These “cities” (really a misnomer; many are simply towns) are considered part of the Charlotte metropolitan area! Using this approach, we have found our 10 metropolitan areas: Toronto, Phoenix, Las Vegas, Cleveland, Charlotte, Pittsburgh, Calgary, Madison, Montreal, and Urbana-Champaign. We can simply sort the restaurants by state (counting SC businesses as NC) and continue:
What are the most prevalent types of restaurants in each city? Here are the top 5 for each, listed:
Does it surprise anyone that 5/7 American cities have “Pizza” or “Fast Food” as the most common type of restaurant? Here are those same results, visualized, but for the top 15 types:
What is the average rating and standard deviation by city?
Looks like Montreal has, by far, the best food and the least polarizing set of reviewed restaurants.
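A per-city average and spread like the one above can be computed along these lines. This is a sketch with made-up ratings; the `metro` field name is hypothetical, and we use the population standard deviation here, which may differ slightly from what a given plotting or DataFrame library reports.

```python
from collections import defaultdict
from statistics import mean, pstdev

def rating_stats_by_city(restaurants):
    """Map each metro area to (mean stars, population standard deviation of stars)."""
    stars = defaultdict(list)
    for r in restaurants:
        stars[r["metro"]].append(r["stars"])
    return {city: (mean(values), pstdev(values)) for city, values in stars.items()}

# Toy ratings (made up for illustration only).
toy_restaurants = [
    {"metro": "Montreal", "stars": 4.0},
    {"metro": "Montreal", "stars": 4.5},
    {"metro": "Montreal", "stars": 4.0},
    {"metro": "Las Vegas", "stars": 2.0},
    {"metro": "Las Vegas", "stars": 5.0},
]

stats = rating_stats_by_city(toy_restaurants)
for city, (avg, spread) in stats.items():
    print(f"{city}: mean={avg:.2f}, std={spread:.2f}")
```

A high mean with a low standard deviation is what we mean by "best food" and "least polarizing."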
The Reviews Dataset
Now, let’s bring our reviews into the fray. We’re going to use a package called TextBlob to analyze the sentiment associated with each of the reviews. This library will assign each review a sentiment (polarity) score ranging from -1 to 1, where -1 is extremely negative and +1 is extremely positive. For each review, we’d assume that the sentiment correlates positively with the number of stars given by the reviewer.
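Here is a rough sketch of how one might check that assumption by averaging sentiment per star rating. To keep it runnable without extra packages, it uses a tiny made-up word list as a stand-in scorer; with TextBlob installed, you could instead pass `lambda t: TextBlob(t).sentiment.polarity` as `score_fn`. The reviews are toy examples, not data from Yelp.

```python
from collections import defaultdict

# Tiny stand-in lexicon (hypothetical) so the sketch runs with the standard library only.
POSITIVE = {"great", "delicious", "friendly", "amazing"}
NEGATIVE = {"cold", "slow", "rude", "horrible"}

def toy_polarity(text):
    """Crude score in [-1, 1]: (positive hits - negative hits) / total hits."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

def mean_polarity_by_stars(reviews, score_fn=toy_polarity):
    """Average sentiment score for each star rating, ordered by stars."""
    buckets = defaultdict(list)
    for r in reviews:
        buckets[r["stars"]].append(score_fn(r["text"]))
    return {stars: sum(v) / len(v) for stars, v in sorted(buckets.items())}

toy_reviews = [
    {"stars": 5, "text": "amazing food and friendly staff"},
    {"stars": 3, "text": "great tacos but slow service"},
    {"stars": 1, "text": "cold food and slow rude service"},
]

by_stars = mean_polarity_by_stars(toy_reviews)
print(by_stars)
```

If the assumption holds, the averages should climb as the star rating goes up.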
We can see here that our assumption is correct. Furthermore, we’re curious what the average sentiment for each city’s reviews is, and how that relates to the average rating of the city’s restaurants.
Not much variability, aside from Toronto being the clear outlier (so I guess Canadians are not as polite as we thought!). Looking deeper, which cuisines tend to have the “nicest reviews”?
There is a clear correlation between sentiment and restaurant rating. It also seems that restaurants with a more niche market tend to average higher in user reviews.
It’d also be worth investigating the time trends for both rating and number of reviews. These figures would indicate a rise or decline in popularity over time. Since the dataset is so massive (even these simple plots took hours of pure runtime to produce), we settled on a few top categories to illustrate the trend:
Interestingly, the number of reviews follows a consistent trend across all food types, but the average rating fluctuates. The “total reviews” trend is likely more indicative of Yelp’s popularity as a platform than of the popularity of the respective food type.
Now, let’s investigate the most frequent words used in good/bad reviews. To do this, we’ll first split the dataset into good (4 or 5 star) and bad (1 or 2 star) reviews. Due to the sheer size of the dataset (millions of paragraph-long reviews), we’ll have to work with a smaller sample and extrapolate. In our case, for consistency, we’ve selected two different cuisines (New American and Mexican), and completed the good/bad split described above.
However, before doing so, we need to normalize and clean our review text! As is standard in NLP (Natural Language Processing), we will tokenize the text (splitting it up by word), make everything lowercase, remove punctuation and other non-alphanumeric characters, and remove “stopwords” (the most common words in the English language, such as “the”, “is”, etc.). These are taken from nltk’s “stopwords” corpus. Finally, we lemmatize each word, which essentially means finding the root of the word. This stops us from treating words like “walk” and “walking” as occurring separately; for the sake of our analysis, we want them to be treated as equivalent.
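The cleaning steps can be sketched as follows. To stay self-contained, this uses a tiny hand-written stopword set and a very crude suffix-stripper as stand-ins for nltk's stopword corpus and a real lemmatizer; both stand-ins are assumptions for illustration and are much less accurate than the real tools.

```python
import re
from collections import Counter

# Tiny stand-in for nltk's English stopword list (hypothetical subset).
STOPWORDS = {"the", "is", "a", "an", "and", "was", "to", "of", "it", "i", "we", "for"}

def toy_lemmatize(word):
    """Very rough stand-in for a lemmatizer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean(text):
    """Lowercase, keep only alphanumeric tokens, drop stopwords, then reduce to roots."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_lemmatize(t) for t in tokens if t not in STOPWORDS]

cleaned = clean("We waited 20 minutes and the server was walking slowly!")
print(cleaned)

# Word frequencies across a batch of (toy) reviews.
counts = Counter(w for review in ["Walking walks walked", "I walked"] for w in clean(review))
print(counts.most_common(3))
```

Once every review is reduced to a list of roots like this, counting the most frequent words per group is a one-liner with `Counter`.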
Here are the results for New American restaurants:
We do the same thing for Mexican restaurants:
A lot of similarities, especially for the bad reviews. It seems that reviewers tend to use more unique and distinct vocabularies to describe positive experiences, but share a common pool of “negative” language for poor ratings (this phenomenon will resurface later, when we try to create a model to generate reviews).
Let’s get a better picture of the frequency of these words using WordClouds.
There are some interesting trends here; it looks like “asked” and “minute” tend to be consistently used across both cuisines (in addition to words that we’d expect, such as “horrible”). I can imagine that “minute” comes from complaints about wait times, and “asked” from requests to speak with the manager.
Now that we have a better understanding of the data that we’re working with, let’s see if we can replicate the patterns we’ve seen by generating some of our own Yelp reviews. We will approach this in two different ways: first, we will attempt a more brute-force approach using a concept called a Markov Chain. Then, we will try to improve on our review generation by leveraging the text-processing capabilities of a machine-learning model.
As a baseline, let’s first generate a review through a non-ML method: a Markov Chain. A Markov Chain is a fairly common and relatively simple way to statistically model processes, from financial models to text generation (read more about the topic here). One fun example of a Markov Chain in action is r/SubredditSimulator, a section of the popular social media site Reddit dedicated to generating fake posts and comments entirely through Markov Chains. The model works in essence by predicting an event, or state, given the conditions of the previous state that the model was in.
How would this apply to text generation? Well, we can visualize a simple example of typing on your phone. As you type out a word, you notice that on top of your keyboard are suggestions for what word may come next. For example, typing “thank” brings up “you”, “God”, and “me” as suggestions. This is because given that first word, your next one following would most commonly be one of those three.
And that’s how we’ll make use of the Markov Chain! By reading in a large input of reviews, we can see which words are most likely to follow some other word and thus compile a dictionary mapping each word to its possible successors. Then, starting with a seed word, we can generate a word to follow, and another one after that, and so on, without needing to use any machine learning yet.
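A minimal word-level Markov Chain generator along these lines might look like the sketch below. The function names and toy reviews are ours for illustration; our notebook's implementation may differ in details such as tokenization.

```python
import random
from collections import defaultdict

def build_chain(reviews):
    """Map each word to the list of words observed immediately after it."""
    chain = defaultdict(list)
    for review in reviews:
        words = review.split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)  # duplicates preserve observed frequencies
    return chain

def generate(chain, seed, length=50, rng=None):
    """Walk the chain from `seed`, sampling each next word from the observed followers."""
    rng = rng or random.Random()
    words = [seed]
    while len(words) < length:
        followers = chain.get(words[-1])
        if not followers:
            break  # dead end: this word was never followed by anything
        words.append(rng.choice(followers))
    return " ".join(words)

# Toy corpus (made up); the real input would be thousands of filtered reviews.
toy_reviews = [
    "the food was great and the service was friendly",
    "the burger was great",
    "the service was slow",
]
chain = build_chain(toy_reviews)
print(generate(chain, "the", length=12, rng=random.Random(0)))
```

Because followers are stored with repeats, `rng.choice` naturally picks common continuations more often than rare ones, which is exactly the "keyboard suggestion" behavior described earlier.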
Filtering our reviews dataset for only those describing New American restaurants and that have a rating of 4 stars or higher, and generating a chain of 50 words, we get the following output:
How about one generated from only negative reviews?
Not bad! Would reviews generated from this method look different as the input words vary from restaurant category to category? We repeat our procedure with reviews matching restaurants in the Mexican category:
As you can see, the Markov Chain method is a simple but effective way of generating reviews that, at a quick glance, look halfway passable. However, let’s see if we can improve on the credibility of our fake reviews by bringing in some good old-fashioned machine-learning.
Model: Long Short Term Memory (LSTM) Neural Network with TensorFlow
For our deep-learning approach, we’ll be using TensorFlow, an open source ML platform developed by Google. We’ll be accessing its features through the Keras framework (read more here), a high-level API built on TensorFlow that’s become very popular in the machine-learning community for its ease of use (Keras was reportedly used at the Large Hadron Collider). We use TensorFlow because its libraries are well documented and there is an abundance of resources on applying it to natural language processing. Specifically, we will be taking advantage of the Long Short Term Memory (LSTM) network that ships with TensorFlow.
LSTM belongs to a class of artificial neural networks called Recurrent Neural Networks (RNNs). These are a subset of neural networks that, unlike a conventional feed-forward network, retain information from one step to the next, allowing them to learn from previous events. This is a requisite for text analysis since the structure of text plays a large part in its meaning. However, LSTMs take this retention a step further. While a standard RNN in practice struggles to hold on to information from more than a few steps back, an LSTM maintains a gated memory cell that can carry information across many steps, allowing it to learn longer-term dependencies. This gives us a huge leg up in our language processing capabilities. Suppose we want to predict the word blanked out in the following text:
Ivan is a student at the University of Pennsylvania. He is walking through campus on his way to the soccer field. On the way down, he sees a good friend and asks him if he’d like to come along and play ______ with him.
Here, we want our model to learn from the prior instance of “soccer field” to then predict “soccer”. With a standard RNN, the gap between where the relevant information appears and where we need it is often too large to bridge. However, with an LSTM we’re able to make use of this prior knowledge. For further reading, here’s a great Medium article that goes deeper into the practical advantages of LSTMs in language analysis.
To prepare our Yelp reviews to be fed into our neural network, we cleaned and re-tokenized our review strings using the Keras tokenizer, which is necessary since neural networks cannot take in raw strings. To maintain a sane computation time, we filtered our dataset to only keep reviews of 250 words or fewer (without this step, our tokenizer crashed our Colab virtual environment every time without fail). Since the reviews we will be generating will be relatively short, this works out for our current use case. However, if we were to generate longer reviews it would be advantageous to reevaluate our model setup. To keep the size of our inputs consistent, we then pad our sequences to the length of the longest review.
Keras makes building a model simple by stringing together different neural network layers. We will be using two layers of LSTMs, as well as two Dense layers for further optimization. We will also include dropout in our model to prevent overfitting.
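To show what tokenizing and padding actually do to the text, here is a plain-Python sketch of the idea. It is a stand-in for the Keras tokenizer and padding utility, not their implementation: like Keras, it gives the most frequent word index 1, reserves 0 for padding, and pads at the front, but the helper names are ours and details such as punctuation handling are simplified.

```python
from collections import Counter

MAX_WORDS = 250  # the length cap described above

def fit_word_index(texts):
    """Assign each word an integer id, most frequent first; 0 is reserved for padding."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word with its integer id."""
    return [[word_index[w] for w in t.lower().split() if w in word_index] for t in texts]

def pad(sequences, maxlen=None, value=0):
    """Left-pad (and, if needed, left-truncate) every sequence to the same length."""
    maxlen = maxlen or max(len(s) for s in sequences)
    return [[value] * (maxlen - len(s)) + s[-maxlen:] for s in sequences]

# Toy reviews (made up).
texts = ["great food great service", "great tacos"]
texts = [t for t in texts if len(t.split()) <= MAX_WORDS]  # drop overly long reviews

word_index = fit_word_index(texts)
padded = pad(texts_to_sequences(texts, word_index))
print(word_index)
print(padded)
```

Every row of `padded` now has the same length, which is what lets us stack the reviews into a single array for the network.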
Here’s what our model setup looks like:
We will be using the Adam optimizer, along with cross entropy loss and a standard accuracy metric. With a sample size of 137,585 words, training our model over 150 epochs took approximately 21 minutes, with an average time of ~8s taken per epoch.
As shown above, the model is able to reach 78.12% training accuracy and a loss of 0.7543 within a relatively short amount of time. However, plotting the test accuracy and loss reveals some pretty lackluster and surprising results:
From the plots, it’s clear that accuracy on our test set begins to drop off at around 15 epochs or so, and never reaches a satisfactory level. Meanwhile, loss increases throughout. This is not the result we’re going for, and we may be tempted to limit the number of epochs to maximize accuracy. Doing so and rerunning our model seems to accomplish just that:
But, how do the reviews generated from each stack up?
To create a review string consisting of n words, we’ll first need to specify a seed string such as “This restaurant…”. We can then feed this text into our model, which will produce the most likely word to follow it based on the weights learned during training. We then append that word to our seed text and feed the result to the model once again. The difference from our Markov Chain, which generated each word from only the immediately prior word, is that our model takes the whole input string into account when making its prediction. We repeat until a review of the desired length has been generated.
Following this procedure for both the 20 epoch model and the 150 epoch model, the results are pretty much a no-contest:
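The generation loop itself is simple, and can be sketched independently of the network. In the sketch below, `predict_next` is a hypothetical callable standing in for the trained model: with the real network, it would tokenize and pad the current text, run a prediction, and return the word with the highest probability. The toy predictor here exists only so the example runs.

```python
def generate_review(seed_text, n_words, predict_next):
    """Repeatedly ask `predict_next` for the next word and append it to the text."""
    words = seed_text.split()
    for _ in range(n_words):
        # The predictor receives the whole text so far, not just the last word.
        words.append(predict_next(words))
    return " ".join(words)

def toy_predict_next(words):
    """Stand-in for the trained model (hypothetical): a tiny lookup on recent context."""
    table = {
        ("this", "restaurant"): "is",
        ("restaurant", "is"): "great",
    }
    context = tuple(w.lower() for w in words[-2:])
    return table.get(context, "food")

review = generate_review("This restaurant", 3, toy_predict_next)
print(review)
```

Swapping `toy_predict_next` for a function that wraps the trained LSTM is all that changes between this sketch and generating from the actual model.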
As we can see, the model trained for longer produces much more realistic reviews. The culprit for our low test accuracies seems to be an inherent difficulty in comparing snippets of text to each other. Although the model trained for only 20 epochs maximized accuracy, it did so by simply repeating the most common patterns and phrases seen in our reviews. Instead, our model trained for 150 epochs produced reviews that varied in content and structure. While these reviews do not validate well with new inputs, they nevertheless are much more realistic and thus more in tune with the goals of this project.
Let’s train another model, this time using bad reviews for New American restaurants as input. Here’s what we get:
What about for a different cuisine? Here are some reviews generated from using reviews for Mexican restaurants:
Overall, quite impressive! There’s a clear difference between reviews generated with a model fed positive reviews and one fed negative ones. Reviews also differ from category to category with the inclusion of specific menu items.
A curious observation is that the generated negative reviews appear more realistic than positive ones. One factor may be that there were fewer negative reviews than positive ones overall, which led to less data to compute over and a higher accuracy. Our hypothesis is that there’s also another factor at play that ties back to our analysis of reviews earlier: negative reviews seem to be more similar on a review-by-review basis in general. For example, “my food was cold” or “the service was slow” are common complaints that warrant a negative rating no matter the specifics of the restaurant, whereas positive reviews tend to praise specifics of the experience more, such as naming specific menu items.
Let’s see how these faked reviews compare to their non-ML counterparts from earlier, using our Markov Chain generator with a seed text.
Turns out, there was a whole host of interesting insights that were waiting for us in the Yelp dataset. We saw, after doing a little data wrangling first, how both tastes and ratings varied from city to city (and that we need to plan a trip to Montreal soon). From our reviews dataset, we analyzed sentiments across categories, and highlighted the words most commonly used across good reviews and those used in bad reviews.
In our efforts to generate our own reviews, we saw that we could easily generate some basic yet interesting reviews using a Markov Chain. Leveraging the LSTM Neural Network, we were then able to create a machine-learning model that was capable of generating some impressively believable, yet fake, reviews. Switching between positive and negative reviews as inputs produced text that included many of the most commonly used words we isolated earlier, and reviews generated across categories varied with cuisine-specific menu items being explicitly mentioned.
We ran into a lot (a lot!) of problems right from the outset due to the size of the dataset we chose. When we tried to perform some of the analysis we had planned, we would crash our Google Colab environment with an out-of-RAM error pretty much 9 times out of 10.
This naturally led us to try to take advantage of the distributed computing power of Apache Spark. However, this decision came with a whole host of problems of its own. In the first stages of this project, we actually had no way of transferring dataframes between our local environment and Spark. This did not seem like a significant issue at first since we could import Pandas in our Spark environment and run it there. That is, until we found out that matplotlib, our visualization library, did not work in Apache Spark. Therefore we would have to do our analysis in Spark and then find a way to pass it back to our local environment to plot it. We found the solution to this problem by diving into the nuances of Spark cell magics (cell magics allow you to execute in different contexts in Colab) and discovering that there exists a built-in cell magic command to pass a Spark dataframe back to Colab.
Some other aspects that would be interesting to explore include:
- Can we find shared characteristics and insights among businesses based on their attributes, not only their categories?
- How can we optimize our models to generate longer reviews?
- Can we train a neural network to classify reviews by rating, and then run it on a dataset of generated reviews?
- Can we leverage positive/negative attributes to make our generated reviews even more realistic?
Thanks for reading about our project! You can check out the backing code in our annotated notebook here.