Tokyo Airbnb Data Analytics

Overview

I had planned to visit Japan for the Tokyo 2020 Summer Olympic Games before the global COVID-19 pandemic happened. I have used Airbnb in Tokyo a few times and considered using it again during the Olympics, so I decided to explore and analyze the Tokyo Airbnb dataset to learn more about Airbnb in Tokyo. I used Python and its libraries, including scikit-learn, Pandas, Matplotlib, and NLTK, to analyze the dataset.

This data analytics project consists of the following:

Exploratory Data Analysis (EDA)
- I processed and visualized the data to perform EDA.
Price Prediction Modeling
- I created a linear regression model with an adjusted R squared of 0.44 and a random forest model with an adjusted R squared of 0.87 for predicting listing prices using 15,551 Tokyo Airbnb listings.
- The root mean squared errors (RMSE) of the linear regression and the random forest models were 6,567 yen and 4,834 yen, respectively.
- I performed a 5-fold cross-validation on the random forest model and got an adjusted R squared of 0.71, well below the 0.87 reported above, which indicates that the model was overfitting.
Sentiment Analysis
- N-grams showed that highly rated listings were clean, in a great location, close to a train station, close to a convenience store, or had a great host.
- I created a logistic regression model with an accuracy score of 83.9% to perform sentiment analysis on 416,396 Tokyo Airbnb reviews. The model classifies listing reviews into 5 sentiment classes: very bad, bad, neutral, good, and very good.
- I performed a 5-fold cross-validation on the logistic regression model and got a mean accuracy score of 83.5%, which suggests that the model generalized well.

Code

My code can be found here:

https://github.com/masaki9/Tokyo-Airbnb

Dataset

I used the Tokyo Airbnb dataset compiled on February 29, 2020, that I downloaded from Inside Airbnb. The dataset includes data on Tokyo Airbnb listings and reviews.

Exploratory Data Analysis

Summary

The summary of my findings is as follows:

The average listing price is 23,982 yen.
The listing prices above 48,000 yen are outliers.
Most listing prices are less than 40,000 yen.
The top five neighbourhoods by the number of listings are Shinjuku, Taito, Toshima, Sumida, and Shibuya. These five neighbourhoods are popular among tourists, so it is not surprising to see many listings.
Most listings are entire homes/apartments (69%) and private rooms (21%).
Tokyo Airbnb hosts are stricter than typical hotels in Tokyo regarding cancellations.

Data Visualization

Average Price by Neighborhood

The figure below shows the average price of listings by neighbourhood. There are 55 neighbourhoods listed in Tokyo Airbnb.

Average Price by Neighborhood

Top 10 Neighborhoods by Average Price

The figure below shows the top 10 neighbourhoods by the average price of listings.

Top 10 Neighborhoods by Average Price

Top 10 Neighborhoods by Number of Listings

The figure below shows the top 10 neighbourhoods by the number of listings.

Top 10 Neighborhoods by Number of Listings

Box Plot for Listing Prices by Neighbourhood

The figure below shows many outliers above the box plots’ upper whiskers, mostly above 48,000 yen.

Box Plot for Listing Prices by Neighbourhood

Price Distribution Plot

The figure below shows that most listing prices are less than 40,000 yen.

Price Distribution Plot

Distribution of Types of Rooms

The figure below shows that most of the listings are entire homes/apartments and private rooms.

Distribution of Types of Rooms

Distribution of Cancellation Policies

The figure below shows that most cancellation policies are strict or moderate. In Japan, hotels commonly offer flexible cancellation with a 100% refund, but Tokyo Airbnb hosts seem to have stricter policies.

Distribution of Cancellation Policies

Price Prediction Modeling

I developed a linear regression model and a random forest model for price predictions using 15,551 Tokyo Airbnb listings.

Feature Engineering

I did the following data processing after performing EDA:

I removed outlier listings above 48,000 yen.
I removed unnecessary columns such as host_id and scrape_id.
I filled in the missing values in numerical columns with the median values.
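
The three steps above can be sketched as follows. This is an illustration in plain Python rather than the project's actual Pandas code; the row layout, the `clean_listings` helper, and the sample values are assumptions, and the sketch computes medians after removing outliers, which the text does not specify.

```python
from statistics import median

PRICE_CAP_YEN = 48_000  # outlier threshold stated in the text
DROP_COLUMNS = {"host_id", "scrape_id"}  # the two columns named in the text


def clean_listings(rows):
    """Apply the three cleaning steps to a list of dict rows."""
    # 1. Remove outlier listings priced above the cap.
    kept = [row for row in rows if row["price"] <= PRICE_CAP_YEN]

    # 2. Drop the unnecessary columns.
    kept = [
        {key: value for key, value in row.items() if key not in DROP_COLUMNS}
        for row in kept
    ]

    # 3. Fill missing numerical values (None here) with the column median.
    columns = {key for row in kept for key in row}
    for column in columns:
        present = [
            row[column] for row in kept
            if isinstance(row.get(column), (int, float))
        ]
        if not present:
            continue  # nothing numeric to take a median of
        fill_value = median(present)
        for row in kept:
            if row.get(column) is None:
                row[column] = fill_value
    return kept


# Made-up sample rows for illustration only.
sample = [
    {"host_id": 1, "scrape_id": 9, "price": 12_000, "beds": 2},
    {"host_id": 2, "scrape_id": 9, "price": 60_000, "beds": 4},
    {"host_id": 3, "scrape_id": 9, "price": 8_000, "beds": None},
    {"host_id": 4, "scrape_id": 9, "price": 20_000, "beds": 4},
]
cleaned = clean_listings(sample)
```

Median imputation is less sensitive to extreme values than mean imputation, which matters for skewed columns such as prices and fees.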

Correlated Features for Listing Prices

I found that the “accommodates”, “guests_included”, “cleaning_fee”, “beds”, and “bedrooms” features were the five features most correlated with listing prices, but per the scatter plot below, they didn’t seem to be highly correlated. I ended up keeping every feature in my models.

Correlated Features for Listing Prices

Scatter Plot Matrix for Price Related Features
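
To illustrate how features can be ranked by their correlation with price, here is a minimal sketch using only the standard library. It assumes the Pearson correlation coefficient, which the text does not confirm, and the sample values are made up for illustration; only the feature names come from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def rank_by_correlation(features, target):
    """Sort feature names by absolute correlation with the target, strongest first."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up toy values, not taken from the Tokyo dataset.
price = [8000, 12000, 15000, 30000]
features = {
    "beds": [1, 2, 2, 3],
    "accommodates": [2, 3, 4, 8],
}
ranked = rank_by_correlation(features, price)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

A ranking like this only captures linear relationships, which is one reason a scatter plot matrix is a useful companion check.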

Results

The results of my linear regression and random forest models are in the table below:

Model	Adjusted R Squared	Adjusted R Squared (Cross-Validation)	RMSE
Linear Regression	0.4407	0.4366	6,567 yen
Random Forest	0.8749	0.7124	4,833 yen

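The table shows that the linear regression model fits poorly, explaining only about 44% of the variance in listing prices, and that the random forest model is overfitting: its adjusted R squared drops from 0.87 to 0.71 under cross-validation.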
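
For reference, the two metrics in the table can be computed as in the sketch below. It uses the standard definitions of adjusted R squared and RMSE; the function names and the toy numbers are illustrative assumptions, not values from the project.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R squared for the number of features used by the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target (yen here)."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up prices and predictions in yen.
actual = [10_000, 20_000, 30_000, 40_000]
predicted = [12_000, 18_000, 33_000, 39_000]

r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n_samples=len(actual), n_features=1)
error = rmse(actual, predicted)
print(f"R2={r2:.3f}, adjusted R2={adj_r2:.3f}, RMSE={error:.0f} yen")
```

Because adjusted R squared subtracts a penalty that grows with the number of features, it is always at most the plain R squared when the model has at least one feature.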

Future Improvement

I only used the numerical columns in the dataset for price prediction modelling. I could use categorical columns and more feature engineering to improve my models.

Sentiment Analysis

Sentiment analysis is the interpretation of text data using natural language processing (NLP) and text analysis techniques. I performed sentiment analysis on 416,396 Tokyo Airbnb reviews.

Summary Statistics

Count: 416,394 Reviews
Average Review Scores Rating: 93.7
Maximum Review Scores Rating: 100.0
Minimum Review Scores Rating: 20.0

Rating Distribution

The figure below shows a left-skewed distribution, with far more high ratings than low ones.

Rating Distribution

Text Data Processing

I did the following to process the reviews text data:

Tokenization
- I removed punctuation and split each review into an array of words.
Removing Stop Words
- I removed insignificant words such as articles and prepositions.
Stemming
- I reduced words to their root forms.
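
The pipeline above can be sketched with the standard library alone. This is an illustration, not the project's NLTK code: the stop word list is a tiny hand-picked sample, and `toy_stem` is a crude stand-in for a real stemmer such as NLTK's `PorterStemmer`.

```python
import re

# Tiny illustrative stop word list; NLTK ships a much fuller English list.
STOP_WORDS = {"a", "an", "the", "is", "was", "were", "to", "of", "in", "and", "it"}


def tokenize(review):
    """Lowercase the text, drop punctuation, and split into words."""
    return re.findall(r"[a-z0-9]+", review.lower())


def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def toy_stem(token):
    """Strip one common suffix; a rough approximation of real stemming."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(review):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [toy_stem(token) for token in remove_stop_words(tokenize(review))]


tokens = preprocess("The rooms were clean, and the host was amazing!")
print(tokens)  # ['room', 'clean', 'host', 'amaz']
```

Stems such as "amaz" are not dictionary words, which is expected: stemming only needs to map related word forms to the same token.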

N-gram Analysis

I used n-grams with sizes of 1 to 4 for modelling and analysis. The following are some of the characteristics of Tokyo Airbnb listings with high review ratings:

Great place
Great location
Close to train station
Within walking distance
Clean place
Convenience store
Great Airbnb host
Airbnb that they can recommend
Airbnb that they enjoyed

Top 20 N-grams for Review Scores Between 90 and 100
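
As an illustration of how n-grams of sizes 1 to 4 can be counted, here is a minimal sketch. The `ngrams` and `top_ngrams` helpers and the sample token lists are hypothetical; the project's own implementation may differ.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(token_lists, min_n=1, max_n=4, k=20):
    """Count n-grams of sizes min_n..max_n across reviews; return the k most common."""
    counts = Counter()
    for tokens in token_lists:
        for n in range(min_n, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts.most_common(k)


# Made-up, already preprocessed reviews.
reviews = [
    ["great", "location", "close", "train", "station"],
    ["great", "location", "clean", "place"],
    ["clean", "place", "close", "train", "station"],
]
top = top_ngrams(reviews, min_n=2, max_n=2, k=4)
for phrase, count in top:
    print(phrase, count)
```

Restricting the reviews passed in to those with scores between 90 and 100 would give a ranking comparable in spirit to the figure above.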

Logistic Regression Model

I developed a logistic regression model to predict how users feel about Tokyo Airbnb listings using 416,396 Tokyo Airbnb reviews. The model classifies each review into Very Bad, Bad, Neutral, Good, or Very Good.

Discretization

Review ratings are continuous and range from 0.0 to 100.0 in the dataset. To perform classification, I discretized the review ratings into the following:

0 to 39 → 0 (Very Bad)
40 to 59 → 1 (Bad)
60 to 79 → 2 (Neutral)
80 to 89 → 3 (Good)
90 to 100 → 4 (Very Good)
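
The mapping above can be expressed as a small lookup, sketched below. One assumption is labeled in the code: a fractional rating that falls between two listed ranges (for example 39.5) is assigned to the lower class, which the text does not specify.

```python
from bisect import bisect_right

# Lower bounds of classes 1 to 4, taken from the ranges listed above.
BIN_EDGES = [40, 60, 80, 90]
LABELS = ["Very Bad", "Bad", "Neutral", "Good", "Very Good"]


def rating_to_class(rating):
    """Map a review rating in [0, 100] to a class index from 0 to 4."""
    if not 0 <= rating <= 100:
        raise ValueError(f"rating out of range: {rating}")
    # Assumption: values between listed ranges (e.g. 39.5) go to the lower class.
    return bisect_right(BIN_EDGES, rating)


for rating in (20.0, 59, 75.5, 89, 93.7, 100):
    index = rating_to_class(rating)
    print(rating, index, LABELS[index])
```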

Model Evaluation

I created a confusion matrix to evaluate my logistic regression model, which shows the following:

Number of Accurate Predictions: 35 + 1,316 + 103,350 = 104,701
Total Number of Predictions: 124,821
Accuracy = 104,701 / 124,821 = 0.8388 (approx. 83.9%)

Confusion Matrix
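
Accuracy is the sum of a confusion matrix's diagonal divided by the sum of all its cells. The sketch below shows this on a made-up 3-class matrix (not the project's matrix) and then re-checks the arithmetic reported above.

```python
def accuracy_from_confusion(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical matrix: rows are true classes, columns are predicted classes.
toy_matrix = [
    [5, 1, 0],
    [2, 7, 1],
    [0, 3, 11],
]
toy_accuracy = accuracy_from_confusion(toy_matrix)  # 23 correct out of 30

# Re-check of the figures reported in the text.
reported_accuracy = (35 + 1316 + 103350) / 124821
print(round(toy_accuracy, 4), round(reported_accuracy, 4))
```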

I subsequently performed a 5-fold cross-validation on the model. I got a mean accuracy score of 83.5%, which is close to the 83.9% accuracy reported above and suggests that the model generalized well.
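
To make the procedure concrete, here is a minimal sketch of how k-fold cross-validation splits the data and averages the per-fold scores. The helpers are hypothetical and use contiguous folds without shuffling or stratification; the project itself relied on scikit-learn, whose splitting strategy may differ.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(n_samples, score_fold, k=5):
    """Average score_fold(train_idx, test_idx) over the k folds."""
    scores = [score_fold(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(12, k=5))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)

# A dummy scorer standing in for "fit on train, measure accuracy on test".
mean_score = cross_validate(12, lambda train, test: len(test) / 12, k=5)
```

Each sample is used for testing exactly once, so the mean score is less dependent on one particular train/test split than a single hold-out evaluation.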