Overview
I had planned to visit Japan for the Tokyo 2020 Summer Olympic Games before the global COVID-19 pandemic happened. I have used Airbnb in Tokyo a few times and considered using it again during the Olympics, so I wanted to explore and analyze the Tokyo Airbnb dataset to learn more about Airbnb in Tokyo. I used Python and its libraries, including scikit-learn, Pandas, Matplotlib, and NLTK, to analyze the dataset.
This data analytics project consists of the following:
- Exploratory Data Analysis (EDA)
  - I processed and visualized the data to perform EDA.
- Price Prediction Modeling
  - I created a linear regression model with an adjusted R squared of 0.44 and a random forest model with an adjusted R squared of 0.87 for predicting listing prices using 15,551 Tokyo Airbnb listings.
  - The root mean squared errors (RMSE) of the linear regression and the random forest models were 6,567 yen and 4,834 yen, respectively.
  - I performed 5-fold cross-validation on the random forest model and got an adjusted R squared of 0.71, well below the 0.87 above, which indicates that the model was overfitting.
- Sentiment Analysis
  - N-grams showed that highly rated listings were clean, in a great location, close to a train station, close to a convenience store, or had a great host.
  - I created a logistic regression model with an accuracy score of 83.9% to perform sentiment analysis on 416,396 Tokyo Airbnb reviews. The model classifies listing reviews into 5 sentiment classes: very bad, bad, neutral, good, and very good.
  - I performed 5-fold cross-validation on the logistic regression model and got a mean accuracy score of 83.5%, which suggests that the model generalized well.
Code
My code can be found here:
Dataset
I used the Tokyo Airbnb dataset compiled on February 29, 2020, that I downloaded from Inside Airbnb. The dataset includes data on Tokyo Airbnb listings and reviews.
Exploratory Data Analysis
Summary
The summary of my findings is as follows:
- The average listing price is 23,982 yen.
- The listing prices above 48,000 yen are outliers.
- Most listing prices are less than 40,000 yen.
- The top five neighbourhoods by the number of listings are Shinjuku, Taito, Toshima, Sumida, and Shibuya. These five neighbourhoods are popular among tourists, so it is not surprising to see many listings.
- Most listings are entire homes/apartments (69%) and private rooms (21%).
- Tokyo Airbnb hosts are stricter than typical hotels in Tokyo regarding cancellations.
Data Visualization
Average Price by Neighborhood
The figure below shows the average price of listings by neighbourhood. There are 55 neighbourhoods listed in Tokyo Airbnb.
Average Price by Neighborhood
Top 10 Neighborhoods by Average Price
The figure below shows the top 10 neighbourhoods by the average price of listings.
Top 10 Neighborhoods by Average Price
Top 10 Neighborhoods by Number of Listings
The figure below shows the top 10 neighbourhoods by the number of listings.
Top 10 Neighborhoods by Number of Listings
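The neighbourhood rankings above can be sketched with a Pandas group-by. This is only an illustration: the column names `neighbourhood` and `price` and the toy rows are my assumptions, not values from the dataset.

```python
import pandas as pd

# Toy stand-in for the listings table; column names and rows are assumptions.
listings = pd.DataFrame({
    "neighbourhood": ["Shinjuku", "Shinjuku", "Taito", "Shibuya", "Shibuya", "Shibuya"],
    "price": [12000, 18000, 9000, 20000, 25000, 30000],
})

def top_neighbourhoods(df, n=10):
    """Return the top-n neighbourhoods by mean price and by listing count."""
    by_price = (
        df.groupby("neighbourhood")["price"]
        .mean()
        .sort_values(ascending=False)
        .head(n)
    )
    by_count = df["neighbourhood"].value_counts().head(n)
    return by_price, by_count

by_price, by_count = top_neighbourhoods(listings, n=2)
print(by_price)
print(by_count)
```

Either series can be passed to a Matplotlib bar chart to produce a figure like the ones in this section.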
Box Plot for Listing Prices by Neighbourhood
The figure below shows many outliers above the box plots’ upper whiskers, mostly above 48,000 yen.
Box Plot for Listing Prices by Neighbourhood
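For reference, box plots conventionally flag points beyond Q3 + 1.5 × IQR as outliers. The sketch below computes that upper fence. It assumes this common rule applies here, and the sample prices are hypothetical.

```python
from statistics import quantiles

def upper_fence(prices, k=1.5):
    """Return Q3 + k * IQR, the usual upper limit for box-plot whiskers."""
    q1, _, q3 = quantiles(prices, n=4, method="inclusive")
    return q3 + k * (q3 - q1)

# Hypothetical nightly prices in yen, not values from the dataset.
sample_prices = [8000, 10000, 12000, 14000, 16000, 18000, 20000, 90000]

limit = upper_fence(sample_prices)
outliers = [p for p in sample_prices if p > limit]
print(limit, outliers)
```

Any price above the returned limit would be drawn as an individual point above the whisker.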
Price Distribution Plot
The figure below shows that most listing prices are less than 40,000 yen.
Price Distribution Plot
Distribution of Types of Rooms
The figure below shows that most of the listings are entire homes/apartments and private rooms.
Distribution of Types of Rooms
Distribution of Cancellation Policies
The figure below shows that most cancellation policies are strict or moderate. In Japan, it is common for hotels to offer flexible cancellations with a 100% refund, but Tokyo Airbnb hosts seem to have stricter policies.
Distribution of Cancellation Policies
Price Prediction Modeling
I developed a linear regression model and a random forest model for price predictions using 15,551 Tokyo Airbnb listings.
Feature Engineering
I did the following data processing after performing EDA:
- I removed outlier listings above 48,000 yen.
- I removed unnecessary columns such as host_id and scrape_id.
- I filled in the missing values in numerical columns with the median values.
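The three processing steps above could look roughly like this in Pandas. It is a minimal sketch: the `prepare_listings` helper, the `price` and `beds` column names, and the toy rows are assumptions for illustration.

```python
import pandas as pd

def prepare_listings(df, price_cap=48_000, drop_cols=("host_id", "scrape_id")):
    """Drop price outliers and ID columns, then median-fill numeric gaps."""
    # Keep only listings at or below the outlier threshold.
    out = df[df["price"] <= price_cap].copy()
    # Remove identifier columns that carry no predictive signal.
    out = out.drop(columns=[c for c in drop_cols if c in out.columns])
    # Fill missing numeric values with each column's median.
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out

# Toy rows; the real dataset has many more columns.
raw = pd.DataFrame({
    "host_id": [1, 2, 3, 4],
    "scrape_id": [9, 9, 9, 9],
    "price": [10000, 60000, 20000, 30000],
    "beds": [1.0, 2.0, None, 3.0],
})

clean = prepare_listings(raw)
print(clean)
```

Note that in this sketch the medians are computed after the outliers are removed, so the filled values reflect only the listings that remain.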
Correlated Features for Listing Prices
I found that “accommodates”, “guests_included”, “cleaning_fee”, “beds”, and “bedrooms” were the five features most correlated with listing price, but per the scatter plot below, none of them appeared to be highly correlated with it. I ended up keeping every feature in my models.
Correlated Features for Listing Prices
Scatter Plot Matrix for Price Related Features
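One way to rank features by their correlation with price is shown below. It is a sketch that uses Pearson correlation on toy data; the `price_correlations` helper and the sample values are assumptions, not output from the dataset.

```python
import pandas as pd

def price_correlations(df, target="price", top=5):
    """Rank numeric features by absolute Pearson correlation with the target."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(top)

# Toy rows for illustration only.
toy = pd.DataFrame({
    "accommodates": [1, 2, 3, 4, 5],
    "minimum_nights": [3, 1, 4, 1, 5],
    "price": [5000, 9000, 12000, 18000, 21000],
})

ranked = price_correlations(toy)
print(ranked)
```

Sorting by absolute value keeps strongly negative correlations near the top as well.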
Results
The results of my linear regression and random forest models are in the table below:
Model | Adjusted R Squared | Adjusted R Squared (Cross-Validation) | RMSE |
---|---|---|---|
Linear Regression | 0.4407 | 0.4366 | 6,567 yen |
Random Forest | 0.8749 | 0.7124 | 4,833 yen |
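The table shows that the linear regression model fits the data poorly, explaining only about 44% of the variance in listing prices, and that the random forest model is overfitting: its adjusted R squared drops from 0.87 to 0.71 under cross-validation.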
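For readers unfamiliar with the metrics in the table, here is a small sketch of how adjusted R squared and RMSE are defined. The sample values are hypothetical and the functions are plain-Python stand-ins, not the code used for the results above.

```python
from math import sqrt

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R squared for the number of features used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target (yen here)."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical prices in yen.
y_true = [10000, 20000, 30000, 40000]
y_pred = [12000, 18000, 33000, 39000]

r2 = r_squared(y_true, y_pred)
adj = adjusted_r_squared(r2, n_samples=len(y_true), n_features=1)
err = rmse(y_true, y_pred)
print(round(r2, 3), round(adj, 3), round(err, 2))
```

Adjusted R squared is never higher than plain R squared when at least one feature is used, which makes it a fairer comparison between models with different numbers of features.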
Future Improvement
I only used the numerical columns in the dataset for price prediction modelling. I could use categorical columns and more feature engineering to improve my models.
Sentiment Analysis
Sentiment analysis is the interpretation of opinions and emotions in text data using natural language processing (NLP) and text analysis techniques. I performed sentiment analysis on 416,396 Tokyo Airbnb reviews.
Summary Statistics
- Count: 416,394 Reviews
- Average Review Scores Rating: 93.7
- Maximum Review Scores Rating: 100.0
- Minimum Review Scores Rating: 20.0
Rating Distribution
The figure below shows a left-skewed distribution, with far more highly rated reviews than low-rated ones.
Rating Distribution
Text Data Processing
I did the following to process the reviews text data:
- Tokenization
  - I removed punctuation and split each review into a list of words.
- Removing Stop Words
  - I removed insignificant words such as articles and prepositions.
- Stemming
  - I reduced words to their root form.
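The steps above can be sketched as a single function. This is a simplified stand-in: the stop-word set here is a tiny illustrative list, and the stemmer is passed in as a parameter so the sketch runs without third-party packages.

```python
import re

# Tiny illustrative stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "and", "is", "was", "it"}

def preprocess(review, stem=lambda word: word):
    """Lowercase, strip punctuation, drop stop words, then stem each token."""
    # Tokenization: keep runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", review.lower())
    # Stop-word removal followed by stemming.
    return [stem(tok) for tok in tokens if tok not in STOP_WORDS]

# For real stemming, pass nltk.stem.PorterStemmer().stem as `stem`;
# the identity default only keeps this sketch dependency-free.
tokens = preprocess("The room was clean, and close to the station!")
print(tokens)  # ['room', 'clean', 'close', 'station']
```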
N-gram Analysis
I used n-grams with sizes of 1 to 4 for modelling and analysis. The following are some of the characteristics of Tokyo Airbnb listings with high review ratings:
- Great place
- Great location
- Close to train station
- Within walking distance
- Clean place
- Convenience store
- Great Airbnb host
- Airbnb that they can recommend
- Airbnb that they enjoyed
Top 20 N-grams for Review Scores Between 90 and 100
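Counting n-grams of sizes 1 to 4 can be done with a sliding window over each tokenized review. The sketch below uses only the standard library, and the two sample reviews are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-word sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(token_lists, sizes=(1, 2, 3, 4), k=20):
    """Count n-grams of the given sizes across reviews; return the k most common."""
    counts = Counter()
    for tokens in token_lists:
        for n in sizes:
            counts.update(ngrams(tokens, n))
    return counts.most_common(k)

# Hypothetical pre-processed reviews.
reviews = [
    ["great", "location", "close", "station"],
    ["great", "location", "clean", "place"],
]

top = top_ngrams(reviews, sizes=(2,), k=1)
print(top)  # [('great location', 2)]
```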
Logistic Regression Model
I developed a logistic regression model to predict how users feel about Tokyo Airbnb listings using 416,396 Tokyo Airbnb reviews. The model classifies each review into Very Bad, Bad, Neutral, Good, or Very Good.
Discretization
Review ratings are continuous and range from 0.0 to 100.0 in the dataset. To perform classification, I discretized the review ratings into the following:
- 0 to 39 → 0 (Very Bad)
- 40 to 59 → 1 (Bad)
- 60 to 79 → 2 (Neutral)
- 80 to 89 → 3 (Good)
- 90 to 100 → 4 (Very Good)
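The mapping above can be expressed as a lookup against the lower edge of each bin. This sketch assumes that a fractional score such as 39.5 falls into the lower class, which the list above does not specify.

```python
from bisect import bisect_right

# Lower edges of classes 1 to 4; anything below 40 is class 0.
LOWER_EDGES = [40, 60, 80, 90]
LABELS = ["Very Bad", "Bad", "Neutral", "Good", "Very Good"]

def rating_to_class(rating):
    """Map a 0-100 review rating to a sentiment class index from 0 to 4."""
    return bisect_right(LOWER_EDGES, rating)

for score in (20, 40, 79, 89, 93.7):
    cls = rating_to_class(score)
    print(score, cls, LABELS[cls])
```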
Model Evaluation
I created a confusion matrix to evaluate my logistic regression model, which shows the following:
- Number of Accurate Predictions: 35 + 1,316 + 103,350 = 104,701
- Total Number of Predictions: 124,821
- Accuracy: 104,701 / 124,821 = 0.8388 (approx. 83.9%)
Confusion Matrix
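Accuracy is the sum of the confusion matrix diagonal divided by the total number of predictions. The sketch below shows the calculation on a hypothetical 3-class matrix and then reproduces the arithmetic from the figures reported above.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correct predictions (diagonal) / all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical 3-class matrix (rows: actual, columns: predicted).
toy_matrix = [
    [5, 1, 0],
    [2, 7, 1],
    [0, 1, 8],
]
toy_accuracy = accuracy_from_confusion(toy_matrix)

# The figures reported above: correct predictions over total predictions.
reported_accuracy = (35 + 1316 + 103350) / 124821

print(toy_accuracy, round(reported_accuracy, 4))  # 0.8 0.8388
```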
I subsequently performed 5-fold cross-validation on the model. I got a mean accuracy score of 83.5%, which is close to the 83.9% accuracy above and suggests that the model generalized well.
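A 5-fold cross-validation of an n-gram logistic regression could be set up in scikit-learn roughly as follows. This is a sketch under assumptions: the use of `CountVectorizer` for the n-gram features, the `max_iter` setting, and the ten toy reviews are illustrative choices, not necessarily what I used for the results above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up reviews labelled with class 4 (Very Good) or 0 (Very Bad).
texts = [
    "great location close to station",
    "clean place great host",
    "very clean and great location",
    "great host would recommend",
    "enjoyed our stay great place",
    "dirty room bad smell",
    "very dirty and noisy",
    "bad host never replied",
    "noisy street bad location",
    "terrible stay dirty room",
]
labels = [4, 4, 4, 4, 4, 0, 0, 0, 0, 0]

# Bag of 1- to 4-grams fed into a logistic regression classifier.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 4)),
    LogisticRegression(max_iter=1000),
)

# With an integer cv and a classifier, scikit-learn uses stratified folds.
scores = cross_val_score(model, texts, labels, cv=5, scoring="accuracy")
mean_accuracy = scores.mean()
print(scores, mean_accuracy)
```

Wrapping the vectorizer and the classifier in one pipeline means the vocabulary is rebuilt from the training folds only, so nothing from the held-out fold leaks into the features.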