Monitoring Airbnb reviews over COVID-19 with folium HeatMapWithTime
Introduction
This article walks through visualizing the fluctuations in Airbnb business caused by the COVID-19 pandemic. Intuitively, we might have a rough guess at what the curves will look like. Still, this is a good opportunity to practice both data processing and visualization when time-series attributes are involved. More importantly, an interactive visualization significantly helps to explore the data and communicate the insights to a wide variety of audiences in an approachable manner.
Data
The dataset was downloaded from InsideAirbnb-Hawaii-data; the analysis covers a span of two years, from July 2019 to July 2021.
Prerequisite
In addition to the commonly used Python data science packages (numpy, pandas, matplotlib, seaborn), we also need to install plotly, folium, and chart-studio. These can be easily installed via pip under a conda environment:
$ pip install plotly
$ pip install folium
$ pip install chart-studio
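For reference, all the snippets below assume the following imports (module aliases follow common conventions):
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objs as go
from plotly.offline import iplot
import chart_studio
import chart_studio.plotly as py
import folium
from folium import plugins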
Part 1. Visualizing with matplotlib and plotly
Inspecting Data
I will mainly be working on reviews.csv and listings.csv; merging operations will be included later for visualization.
df_review = pd.read_csv('reviews.csv')
df_listing = pd.read_csv('listings.csv')
df_listing contains the metadata for all the listings, and df_review has the information for each review associated with its listing id. Notice that listing_id in df_review corresponds to id in df_listing; it will be the key for our merge.
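Before merging, a quick sanity check (my addition, not part of the original workflow) confirms that the key relationship holds:
# every review should reference a listing we have metadata for
assert df_review['listing_id'].isin(df_listing['id']).all()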
Data processing: the number of reviews
There is no single feature that directly indicates the fluctuation of Airbnb business activity. However, from df_review we can tell that each listing received multiple comments at various times. With that in mind, we can gather all the listings that received a comment on each date; the count of total comments per date is then a reasonable proxy for the change in business activity. In addition, we can attach any listing metadata we need. For instance, we may add geographic features (neighbourhood, latitude, longitude) since we are creating a folium map later; this kind of feature also helps to segregate the data, so we can examine the change in a particular area.
Inspect the number of reviews over time
def process_count(df_review, df_listing):
    # keep only the review date and listing id
    df1 = df_review.drop(['reviewer_id', 'reviewer_name', 'comments'], axis=1)
    df2 = df_listing[['id', 'neighbourhood_group_cleansed']]
    # attach the neighbourhood group to each review
    df3 = pd.merge(df2, df1, left_on='id', right_on='listing_id')
    df3 = df3[['date', 'listing_id', 'neighbourhood_group_cleansed']]
    df3.date = pd.to_datetime(df3.date, format="%Y-%m-%d")
    # restrict to the two-year window
    df3 = df3[df3['date'].isin(pd.date_range("2019-07-10", "2021-07-10"))]
    df3 = df3.set_index(df3.date).drop('date', axis=1)
    #df3.to_pickle("ts.pkl")
    return df3
listing_received_review = process_count(df_review, df_listing)
listing_received_review
output:
date | listing_id | neighbourhood_group_cleansed |
---|---|---|
2019-08-19 | 5065 | Hawaii |
2020-02-16 | 5065 | Hawaii |
2020-02-19 | 5065 | Hawaii |
2020-02-28 | 5065 | Hawaii |
2020-03-09 | 5065 | Hawaii |
... | ... | ... |
2021-07-08 | 50599343 | Hawaii |
2021-07-08 | 50682739 | Honolulu |
2021-07-08 | 50710529 | Hawaii |
2021-07-03 | 50736557 | Honolulu |
2021-07-07 | 50752275 | Kauai |
271528 rows × 2 columns
Then we narrow the area down to Honolulu and count how many listings received comments on each date.
def count_listings(df, loc='Honolulu'):
    df = df[df.neighbourhood_group_cleansed == loc]
    df = df.groupby('date', sort=True)['listing_id'].count() \
           .rename('review_count').reset_index().set_index('date')
    return df
count_per_day_honolulu = count_listings(listing_received_review, loc='Honolulu')
count_per_day_honolulu
output:
date | review_count |
---|---|
2019-07-10 | 163 |
2019-07-11 | 151 |
2019-07-12 | 145 |
2019-07-13 | 164 |
2019-07-14 | 208 |
... | ... |
2021-07-04 | 95 |
2021-07-05 | 142 |
2021-07-06 | 71 |
2021-07-07 | 48 |
2021-07-08 | 24 |
729 rows × 1 columns
Create time variable
# define time variable
ts = count_per_day_honolulu.review_count
Fit moving average
def fit_moving_avg(series, window=5):
    """Calculate the moving average of the number of reviews."""
    return series.rolling(window, center=True).mean()
avg_ts = fit_moving_avg(ts)
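The window size is a smoothing knob rather than a fixed choice: larger windows produce a smoother curve but blur short-lived dips. For example, a weekly window lines up with the weekly booking cycle:
# a 7-day centered window smooths out the within-week pattern
avg_ts_weekly = fit_moving_avg(ts, window=7)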
Visualization
via matplotlib
# plot time variable
ts.plot(figsize=(16, 4), label='Number of reviews', title='Number of Reviews over time', fontsize=14, alpha=0.6)
# plot moving average
avg_ts.plot(label='Average number of reviews', fontsize=14)
plt.legend(fontsize=14)
#plt.savefig('moving_avg.png')
output:
Insights
- This analysis takes the number of reviews per day as an indicator of Airbnb business activity. It decreased dramatically after the outbreak of COVID-19 in March 2020.
- Before the pandemic, the number of reviews kept increasing, which indicates Airbnb's popularity was thriving. There is also a clear seasonality pattern up to mid-February 2020 (a quick check is sketched below).
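To probe that seasonality claim, a classical decomposition of the pre-pandemic window is a quick check. This is a sketch, not part of the original workflow: it assumes statsmodels is installed, and it reindexes to a complete daily range so the decomposition gets an unbroken series.
from statsmodels.tsa.seasonal import seasonal_decompose

# pre-pandemic window only; fill any missing days so the series is unbroken
pre_covid = ts["2019-07-10":"2020-02-15"].asfreq("D").fillna(0)
# period=7 looks for a weekly cycle
decomposition = seasonal_decompose(pre_covid, model="additive", period=7)
decomposition.plot()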
Interactive visualization with Plotly
# set up for plotly
r = go.Scatter(x=ts.index, y=ts.values,
               line=go.scatter.Line(color='red', width=1), opacity=0.8,
               name='Reviews', text=[f'Reviews: {x}' for x in ts.values])
layout = go.Layout(title='Number of Reviews over time', xaxis=dict(title='Date'),
                   yaxis=dict(title='Count'))
fig1 = go.Figure(data=[r], layout=layout)
iplot(fig1)
output:
The following snippet pushes the plotly figure to Chart Studio, which generates embedding information for hosting the interactive chart on web pages.
username = ''  # your Chart Studio username
api_key = ''   # your Chart Studio API key - go to profile > settings > regenerate key
chart_studio.tools.set_credentials_file(username=username, api_key=api_key)
py.plot(fig1, filename='review_over_time', auto_open=True)
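If you want the embed snippet itself, chart_studio.tools.get_embed builds an iframe from the hosted chart's URL. The URL below is a placeholder; substitute the one py.plot returns for your account:
import chart_studio.tools as tls

# placeholder URL - py.plot() returns the real one for your account
embed_html = tls.get_embed('https://plotly.com/~username/1')
print(embed_html)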
Part 2. Reviews via folium HeatMapWithTime plugin
Data processing
Count reviews received for each listing each month
As you probably already know, folium creates great interactive maps for visualization. Here I am going to create a heat map with a time dimension by using its HeatMapWithTime plugin; before that, there is still some processing work to complete in order to fit our data to the plugin. A simple Demo.
First, count how many reviews each listing received each day, then aggregate up to months, that is, record how many reviews each listing received each month. The reason for changing the granularity from day to month is that we don't want the final display to change too frequently, so the moving trend stays easy to spot.
Also, features like latitude and longitude will be kept for rendering the map.
def process_listing_total_count(df_review, df_listing):
    # process data for timestamped folium map: count reviews per listing per day
    df2 = df_review.groupby(['listing_id', 'date'])['id'].count().rename('review_count').reset_index()
    df2['date'] = pd.to_datetime(df2['date'])
    df2 = df2[df2['date'].isin(pd.date_range("2019-07-10", "2021-07-10"))]
    # roll the daily counts up into month-end buckets
    df2 = df2.groupby(['listing_id', pd.Grouper(key='date', freq='M')])['review_count'] \
             .sum().reset_index()
    merged = pd.merge(df_listing, df2, left_on='id', right_on='listing_id')
    #merged.to_pickle("timestamped_review.pkl")
    return merged
listings_with_total_review_count = process_listing_total_count(df_review, df_listing)
listings_with_total_review_count[['id','date','review_count','latitude','longitude']]
output:
 | id | date | review_count | latitude | longitude |
---|---|---|---|---|---|
0 | 5065 | 2019-08-31 | 1 | 20.042660 | -155.432590 |
1 | 5065 | 2020-02-29 | 3 | 20.042660 | -155.432590 |
2 | 5065 | 2020-03-31 | 2 | 20.042660 | -155.432590 |
3 | 5269 | 2019-07-31 | 1 | 20.027400 | -155.702000 |
4 | 5269 | 2019-09-30 | 3 | 20.027400 | -155.702000 |
... | ... | ... | ... | ... | ... |
116719 | 50599343 | 2021-07-31 | 2 | 19.607210 | -155.976120 |
116720 | 50682739 | 2021-07-31 | 1 | 21.286110 | -157.840150 |
116721 | 50710529 | 2021-07-31 | 1 | 19.700066 | -155.073502 |
116722 | 50736557 | 2021-07-31 | 1 | 21.360280 | -158.048220 |
116723 | 50752275 | 2021-07-31 | 1 | 22.220010 | -159.476470 |
116724 rows × 5 columns
Transform data to the form that folium can take
For the next step, our dataframe should look like the table below:
time-index | latitude | longitude | review_count |
---|---|---|---|
2019-07-31 | a list of latitudes | a list of longitudes | a list of review_counts |
2019-08-31 | a list of latitudes | a list of longitudes | a list of review_counts |
... | ... | ... | ... |
2021-07-31 | a list of latitudes | a list of longitudes | a list of review_counts |
review_count_time_map = listings_with_total_review_count.drop(['neighbourhood_group_cleansed', 'listing_id'], axis=1)
review_count_time_map = review_count_time_map.groupby('date').agg(lambda x: list(x))
review_count_time_map[['latitude','longitude','review_count']]
output:
date | latitude | longitude | review_count |
---|---|---|---|
2019-07-31 | [20.0274, 19.43081, 21.88151, 22.2208, 21.8813... | [-155.702, -155.88069, -159.47346, -159.46989,... | [1, 1, 4, 1, 3, 3, 1, 1, 2, 3, 1, 2, 1, 2, 1, ... |
2019-08-31 | [20.04266, 19.56604, 21.88151, 22.2208, 21.881... | [-155.43259, -155.96199, -159.47346, -159.4698... | [1, 2, 4, 2, 3, 5, 2, 3, 3, 3, 1, 1, 2, 1, 1, ... |
2019-09-30 | [20.0274, 21.88151, 22.2208, 21.88139, 19.6066... | [-155.702, -159.47346, -159.46989, -159.47248,... | [3, 4, 1, 3, 5, 4, 6, 4, 2, 2, 2, 3, 2, 1, 1, ... |
2019-10-31 | [20.0274, 19.43081, 19.56604, 21.88151, 22.220... | [-155.702, -155.88069, -155.96199, -159.47346,... | [2, 2, 1, 2, 1, 2, 4, 3, 1, 4, 3, 3, 4, 2, 1, ... |
2019-11-30 | [19.43081, 19.56604, 21.88151, 22.2208, 21.881... | [-155.88069, -155.96199, -159.47346, -159.4698... | [1, 1, 5, 1, 6, 1, 3, 2, 2, 2, 1, 2, 1, 3, 1, ... |
2019-12-31 | [19.43081, 19.56604, 21.88151, 22.2208, 21.881... | [-155.88069, -155.96199, -159.47346, -159.4698... | [1, 1, 1, 1, 4, 5, 2, 3, 1, 2, 1, 1, 2, 2, 2, ... |
2020-01-31 | [19.43081, 19.56604, 21.88151, 22.2208, 21.881... | [-155.88069, -155.96199, -159.47346, -159.4698... | [2, 1, 2, 2, 3, 1, 4, 2, 2, 1, 3, 1, 1, 4, 4, ... |
2020-02-29 | [20.04266, 19.43081, 21.88151, 21.88139, 19.60... | [-155.43259, -155.88069, -159.47346, -159.4724... | [3, 2, 2, 2, 1, 3, 2, 3, 2, 3, 4, 2, 1, 2, 2, ... |
2020-03-31 | [20.04266, 20.0274, 19.43081, 19.56604, 21.881... | [-155.43259, -155.702, -155.88069, -155.96199,... | [2, 1, 1, 1, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 2, ... |
2020-04-30 | [20.89861, 19.60885, 21.58052, 20.76291, 20.76... | [-156.68151, -155.96764, -158.10854, -156.4573... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
2020-05-31 | [19.57365, 20.72916, 20.76291, 20.87319, 19.72... | [-155.96716, -156.45055, -156.45734, -156.6745... | [1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
2020-06-30 | [19.57365, 21.28334, 20.72916, 19.43428, 19.81... | [-155.96716, -157.8379, -156.45055, -155.21609... | [2, 2, 2, 1, 1, 1, 1, 1, 1, 3, 1, 1, 2, 1, 2, ... |
2020-07-31 | [21.88151, 19.60668, 20.89861, 19.39402, 19.52... | [-159.47346, -155.97585, -156.68151, -154.9306... | [3, 2, 3, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 2, ... |
2020-08-31 | [21.88151, 19.52067, 21.27437, 19.48092, 21.28... | [-159.47346, -154.84706, -157.82043, -155.9064... | [2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 3, 1, 1, 3, ... |
2020-09-30 | [19.39402, 19.45962, 21.27918, 21.64094, 19.48... | [-154.9306, -155.88118, -157.82846, -158.06281... | [1, 1, 2, 1, 2, 2, 1, 1, 1, 2, 4, 1, 1, 1, 2, ... |
2020-10-31 | [20.72413, 19.52373, 19.57365, 19.59802, 19.48... | [-156.44767, -154.84746, -155.96716, -154.9389... | [1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, ... |
2020-11-30 | [21.88151, 22.2208, 22.21789, 20.89861, 20.757... | [-159.47346, -159.46989, -159.47184, -156.6815... | [1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 3, 2, 2, 1, 1, ... |
2020-12-31 | [19.56604, 21.88151, 22.2208, 19.39402, 19.520... | [-155.96199, -159.47346, -159.46989, -154.9306... | [1, 1, 1, 2, 2, 4, 4, 2, 2, 2, 3, 1, 1, 3, 1, ... |
2021-01-31 | [20.0274, 19.56604, 20.72413, 19.39402, 20.757... | [-155.702, -155.96199, -156.44767, -154.9306, ... | [1, 2, 2, 1, 3, 2, 1, 1, 1, 1, 4, 3, 2, 3, 1, ... |
2021-02-28 | [19.43081, 19.56604, 19.60668, 20.72413, 20.89... | [-155.88069, -155.96199, -155.97585, -156.4476... | [2, 1, 3, 1, 1, 1, 3, 2, 2, 3, 2, 2, 3, 1, 1, ... |
2021-03-31 | [19.43081, 19.56604, 22.2208, 19.60668, 22.217... | [-155.88069, -155.96199, -159.46989, -155.9758... | [3, 2, 1, 3, 1, 3, 5, 4, 2, 1, 2, 3, 3, 3, 3, ... |
2021-04-30 | [20.0274, 19.43081, 19.56604, 21.88151, 22.220... | [-155.702, -155.88069, -155.96199, -159.47346,... | [1, 1, 1, 1, 1, 1, 2, 3, 2, 1, 3, 3, 3, 2, 1, ... |
2021-05-31 | [19.43081, 19.56604, 21.88151, 22.2208, 21.881... | [-155.88069, -155.96199, -159.47346, -159.4698... | [1, 1, 2, 5, 3, 3, 4, 2, 2, 4, 4, 2, 3, 2, 1, ... |
2021-06-30 | [19.43081, 19.56604, 21.88151, 22.2208, 21.881... | [-155.88069, -155.96199, -159.47346, -159.4698... | [1, 3, 2, 3, 2, 1, 4, 2, 1, 2, 2, 3, 1, 2, 2, ... |
2021-07-31 | [22.21789, 19.39402, 19.59074, 19.52067, 22.22... | [-159.47184, -154.9306, -155.97143, -154.84706... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, ... |
Lastly, we extract the respective latitude, longitude, and review_count into the final form to pass into the plugin.
# transform data for folium
def generate_location_points(all_points):
    # all_points = pd.read_pickle("timestamped_review.pkl")
    # loc_points = all_points[all_points.neighbourhood_group_cleansed == location]
    loc_points = all_points.drop(['neighbourhood_group_cleansed', 'listing_id'], axis=1)
    loc_points = loc_points.groupby('date').agg(lambda x: list(x))
    # one frame per month: a list of [lat, lon, weight] triples
    to_draw = []
    for i in range(loc_points.shape[0]):
        single_draw = []
        for j in list(zip(loc_points.iloc[i].latitude, loc_points.iloc[i].longitude, loc_points.iloc[i].review_count)):
            single_draw.append(list(j))
        to_draw.append(single_draw)
    # matching label for each frame
    time_index = []
    for t in loc_points.index:
        time_index.append(t.strftime("%Y-%m-%d"))
    return to_draw, time_index
points, indices = generate_location_points(listings_with_total_review_count)
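It is worth sanity-checking the shape here: HeatMapWithTime expects one list of [lat, lon, weight] triples per frame, with indices carrying the matching frame labels:
print(indices[0])     # '2019-07-31'
print(points[0][:3])  # first three [lat, lon, review_count] triples for that month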
Visualization
# create folium object and add timestamp object
time_map = folium.Map([21.3487, -157.944], zoom_start=10.5)
hm = plugins.HeatMapWithTime(points, index=indices, auto_play=True, max_opacity=0.6)
hm.add_to(time_map)
# display map
#time_map
#time_map.save("index.html")
output:
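One caveat to keep in mind (my note, not part of the original pipeline): the underlying Leaflet heat layer tends to expect point weights in the [0, 1] range, so if raw counts wash the map out, normalize them first:
# scale review counts into [0, 1] before passing frames to HeatMapWithTime
max_count = max(w for frame in points for _, _, w in frame)
points_scaled = [[[lat, lon, w / max_count] for lat, lon, w in frame] for frame in points]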
Insights
- The heat map above indicates how the number of reviews of Airbnb listings changed on the island of Oahu from July 2019 to July 2021.
- The time variable is incremented by month; this can be adjusted to a bigger or smaller increment.
- The trend shown above matches the result acquired with matplotlib.
- Moreover, the timestamped visualization delivers a more approachable result to a variety of audiences.
- By taking advantage of the geographic information, we could monitor many other attributes, as long as we render them with suitable weights.