The Analysis of Restaurant Ratings and Their Locations¶
Abstract¶
As students, we often find ourselves deciding where to dine out. When looking for new restaurants, one of the first things we do is peruse Yelp ratings and reviews to see nearby restaurants and others' dining experiences there. Knowing this, we were curious whether businesses in proximity to universities have better or worse ratings, and whether student patrons rate restaurants lower or higher than other customers do.
To focus our study on venues frequented by students, we first filtered Yelp's dataset to include businesses under the "Restaurants" and "Coffee & Tea" categories. Then, we selected several universities in the US and Canada and applied the haversine formula to calculate the distance between each university and the restaurants within the same city. For our project, we defined "close to a university" as under 5 kilometers, with anything above being defined as "not close to a university." This distance was chosen due to its accessibility via public transportation and walking, aligning with typical student mobility. Our analysis revealed a trend: restaurants close to universities tend to have marginally higher ratings than those farther away. While the difference is small, it is statistically significant, suggesting a non-random association.
Next, we wanted to answer the question: "Does distance from a university affect a restaurant's rating?" Perhaps restaurants that are near universities have better food because the lease is more expensive and they need more customers to sustain their business. We answer this question using linear regression. We found that while distance from a university has a measurable effect on rating, the effect is negligible in practice.
Finally, we wanted to see whether students tend to give better or worse ratings than other customers, so we used keywords to classify whether a review was written by a student. We again use permutation tests to check whether any difference could be due to chance. Our conclusion was that students give lower ratings and that the difference between students and non-students is unlikely to be due to chance.
Research Question¶
Do restaurants further from college/university campuses have higher or lower ratings than those that are closer? Can we determine if the reviewers of these restaurants are students and if so, do students increase or lower Yelp ratings at these restaurants?
Background and Prior Work¶
Colleges and universities host many students every year, and as the number of students grows, the businesses around them grow as well. Some areas even become college towns whose businesses depend on students for their livelihood, as in Davis, home of UC Davis. As UCSD students, we were curious specifically about restaurants in the vicinity of a college or university. Because a restaurant’s success or profit is hard to measure and obtain, we settled on examining whether restaurants’ ratings are affected by their proximity to the university in their city.
One article published by QSR Magazine seems to indicate so, citing Datassential’s “College & University Keynote Report”, which found that 58% of students eat off campus and 49% consider themselves foodies who are more conscious about what they eat [1]. This suggests that more than half of the student body eats at restaurants around campus, and may review and recommend those restaurants. Another study, which looks more closely at customer satisfaction with food service, points out several factors that can contribute to a restaurant’s rating, such as food quality, service quality, decor quality, and most importantly price [2]. Its conclusion is that good service, followed by good food, was the best indicator of high customer satisfaction. These are significant attributes that our research question does not consider, and they could be a much greater factor in a restaurant’s rating than distance.
- [1] Baltazar, Amanda. “Restaurants Would Be Wise to Court College Students.” QSR Magazine, 7 July 2023, https://www.qsrmagazine.com/operations/business-advice/restaurants-would-be-wise-court-college-students/
- [2] Serhan, Mireille, and Carole Serhan. “The Impact of Food Service Attributes on Customer Satisfaction in a Rural University Campus Environment.” International Journal of Food Science, Hindawi, 31 Dec. 2019, https://www.hindawi.com/journals/ijfs/2019/2154548/
Hypothesis¶
We believe that restaurants farther from university/college campuses will have ratings similar to those closer to university/college campuses. While an influx of students in the area can have an impact on nearby restaurants, we believe that this impact will be negligible. This is because students don’t make up a majority of a restaurant’s clientele, especially in more metropolitan areas. Many other customers who live in the area or are traveling also rate these restaurants.
Data¶
- Dataset Name: Yelp Academic Dataset
- Link to the dataset: https://www.yelp.com/dataset
- Number of observations: 6990280 reviews, 150346 businesses
- Number of variables: 9 (reviews), 14 (businesses)
This dataset contains information on a selection of businesses, reviews, and users from Yelp, centered around different metropolitan areas. It is separated into five JSON files: business, review, checkin, tip, and user. We only use the business and review files. For businesses, the variables we care about are business_id, city, longitude, latitude, stars, and review_count. For reviews, the variables we care about are business_id, stars, and text.
Yelp Academic Dataset¶
# import necessary libraries
import numpy as np
import pandas as pd
First let's load the business data into a dataframe.
# load business data
business = pd.read_json('https://drive.usercontent.google.com/download?id=1HGtRB3g1Hx1t1j2vPqCdTEfG-WJtTFVN&confirm=xxx', lines=True)
Let's drop all the observations with missing values in the important columns.
business = business.dropna(subset=['latitude', 'longitude', 'stars', 'review_count', 'categories']) # no rows are actually dropped for these columns
business.head()
business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Pns2l4eNsfO8kk83dixA6A | Abby Rappoport, LAC, CMQ | 1616 Chapala St, Ste 2 | Santa Barbara | CA | 93101 | 34.426679 | -119.711197 | 5.0 | 7 | 0 | {'ByAppointmentOnly': 'True'} | Doctors, Traditional Chinese Medicine, Naturop... | None |
1 | mpf3x-BjTdTEA3yCZrAYPw | The UPS Store | 87 Grasso Plaza Shopping Center | Affton | MO | 63123 | 38.551126 | -90.335695 | 3.0 | 15 | 1 | {'BusinessAcceptsCreditCards': 'True'} | Shipping Centers, Local Services, Notaries, Ma... | {'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ... |
2 | tUFrWirKiKi_TAnsVWINQQ | Target | 5255 E Broadway Blvd | Tucson | AZ | 85711 | 32.223236 | -110.880452 | 3.5 | 22 | 0 | {'BikeParking': 'True', 'BusinessAcceptsCredit... | Department Stores, Shopping, Fashion, Home & G... | {'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ... |
3 | MTSW4McQd7CbVtyjqoe9mw | St Honore Pastries | 935 Race St | Philadelphia | PA | 19107 | 39.955505 | -75.155564 | 4.0 | 80 | 1 | {'RestaurantsDelivery': 'False', 'OutdoorSeati... | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ... |
4 | mWMc6_wTdE0EUBKIGXDVfA | Perkiomen Valley Brewery | 101 Walnut St | Green Lane | PA | 18054 | 40.338183 | -75.471659 | 4.5 | 13 | 1 | {'BusinessAcceptsCreditCards': 'True', 'Wheelc... | Brewpubs, Breweries, Food | {'Wednesday': '14:0-22:0', 'Thursday': '16:0-2... |
Restaurants and cafes are the businesses that we care about, so let's filter our business dataframe by category.
def identify_restaurants(data, keywords):
    keywords = [keyword.lower() for keyword in keywords]
    def check_categories(category):
        return any(keyword in category.lower() for keyword in keywords)
    return data[data['categories'].apply(check_categories)]
keywords = ['Restaurants', 'Coffee & Tea']
business = identify_restaurants(business, keywords)
business
business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | MTSW4McQd7CbVtyjqoe9mw | St Honore Pastries | 935 Race St | Philadelphia | PA | 19107 | 39.955505 | -75.155564 | 4.0 | 80 | 1 | {'RestaurantsDelivery': 'False', 'OutdoorSeati... | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ... |
5 | CF33F8-E6oudUQ46HnavjQ | Sonic Drive-In | 615 S Main St | Ashland City | TN | 37015 | 36.269593 | -87.058943 | 2.0 | 6 | 1 | {'BusinessParking': 'None', 'BusinessAcceptsCr... | Burgers, Fast Food, Sandwiches, Food, Ice Crea... | {'Monday': '0:0-0:0', 'Tuesday': '6:0-22:0', '... |
8 | k0hlBqXX-Bt0vf1op7Jr1w | Tsevi's Pub And Grill | 8025 Mackenzie Rd | Affton | MO | 63123 | 38.565165 | -90.321087 | 3.0 | 19 | 0 | {'Caters': 'True', 'Alcohol': 'u'full_bar'', '... | Pubs, Restaurants, Italian, Bars, American (Tr... | None |
9 | bBDDEgkFA1Otx9Lfe7BZUQ | Sonic Drive-In | 2312 Dickerson Pike | Nashville | TN | 37207 | 36.208102 | -86.768170 | 1.5 | 10 | 1 | {'RestaurantsAttire': ''casual'', 'Restaurants... | Ice Cream & Frozen Yogurt, Fast Food, Burgers,... | {'Monday': '0:0-0:0', 'Tuesday': '6:0-21:0', '... |
11 | eEOYSgkmpB90uNA7lDOMRA | Vietnamese Food Truck | Tampa Bay | FL | 33602 | 27.955269 | -82.456320 | 4.0 | 10 | 1 | {'Alcohol': ''none'', 'OutdoorSeating': 'None'... | Vietnamese, Food, Restaurants, Food Trucks | {'Monday': '11:0-14:0', 'Tuesday': '11:0-14:0'... | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
150327 | cM6V90ExQD6KMSU3rRB5ZA | Dutch Bros Coffee | 1181 N Milwaukee St | Boise | ID | 83704 | 43.615401 | -116.284689 | 4.0 | 33 | 1 | {'WiFi': ''free'', 'RestaurantsGoodForGroups':... | Cafes, Juice Bars & Smoothies, Coffee & Tea, R... | {'Monday': '0:0-0:0', 'Tuesday': '0:0-17:0', '... |
150328 | 1jx1sfgjgVg0nM6n3p0xWA | Savaya Coffee Market | 11177 N Oracle Rd | Oro Valley | AZ | 85737 | 32.409552 | -110.943073 | 4.5 | 41 | 1 | {'BusinessParking': '{'garage': False, 'street... | Specialty Food, Food, Coffee & Tea, Coffee Roa... | {'Monday': '0:0-0:0', 'Tuesday': '6:0-14:0', '... |
150336 | WnT9NIzQgLlILjPT0kEcsQ | Adelita Taqueria & Restaurant | 1108 S 9th St | Philadelphia | PA | 19147 | 39.935982 | -75.158665 | 4.5 | 35 | 1 | {'WheelchairAccessible': 'False', 'Restaurants... | Restaurants, Mexican | {'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'... |
150339 | 2O2K6SXPWv56amqxCECd4w | The Plum Pit | 4405 Pennell Rd | Aston | DE | 19014 | 39.856185 | -75.427725 | 4.5 | 14 | 1 | {'RestaurantsDelivery': 'False', 'BusinessAcce... | Restaurants, Comfort Food, Food, Food Trucks, ... | {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W... |
150340 | hn9Toz3s-Ei3uZPt7esExA | West Side Kebab House | 2470 Guardian Road NW | Edmonton | AB | T5T 1K8 | 53.509649 | -113.675999 | 4.5 | 18 | 0 | {'Ambience': '{'touristy': False, 'hipster': F... | Middle Eastern, Restaurants | {'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'... |
54918 rows × 14 columns
business['city'].value_counts()[:30]
city | count |
---|---|
Philadelphia | 6171 |
Tampa | 3119 |
Indianapolis | 3004 |
Tucson | 2623 |
Nashville | 2612 |
New Orleans | 2369 |
Edmonton | 2321 |
Saint Louis | 1836 |
Reno | 1396 |
Boise | 912 |
Santa Barbara | 812 |
Clearwater | 712 |
Wilmington | 638 |
St. Louis | 573 |
Metairie | 543 |
Saint Petersburg | 523 |
Franklin | 463 |
St. Petersburg | 427 |
Sparks | 363 |
Brandon | 342 |
Meridian | 338 |
Largo | 327 |
Carmel | 314 |
Cherry Hill | 313 |
West Chester | 292 |
New Port Richey | 238 |
Kenner | 236 |
Goleta | 232 |
Greenwood | 221 |
Fishers | 218 |
Name: count, dtype: int64
From above we can see that the dataset has data for restaurants from many different cities. We will pick a few and find universities/colleges within them to base our research around. Our initial selection was based on the top cities (those with the most businesses) in our dataset. However, upon further analysis, we identified that 51% of the cities in the top 29 in our dataset didn't have universities that fit our research criteria. We then refined our list to only include cities with universities within the city limits.
universities = pd.read_csv('./yelp_dataset/universities.csv')
universities = universities.set_index('City')
universities
Country | State | University Name | Latitude | Longitude | |
---|---|---|---|---|---|
City | |||||
Philadelphia | USA | Pennsylvania | University of Pennsylvania | 39.952583 | -75.191975 |
Tucson | USA | Arizona | University of Arizona Tuscon | 32.221664 | -110.948922 |
Tampa | USA | Florida | University of South Florida | 28.051836 | -82.400005 |
Indianapolis | USA | Indiana | Purdue University | 39.776709 | -86.170811 |
Nashville | USA | Tennessee | Vanderbilt University | 36.145532 | -86.804060 |
New Orleans | USA | Louisiana | Tulane University | 29.958586 | -90.064997 |
Reno | USA | Nevada | University of Nevada | 39.534642 | -119.812831 |
Edmonton | Canada | Alberta | University of Alberta | 53.523219 | -113.523219 |
St. Louis | USA | Missouri | Washington University in St. Louis | 38.628900 | -90.307200 |
Santa Barbara | USA | California | UC Santa Barbara | 34.413953 | -119.848956 |
Let's also filter our business dataframe to only include businesses inside these cities.
business = business[business['city'].isin(universities.index)]
business['city'].value_counts()
city | count |
---|---|
Philadelphia | 6171 |
Tampa | 3119 |
Indianapolis | 3004 |
Tucson | 2623 |
Nashville | 2612 |
New Orleans | 2369 |
Edmonton | 2321 |
Reno | 1396 |
Santa Barbara | 812 |
St. Louis | 573 |
Name: count, dtype: int64
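One thing the counts above suggest is that city spelling matters for this filter: the earlier listing shows 1836 businesses under "Saint Louis" and 573 under "St. Louis", and only the latter matches the university table. A minimal sketch of normalizing city names before filtering is shown below. The alias table and the `normalize_city` helper are hypothetical and not part of the original analysis.

```python
# Hypothetical alias table: alternative spellings mapped to the spelling
# used in the universities table. Extend as needed for other cities.
CITY_ALIASES = {
    "saint louis": "St. Louis",
    "st louis": "St. Louis",
}

def normalize_city(name):
    """Collapse extra whitespace and map known alternative spellings to one form."""
    cleaned = " ".join(name.split())
    return CITY_ALIASES.get(cleaned.lower(), cleaned)

for raw in ["Saint Louis", "St. Louis", "  saint   louis ", "Tampa"]:
    print(repr(raw), "->", repr(normalize_city(raw)))
```

With pandas, this could be applied as `business['city'] = business['city'].map(normalize_city)` before the `isin` filter. Doing so would change the St. Louis counts reported in this notebook, so it is offered only as an option.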
We can then use this information on universities to calculate whether a business is close or far from a university using their latitude and longitude positions.
# this is a lat long distance calculator from https://community.esri.com/t5/coordinate-reference-systems-blog/distance-on-a-sphere-the-haversine-formula/ba-p/902128#:~:text=All%20of%20these%20can%20be,longitude%20of%20the%20two%20points
import math

def haversine(coord1, coord2):
    # Coordinates in decimal degrees (e.g. 2.89078, 12.79797), unpacked as (lon, lat)
    lon1, lat1 = coord1
    lon2, lat2 = coord2
    R = 6371000 # radius of Earth in meters
    phi_1 = math.radians(lat1)
    phi_2 = math.radians(lat2)
    delta_phi = math.radians(lat2 - lat1)
    delta_lambda = math.radians(lon2 - lon1)
    a = math.sin(delta_phi / 2.0)**2 + math.cos(phi_1) * math.cos(phi_2) * math.sin(delta_lambda / 2.0)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    meters = R * c # output distance in meters
    km = meters / 1000.0 # output distance in kilometers
    km = round(km, 3)
    return km

def calc_distance(df):
    # df is a single business row, because this function is applied with axis=1
    new_df = df.copy()
    lat, long = universities.loc[df['city'], ['Latitude', 'Longitude']]
    lat1, long1 = df[['latitude', 'longitude']]
    # NOTE: haversine() unpacks each tuple as (lon, lat), but (lat, long) is passed here,
    # so latitude and longitude are swapped inside the formula. The outputs shown in this
    # notebook were produced with this ordering.
    # threshold 5 km
    new_df['close_to_university'] = haversine((lat, long), (lat1, long1)) < 5
    return new_df
We decided to use a threshold value of 5 km. A business is considered close to a university/college if it is within 5 km of one, and far otherwise.
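Because the helper above unpacks each coordinate tuple as (lon, lat) while it is called with (lat, long), argument order is easy to get wrong. The sketch below is an alternative formulation of the same haversine formula that takes latitude and longitude as separately named parameters. The names `haversine_km` and `is_close` are illustrative assumptions, not functions from the original notebook. The example coordinates are the University of Pennsylvania and St Honore Pastries, taken from the tables in this notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, matching the 6371000 m used above

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # min() guards against a creeping just above 1.0 through floating-point rounding
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

def is_close(lat1, lon1, lat2, lon2, threshold_km=5.0):
    """Apply the 5 km 'close to a university' rule described in the text."""
    return haversine_km(lat1, lon1, lat2, lon2) < threshold_km

# University of Pennsylvania vs. St Honore Pastries (Philadelphia)
d = haversine_km(39.952583, -75.191975, 39.955505, -75.155564)
print(round(d, 2), is_close(39.952583, -75.191975, 39.955505, -75.155564))
```

With this ordering, the two Philadelphia points come out roughly 3.1 km apart. A quick sanity check for any haversine implementation is that one degree of latitude is about 111 km.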
business = business.apply(calc_distance, axis=1)
Now let's create a dataframe for our reviews. There are simply too many reviews (6990280!) for the kernel to load into memory, so we used Python to cut the file down to just the first 100000 reviews.
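The truncation script itself is not shown in this notebook. One way it could be done, sketched under the assumption that the review file is in JSON Lines format (one JSON object per line, which is what `lines=True` implies), is to copy only the first N lines without loading the whole file. The function name and the file paths in the comment are hypothetical.

```python
import itertools

def head_json_lines(src_path, dst_path, n_lines):
    """Copy the first n_lines lines of a JSON Lines file; return how many were written."""
    written = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        # islice reads lazily, so the full file is never held in memory
        for line in itertools.islice(src, n_lines):
            dst.write(line)
            written += 1
    return written

# Example call (paths are assumptions, adjust to wherever the dataset was unpacked):
# head_json_lines("yelp_academic_dataset_review.json", "review_first_100k.json", 100_000)
```

Note that taking the first 100000 lines is not a random sample. If the file is ordered in any way, the subset may not be representative of all reviews.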
review = pd.read_json('https://drive.usercontent.google.com/download?id=1xE5dbDWd1Mp8kFQwoMmtLVj5xPq9tpuG&confirm=xxx', lines=True)
review.head()
review_id | user_id | business_id | stars | useful | funny | cool | text | date | |
---|---|---|---|---|---|---|---|---|---|
0 | KU_O5udG6zpxOg-VcAEodg | mh_-eMZ6K5RLWhZyISBhwA | XQfwVwDr-v0ZS3_CbbE5Xw | 3 | 0 | 0 | 0 | If you decide to eat here, just be aware it is... | 2018-07-07 22:09:11 |
1 | BiTunyQ73aT9WBnpR9DZGw | OyoGAe7OKpv6SyGZT5g77Q | 7ATYjTIgM3jUlt4UM3IypQ | 5 | 1 | 0 | 1 | I've taken a lot of spin classes over the year... | 2012-01-03 15:28:18 |
2 | saUsX_uimxRlCVr67Z4Jig | 8g_iMtfSiwikVnbP2etR0A | YjUWPpI6HXG530lwP-fb2A | 3 | 0 | 0 | 0 | Family diner. Had the buffet. Eclectic assortm... | 2014-02-05 20:30:30 |
3 | AqPFMleE6RsU23_auESxiA | _7bHUi9Uuf5__HHc_Q8guQ | kxX2SOes4o-D3ZQBkiMRfA | 5 | 1 | 0 | 1 | Wow! Yummy, different, delicious. Our favo... | 2015-01-04 00:01:03 |
4 | Sx8TMOWLNuJBWer-0pcmoA | bcjbaE6dDog4jkNY91ncLQ | e4Vwtrqf-wpJfwesgvdgxQ | 4 | 1 | 0 | 1 | Cute interior and owner (?) gave us tour of up... | 2017-01-14 20:54:15 |
Again, we drop observations with missing values in the important variables.
review = review.dropna(subset=['stars', 'text'])
review
review_id | user_id | business_id | stars | useful | funny | cool | text | date | |
---|---|---|---|---|---|---|---|---|---|
0 | KU_O5udG6zpxOg-VcAEodg | mh_-eMZ6K5RLWhZyISBhwA | XQfwVwDr-v0ZS3_CbbE5Xw | 3 | 0 | 0 | 0 | If you decide to eat here, just be aware it is... | 2018-07-07 22:09:11 |
1 | BiTunyQ73aT9WBnpR9DZGw | OyoGAe7OKpv6SyGZT5g77Q | 7ATYjTIgM3jUlt4UM3IypQ | 5 | 1 | 0 | 1 | I've taken a lot of spin classes over the year... | 2012-01-03 15:28:18 |
2 | saUsX_uimxRlCVr67Z4Jig | 8g_iMtfSiwikVnbP2etR0A | YjUWPpI6HXG530lwP-fb2A | 3 | 0 | 0 | 0 | Family diner. Had the buffet. Eclectic assortm... | 2014-02-05 20:30:30 |
3 | AqPFMleE6RsU23_auESxiA | _7bHUi9Uuf5__HHc_Q8guQ | kxX2SOes4o-D3ZQBkiMRfA | 5 | 1 | 0 | 1 | Wow! Yummy, different, delicious. Our favo... | 2015-01-04 00:01:03 |
4 | Sx8TMOWLNuJBWer-0pcmoA | bcjbaE6dDog4jkNY91ncLQ | e4Vwtrqf-wpJfwesgvdgxQ | 4 | 1 | 0 | 1 | Cute interior and owner (?) gave us tour of up... | 2017-01-14 20:54:15 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
99995 | pAEbIxvr6ebx2bHc1XvguA | SMH5CeiLvKx61lKwtLZ_PA | lV0k3BnslFRkuWD_kbKd0Q | 4 | 0 | 0 | 0 | Came here for lunch with a group. They were bu... | 2018-05-30 22:28:56 |
99996 | xH1AoE-4nf2ECGQJRjO4_g | 2clTdtp-BjphxLjN83CpUA | G0xz3kyRhRi6oZl7KfR0pA | 1 | 1 | 0 | 0 | The equipment is so old and so felty! I just u... | 2015-04-05 23:31:52 |
99997 | GatIbXTz-WDru5emONUSIg | MRrN6DH3QGCFcDv5RENYVg | C4lZdhasjZVQyDlOiXY1sA | 4 | 0 | 0 | 0 | This is one of my favorite Mexican restaurants... | 2016-06-04 00:59:15 |
99998 | 6NfkodAdhvI89xONXuBC3A | rnNQzeKJbvqVCsYsL10mkQ | dChRGpit9fM_kZK5pafNyA | 2 | 0 | 0 | 0 | Came here for brunch - had an omlette ($19 + t... | 2018-06-11 12:45:08 |
99999 | sJ1BMq7lkKgOWEFx3n6ZRw | _BcWyKQL16ndpBdggh2kNA | hMcgO98QaOFmQVTfCUeGzw | 5 | 0 | 0 | 0 | Came in for my 5-6 month prophy and saw Kara -... | 2013-06-06 10:10:33 |
100000 rows × 9 columns
To figure out which reviews were written by college students, we flagged reviews that mention words related to them, such as student, students, college, colleges, university, universities, and uni. We chose these keywords in the hope that a reviewer who mentions their status as a student does so to establish credibility with other readers.
We also want to disclose the concerns with this method. By identifying student reviews this way, we could be counting false positives and missing students who never mention their status (false negatives). Here, a false positive is when the keyword search incorrectly classifies a non-student's review as a student's, while a false negative is when it incorrectly classifies a student's review as a non-student's. Here are a few examples:
- False Positive: A non-student brought up a "college" nearby.
- False Positive: The "students" mentioned in a review could be students in high school, not college.
- False Positive: A non-student talks about how a restaurant is often visited by many "students".
- False Negative: The review could have been written by a student, but the person did not mention that they were a student in their review.
def identify_student_reviews(data, keywords):
    keywords = [keyword.lower() for keyword in keywords]
    def check_keywords(review):
        return any(keyword in review.lower() for keyword in keywords)
    data['student_or_not'] = data['text'].apply(check_keywords)
    return data
keywords = ['student', 'students', 'college', 'colleges', 'university', 'universities', 'uni', "univ", "penn", "upenn", "ua", "uarizona", "usf", "purdue", "vanderbilt", "vandy", "vu", "unr", "u of a", "ualberta", "washu", "wustl", "ucsb", "uc"]
review_student = identify_student_reviews(review, keywords)
We have included some common abbreviations of the universities/colleges we are focusing our analysis around. Below are the abbreviations used for each university:
- University of Pennsylvania
  - Penn
  - UPenn
- University of Arizona, Tucson
  - UA
  - UArizona
- University of South Florida
  - USF
- Purdue University
  - Purdue
- Vanderbilt University
  - Vanderbilt
  - Vandy
  - VU
- Tulane University
  - We could not find any informal names for Tulane University.
- University of Nevada
  - UNR
- University of Alberta
  - U of A
  - UAlberta
- Washington University in St. Louis
  - WashU
  - WUSTL
- UC Santa Barbara
  - UCSB
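One caveat with short keywords such as "ua", "uc", and "uni" is that a plain substring check also fires inside ordinary words ("usual", "much", "unique"), which adds another source of false positives. A possible refinement, sketched below, is to match keywords only as whole words using regular-expression word boundaries. The helper names and the sample sentences are made up for illustration, and this is not the method used to produce the results in this notebook.

```python
import re

def build_keyword_pattern(keywords):
    """Compile one case-insensitive regex that matches any keyword as a whole word."""
    # Longest first, so multi-word names such as "u of a" are tried before shorter ones
    ordered = sorted({k.lower() for k in keywords}, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

sample_keywords = ["student", "students", "college", "uni", "ua", "uc", "ucsb", "u of a"]
pattern = build_keyword_pattern(sample_keywords)

def mentions_keyword(text):
    """True if the text contains any keyword as a standalone word or phrase."""
    return pattern.search(text) is not None

def substring_match(text):
    """The plain substring check, for comparison."""
    return any(k in text.lower() for k in sample_keywords)

plain = "We had so much food, and the usual crowd was there."
student = "As a UCSB student I come here between classes."
print(substring_match(plain), mentions_keyword(plain))
print(substring_match(student), mentions_keyword(student))
```

The first sentence is flagged by the substring check (through "much" and "usual") but not by the whole-word pattern, while the second is flagged by both. Whole-word matching does not address the other concerns listed earlier, such as non-students mentioning a nearby college.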
review_student.head()
review_id | user_id | business_id | stars | useful | funny | cool | text | date | student_or_not | |
---|---|---|---|---|---|---|---|---|---|---|
0 | KU_O5udG6zpxOg-VcAEodg | mh_-eMZ6K5RLWhZyISBhwA | XQfwVwDr-v0ZS3_CbbE5Xw | 3 | 0 | 0 | 0 | If you decide to eat here, just be aware it is... | 2018-07-07 22:09:11 | True |
1 | BiTunyQ73aT9WBnpR9DZGw | OyoGAe7OKpv6SyGZT5g77Q | 7ATYjTIgM3jUlt4UM3IypQ | 5 | 1 | 0 | 1 | I've taken a lot of spin classes over the year... | 2012-01-03 15:28:18 | True |
2 | saUsX_uimxRlCVr67Z4Jig | 8g_iMtfSiwikVnbP2etR0A | YjUWPpI6HXG530lwP-fb2A | 3 | 0 | 0 | 0 | Family diner. Had the buffet. Eclectic assortm... | 2014-02-05 20:30:30 | True |
3 | AqPFMleE6RsU23_auESxiA | _7bHUi9Uuf5__HHc_Q8guQ | kxX2SOes4o-D3ZQBkiMRfA | 5 | 1 | 0 | 1 | Wow! Yummy, different, delicious. Our favo... | 2015-01-04 00:01:03 | False |
4 | Sx8TMOWLNuJBWer-0pcmoA | bcjbaE6dDog4jkNY91ncLQ | e4Vwtrqf-wpJfwesgvdgxQ | 4 | 1 | 0 | 1 | Cute interior and owner (?) gave us tour of up... | 2017-01-14 20:54:15 | True |
We can then inner join these two dataframes on business_id to get a dataframe in which each observation is a review, with information on whether the review was written by a student and whether the reviewed business is close to or far from a university/college.
review_businesses = pd.merge(review_student, business, how='inner', on='business_id')
Let's reduce this dataframe to the columns that we care about, namely the review rating, whether it was written by a student, the business's average rating, the number of reviews that business has, and whether that business is close to or far from a university/college.
review_businesses = review_businesses[['city', 'name', 'stars_x', 'student_or_not', 'stars_y', 'review_count', 'close_to_university']]
review_businesses = review_businesses.rename(columns={'name': 'restaurant_name', 'stars_x': 'rating', 'stars_y': 'avg_rating'})
review_businesses
city | restaurant_name | rating | student_or_not | avg_rating | review_count | close_to_university | |
---|---|---|---|---|---|---|---|
0 | Tucson | Kettle Restaurant | 3 | True | 3.5 | 47 | True |
1 | Tucson | Kettle Restaurant | 2 | True | 3.5 | 47 | True |
2 | Tucson | Kettle Restaurant | 5 | False | 3.5 | 47 | True |
3 | Tucson | Kettle Restaurant | 5 | False | 3.5 | 47 | True |
4 | Tucson | Kettle Restaurant | 3 | False | 3.5 | 47 | True |
... | ... | ... | ... | ... | ... | ... | ... |
44313 | Edmonton | Dairy Queen Grill & Chill | 1 | True | 2.0 | 6 | True |
44314 | Tampa | Grand China | 5 | False | 3.5 | 19 | False |
44315 | Philadelphia | Dough Boy Pizza | 5 | False | 4.5 | 11 | False |
44316 | Tucson | Burger King | 3 | False | 1.5 | 21 | True |
44317 | Edmonton | Versato's Pizza | 5 | False | 4.5 | 24 | False |
44318 rows × 7 columns
Note that many restaurants share the same name. Some are different locations of the same chain, which we keep because each location can differ in whether it is close to or far from a university. In other cases, repeated names simply reflect multiple reviews of the same restaurant.
Results¶
Exploratory Data Analysis¶
Now we can use the combined reviews and businesses dataframe to answer our research question: Do restaurants further from college/university campuses have higher or lower ratings than those that are closer? Can we determine if the reviewers of these restaurants are students and if so, do students increase or lower Yelp ratings at these restaurants?
# import all the packages we need
import seaborn as sns
import matplotlib.pyplot as plt
import patsy
import statsmodels.api as sm
Average Ratings of Close vs Far Businesses¶
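First let's look at the number of reviews for restaurants that are close to and far from a university/college. Each row of review_businesses is one review, so the counts below are review counts rather than counts of distinct businesses.
How many reviews are for businesses close to a university?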
review_businesses[review_businesses['close_to_university'] == True].shape[0]
30248
How many reviews are for businesses not close to a university?
review_businesses[review_businesses['close_to_university'] == False].shape[0]
14070
Average rating for businesses close to the university (averaged over review rows)
review_businesses[review_businesses['close_to_university'] == True]['avg_rating'].mean()
3.8469981486379266
Average rating for businesses not close to the university (averaged over review rows)
review_businesses[review_businesses['close_to_university'] == False]['avg_rating'].mean()
3.737775408670931
test_statistic = review_businesses[review_businesses['close_to_university'] == True]['avg_rating'].mean() - review_businesses[review_businesses['close_to_university'] == False]['avg_rating'].mean()
test_statistic
0.10922273996699561
When we compare the average ratings of restaurants close to and far from university campuses, the average rating near universities is higher, but only by about 0.11 stars on a 5-star scale. Because this difference is so small, we consider the average ratings similar, which suggests that proximity to a university has little practical impact on a restaurant's average rating.
We'll use this difference as the test statistic to check whether a difference this large could plausibly arise by chance.
def permutation_tests():
    diff_array = list()
    shuffled_df = review_businesses.copy()
    for i in range(1000):
        shuffled_df['close_to_university'] = np.random.permutation(shuffled_df['close_to_university'])
        close = shuffled_df[shuffled_df['close_to_university'] == True]['avg_rating'].mean()
        not_close = shuffled_df[shuffled_df['close_to_university'] == False]['avg_rating'].mean()
        diff_array.append((close-not_close))
    return np.array(diff_array)
results = permutation_tests()
plt.hist(results)
plt.xlabel('Difference in Means')
plt.ylabel('Count')
results.mean()
np.mean(test_statistic < results)
0.0
None of the 1000 permuted differences exceeded the observed difference, so the estimated p-value is 0.0 (that is, below 0.001). Since this is much less than 0.05, the result is statistically significant, meaning that it is very unlikely that this relation is due to chance. However, even though it is statistically significant, the difference between the average ratings is still very small, which means that distance doesn't have much of an impact on the overall average rating.
Let's check the average ratings of these restaurants per university to see if we can find a difference from the above result. First let's see the number of reviews for close and far restaurants per university. (Note that since we only chose one university per city, the city name works as a groupby value.)
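The permutation procedure can also be sketched in isolation on toy data, which makes the logic easier to check. The example below uses only the standard library and made-up ratings. The function name `permutation_p_value` and the add-one correction, which keeps the estimated p-value from being reported as exactly zero, are assumptions of this sketch and not part of the original analysis.

```python
import random

def permutation_p_value(values, labels, n_perm=1000, seed=0):
    """One-sided permutation test for mean(values where label) - mean(values where not label)."""
    rng = random.Random(seed)

    def mean_diff(current_labels):
        group_a = [v for v, flag in zip(values, current_labels) if flag]
        group_b = [v for v, flag in zip(values, current_labels) if not flag]
        return sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

    observed = mean_diff(labels)
    shuffled = list(labels)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real link between label and value
        if mean_diff(shuffled) >= observed:
            at_least_as_large += 1
    # Add-one correction: the observed labeling counts as one of the permutations
    p_value = (at_least_as_large + 1) / (n_perm + 1)
    return observed, p_value

# Made-up ratings: the first five belong to the "close" group
toy_ratings = [4.5, 4.0, 4.5, 5.0, 4.0, 3.0, 2.5, 3.5, 3.0, 2.0]
toy_close = [True] * 5 + [False] * 5
observed, p_value = permutation_p_value(toy_ratings, toy_close)
print(observed, p_value)
```

With 1000 permutations, the smallest p-value this version can report is 1/1001, which is a more conservative way to state a result where no permuted difference reaches the observed one.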
rb_counts = pd.DataFrame(review_businesses.groupby('city')['close_to_university'].value_counts())
rb_counts
count | ||
---|---|---|
city | close_to_university | |
Edmonton | True | 786 |
False | 309 | |
Indianapolis | True | 2382 |
False | 1290 | |
Nashville | True | 3874 |
False | 1549 | |
New Orleans | True | 7911 |
False | 508 | |
Philadelphia | True | 9820 |
False | 2797 | |
Reno | True | 2271 |
False | 664 | |
Santa Barbara | False | 2125 |
True | 2 | |
St. Louis | False | 306 |
True | 79 | |
Tampa | False | 3385 |
True | 768 | |
Tucson | True | 2355 |
False | 1137 |
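For most cities in the list above, there is a large disparity between the number of reviews for restaurants that are close and for restaurants that are far. Santa Barbara is the extreme case, with only 2 reviews in the close group, so its close-group average below should be read with caution.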
rb_avgs = pd.DataFrame(review_businesses.groupby(['city', 'close_to_university'])['avg_rating'].mean())
rb_avgs
avg_rating | ||
---|---|---|
city | close_to_university | |
Edmonton | False | 3.500000 |
True | 3.666667 | |
Indianapolis | False | 3.724031 |
True | 3.930940 | |
Nashville | False | 3.559716 |
True | 3.809112 | |
New Orleans | False | 3.820866 |
True | 3.938693 | |
Philadelphia | False | 3.747408 |
True | 3.790733 | |
Reno | False | 3.572289 |
True | 3.854690 | |
Santa Barbara | False | 3.900000 |
True | 3.500000 | |
St. Louis | False | 3.767974 |
True | 2.955696 | |
Tampa | False | 3.771935 |
True | 3.769531 | |
Tucson | False | 3.683377 |
True | 3.859236 |
In seven of the ten cities above, restaurants that are farther from the university/college receive lower ratings on average than restaurants that are close. Santa Barbara, St. Louis, and Tampa are the exceptions.
Linear Regression on Distance vs Ratings¶
We can perform linear regression to see if distance from a chosen university is a predictor for Yelp ratings.
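As a reminder of what a simple linear regression of rating on distance estimates, here is a small closed-form least-squares sketch on made-up numbers. It uses only the standard library. The notebook imports patsy and statsmodels, which would presumably be used for the real fit, so the `fit_line` helper and the toy data here are purely illustrative assumptions.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared

# Made-up data: distance from campus in km and a restaurant's star rating
toy_distance = [1.0, 2.0, 4.0, 6.0, 9.0, 12.0]
toy_rating = [4.0, 4.5, 3.5, 4.0, 3.5, 3.5]
slope, intercept, r_squared = fit_line(toy_distance, toy_rating)
print(slope, intercept, r_squared)
```

The slope is the expected change in rating per additional kilometer, and R-squared indicates how much of the variation in ratings the distance explains. A slope can be statistically distinguishable from zero and still be too small to matter in practice.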
First let's create a new column for our dataframe that contains the distance from a university/college. We can adapt our calc_distance function from before to calculate distance by removing the threshold check.
def calc_distance(df):
    new_df = df.copy()
    lat, long = universities.loc[df['city'], ['Latitude', 'Longitude']]
    lat1, long1 = df[['latitude', 'longitude']]
    new_df['distance'] = haversine((lat, long), (lat1, long1))
    return new_df
business_lg = business.apply(calc_distance, axis=1)
business_lg
business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours | close_to_university | distance | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | MTSW4McQd7CbVtyjqoe9mw | St Honore Pastries | 935 Race St | Philadelphia | PA | 19107 | 39.955505 | -75.155564 | 4.0 | 80 | 1 | {'RestaurantsDelivery': 'False', 'OutdoorSeati... | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ... | True | 4.050 |
9 | bBDDEgkFA1Otx9Lfe7BZUQ | Sonic Drive-In | 2312 Dickerson Pike | Nashville | TN | 37207 | 36.208102 | -86.768170 | 1.5 | 10 | 1 | {'RestaurantsAttire': ''casual'', 'Restaurants... | Ice Cream & Frozen Yogurt, Fast Food, Burgers,... | {'Monday': '0:0-0:0', 'Tuesday': '6:0-21:0', '... | True | 4.010 |
12 | il_Ro8jwPlHresjw9EGmBg | Denny's | 8901 US 31 S | Indianapolis | IN | 46227 | 39.637133 | -86.127217 | 2.5 | 28 | 1 | {'RestaurantsReservations': 'False', 'Restaura... | American (Traditional), Restaurants, Diners, B... | {'Monday': '6:0-22:0', 'Tuesday': '6:0-22:0', ... | True | 4.958 |
15 | MUTTqe8uqyMdBl186RmNeA | Tuna Bar | 205 Race St | Philadelphia | PA | 19106 | 39.953949 | -75.143226 | 4.0 | 245 | 1 | {'RestaurantsReservations': 'True', 'Restauran... | Sushi Bars, Restaurants, Japanese | {'Tuesday': '13:30-22:0', 'Wednesday': '13:30-... | False | 5.421 |
19 | ROeacJQwBeh05Rqg7F6TCg | BAP | 1224 South St | Philadelphia | PA | 19147 | 39.943223 | -75.162568 | 4.5 | 205 | 1 | {'NoiseLevel': 'u'quiet'', 'GoodForMeal': '{'d... | Korean, Restaurants | {'Monday': '11:30-20:30', 'Tuesday': '11:30-20... | True | 3.281 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
150319 | 8n93L-ilMAsvwUatarykSg | Kitchen Gia | 3716 Spruce St | Philadelphia | PA | 19104 | 39.951018 | -75.198240 | 3.0 | 22 | 0 | {'RestaurantsGoodForGroups': 'True', 'BikePark... | Coffee & Tea, Food, Sandwiches, American (Trad... | {'Monday': '9:0-19:30', 'Tuesday': '9:0-19:30'... | True | 0.698 |
150321 | AM7O0cwkxm6w_e0Q7-f9FQ | Starbucks | 8817 S US-31 | Indianapolis | IN | 46227 | 39.638245 | -86.128069 | 4.0 | 29 | 1 | {'RestaurantsPriceRange2': '1', 'Caters': 'Fal... | Food, Coffee & Tea | {'Monday': '6:0-21:0', 'Tuesday': '6:0-21:0', ... | True | 4.864 |
150322 | 2MAQeAqmD8enCT2ZYqUgIQ | The Melting Pot - Nashville | 166 2nd Ave N, Ste A | Nashville | TN | 37201 | 36.163875 | -86.776311 | 4.0 | 204 | 0 | {'RestaurantsDelivery': 'False', 'RestaurantsR... | Fondue, Beer, Wine & Spirits, Food, Restaurants | {'Monday': '0:0-0:0', 'Tuesday': '16:0-21:0', ... | True | 3.088 |
150336 | WnT9NIzQgLlILjPT0kEcsQ | Adelita Taqueria & Restaurant | 1108 S 9th St | Philadelphia | PA | 19147 | 39.935982 | -75.158665 | 4.5 | 35 | 1 | {'WheelchairAccessible': 'False', 'Restaurants... | Restaurants, Mexican | {'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'... | True | 3.734 |
150340 | hn9Toz3s-Ei3uZPt7esExA | West Side Kebab House | 2470 Guardian Road NW | Edmonton | AB | T5T 1K8 | 53.509649 | -113.675999 | 4.5 | 18 | 0 | {'Ambience': '{'touristy': False, 'hipster': F... | Middle Eastern, Restaurants | {'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'... | False | 16.999 |
25000 rows × 16 columns
review_business_lg = pd.merge(review_student, business_lg, how='inner', on='business_id')
review_business_lg = review_business_lg[['city', 'name', 'stars_x', 'student_or_not', 'stars_y', 'review_count', 'close_to_university', 'distance']]
review_business_lg = review_business_lg.rename(columns={'name': 'restaurant_name', 'stars_x': 'rating', 'stars_y': 'avg_rating'})
Let's check the distribution of distances.
sns.histplot(data=business_lg['distance'])
f1 = plt.gcf()
It looks like there's an outlier. Let's remove it and check the distribution again.
business_lg = business_lg.drop(business_lg['distance'].idxmax())
sns.histplot(data=business_lg['distance'])
f1 = plt.gcf()
Let's check the scatterplot to see if we can spot a linear relation between the Yelp ratings of restaurants and the distance they are from a university/college.
sns.scatterplot(data=review_business_lg, y='rating', x='distance')
<AxesSubplot:xlabel='distance', ylabel='rating'>
sns.scatterplot(data=review_business_lg, x='distance', y='avg_rating')
<AxesSubplot:xlabel='distance', ylabel='avg_rating'>
From the scatterplots there seems to be no relation between Yelp ratings and distance, as there are about as many low ratings as high ratings for restaurants at every distance from a university/college. (Note that the scatterplots look like this because Yelp ratings take only discrete values.) If we were to draw a line through these points, it should look flat. Let's check using linear regression.
outcome, predictors = patsy.dmatrices('rating ~ distance', review_business_lg)
mod = sm.OLS(outcome, predictors)
res_1 = mod.fit()
print(res_1.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 rating   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     41.19
Date:                Tue, 19 Mar 2024   Prob (F-statistic):           1.40e-10
Time:                        21:53:49   Log-Likelihood:                -74091.
No. Observations:               44318   AIC:                         1.482e+05
Df Residuals:                   44316   BIC:                         1.482e+05
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.8933      0.008    459.921      0.000       3.877       3.910
distance      -0.0078      0.001     -6.418      0.000      -0.010      -0.005
==============================================================================
Omnibus:                     5044.150   Durbin-Watson:                   1.701
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             6916.979
Skew:                          -0.962   Prob(JB):                         0.00
Kurtosis:                       2.795   Cond. No.                         9.68
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
outcome_2, predictors_2 = patsy.dmatrices('avg_rating ~ distance', review_business_lg)
mod_2 = sm.OLS(outcome_2, predictors_2)
res_2 = mod_2.fit()
print(res_2.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:             avg_rating   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     351.9
Date:                Tue, 19 Mar 2024   Prob (F-statistic):           3.34e-78
Time:                        21:53:50   Log-Likelihood:                -36320.
No. Observations:               44318   AIC:                         7.264e+04
Df Residuals:                   44316   BIC:                         7.266e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.8591      0.004   1069.028      0.000       3.852       3.866
distance      -0.0098      0.001    -18.758      0.000      -0.011      -0.009
==============================================================================
Omnibus:                     7562.557   Durbin-Watson:                   0.181
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            14489.829
Skew:                          -1.058   Prob(JB):                         0.00
Kurtosis:                       4.835   Cond. No.                         9.68
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In both regressions (the first predicting an individual review's rating, the second a restaurant's average rating, both from distance), the p-value in the distance row is very small, so the coefficient is statistically significant and likely nonzero. However, the coefficient itself is very small (-0.0078 and -0.0098 stars per kilometer), so distance has little effect on either review rating or average rating.
However, for the average ratings, each restaurant's value is repeated once per review in the merged dataframe, and these repeated values could affect the model. Let's go back to the business_lg dataframe and run the regression on that instead.
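To illustrate why repeated values are a concern, here is a minimal sketch on synthetic data (the distances, ratings, noise level, and the helper `slope_and_se` are all hypothetical, not taken from the Yelp dataset). It fits a simple regression once on one row per business and once after repeating every row five times, as a merge with reviews would do.

```python
import numpy as np

def slope_and_se(x, y):
    """Return the OLS slope of y on x and its (nonrobust) standard error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    x_centered = x - x.mean()
    sxx = np.sum(x_centered ** 2)
    slope = np.sum(x_centered * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)
    sigma2 = np.sum(residuals ** 2) / (n - 2)  # residual variance estimate
    return slope, np.sqrt(sigma2 / sxx)

# Hypothetical data: 200 businesses with a weak negative distance effect.
rng = np.random.default_rng(0)
distance = rng.uniform(0, 20, size=200)
stars = 3.7 - 0.01 * distance + rng.normal(0, 0.8, size=200)

slope_once, se_once = slope_and_se(distance, stars)

# Repeat each business 5 times, as if every business had 5 reviews.
slope_rep, se_rep = slope_and_se(np.repeat(distance, 5), np.repeat(stars, 5))

print("one row per business:", slope_once, se_once)
print("rows repeated 5x:    ", slope_rep, se_rep)
```

The slope is identical in both fits, but the standard error shrinks when rows are repeated, so the p-value looks stronger without any new information. This is one reason a regression on one row per business is a useful check.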
outcome_3, predictors_3 = patsy.dmatrices('stars ~ distance', business_lg)
mod_3 = sm.OLS(outcome_3, predictors_3)
res_3 = mod_3.fit()
print(res_3.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  stars   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     193.8
Date:                Tue, 19 Mar 2024   Prob (F-statistic):           6.83e-44
Time:                        21:53:50   Log-Likelihood:                -30688.
No. Observations:               24999   AIC:                         6.138e+04
Df Residuals:                   24997   BIC:                         6.140e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.6597      0.008    450.330      0.000       3.644       3.676
distance      -0.0135      0.001    -13.922      0.000      -0.015      -0.012
==============================================================================
Omnibus:                     1452.109   Durbin-Watson:                   2.011
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1721.998
Skew:                          -0.643   Prob(JB):                         0.00
Kurtosis:                       2.995   Cond. No.                         13.1
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Again we see a p-value that rounds to 0, so we are fairly confident that the coefficient is nonzero. However, the coefficient for distance is -0.0135, which is very close to 0. So even though we are confident in the estimate, distance barely affects the Yelp rating of a restaurant (when ratings move in steps of 0.5, a change of 0.0135 per kilometer doesn't really matter).
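As a rough worked example of how small these effects are, the sketch below multiplies each distance coefficient from the three regression summaries by the 5 km "close to a university" cutoff and compares the result with the 0.5-star rating step. The variable names are illustrative only.

```python
# Distance coefficients copied from the three OLS summaries above.
distance_coefs = {
    "rating ~ distance": -0.0078,
    "avg_rating ~ distance": -0.0098,
    "stars ~ distance": -0.0135,
}

THRESHOLD_KM = 5.0  # cutoff used in this project for "close to a university"
STAR_STEP = 0.5     # rating step size mentioned in the text

# Predicted change in rating when moving THRESHOLD_KM farther from campus.
changes = {name: coef * THRESHOLD_KM for name, coef in distance_coefs.items()}

for name, change in changes.items():
    share = abs(change) / STAR_STEP
    print(f"{name}: {change:+.4f} stars over {THRESHOLD_KM:g} km "
          f"({share:.0%} of one rating step)")
```

Under these numbers, even the largest coefficient moves the predicted rating by well under one rating step across the full 5 km.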
Let's check these coefficients per university to see whether the result differs for specific universities.
# do linear regression per city, since we only picked one university/college per city
for city in review_business_lg['city'].unique():
    out, pred = patsy.dmatrices('avg_rating ~ distance', review_business_lg[review_business_lg['city']==city])
    mod = sm.OLS(out, pred)
    res = mod.fit()
    print(city, 'distance coef, pvalue:', res.params[1], res.pvalues[1])
Tucson distance coef, pvalue: -0.02348014710517908 1.973349041034627e-29
Philadelphia distance coef, pvalue: 0.002771896479638284 0.060048907543371485
New Orleans distance coef, pvalue: -0.019801151691254182 4.4198834660608146e-13
Santa Barbara distance coef, pvalue: -0.01908444339580161 6.537725936687052e-10
Indianapolis distance coef, pvalue: -0.028078788478805405 2.5607005144326463e-33
Tampa distance coef, pvalue: 0.000914553749047525 0.5695756118487593
Nashville distance coef, pvalue: -0.03489970549495028 8.052083656934554e-101
Reno distance coef, pvalue: -0.06509373125880094 5.080044002626258e-59
Edmonton distance coef, pvalue: -0.044720669841808114 2.2169385782843615e-14
St. Louis distance coef, pvalue: 0.05239540816671164 0.0006505199988430329
for city in business_lg['city'].unique():
    out, pred = patsy.dmatrices('stars ~ distance', business_lg[business_lg['city']==city])
    mod = sm.OLS(out, pred)
    res = mod.fit()
    print(city, 'distance coef, pvalue:', res.params[1], res.pvalues[1])
Philadelphia distance coef, pvalue: -0.021001644673470512 6.938766368582747e-23
Nashville distance coef, pvalue: -0.02282392993556582 6.193023588559855e-12
Indianapolis distance coef, pvalue: -0.027594791297863783 7.873703222892081e-21
Edmonton distance coef, pvalue: -0.0226297324370053 1.2314762213914875e-08
Reno distance coef, pvalue: -0.03741402462439515 4.5181943588653566e-08
Tucson distance coef, pvalue: -0.017036831481580986 1.9837803869667264e-08
Tampa distance coef, pvalue: -0.005786451965124915 0.03325085579968091
Santa Barbara distance coef, pvalue: 0.027875758165070933 3.2117868091776005e-05
New Orleans distance coef, pvalue: -0.02132569106730503 3.8296498449916786e-05
St. Louis distance coef, pvalue: 0.013749277877485347 0.1674120455876262
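Even when separated by university/college, we see mostly the same results. Most coefficients are negative, a few cities (such as St. Louis) show a positive sign, and in every city the coefficient for distance is still extremely small (at most about 0.065 stars per kilometer in magnitude), so distance doesn't have much of an effect on rating.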
Student vs Non-student Reviews¶
Ultimately, we want to determine how students from these campuses impact these restaurants, i.e. do they increase or decrease a restaurant's Yelp rating? Let's start by checking the reviews of the restaurants that are close to campus. How many of these reviews are from students and how many are from non-students?
close_reviews=review_businesses[review_businesses['close_to_university'] == True]
is_student=close_reviews[close_reviews['student_or_not']==True].shape[0]
print(is_student,'of the reviews of restaurants that are close to campus are from students')
not_student=close_reviews[close_reviews['student_or_not']==False].shape[0]
print(not_student, 'of the reviews of restaurants that are close to campus are not from students')
14069 of the reviews of restaurants that are close to campus are from students
16179 of the reviews of restaurants that are close to campus are not from students
What is the average Yelp rating among the reviews from students? What is the average Yelp rating among the reviews from non students?
avg_rating_student=close_reviews[close_reviews['student_or_not']==True]['avg_rating'].mean()
print(avg_rating_student,'is the average Yelp rating among the reviews from students')
avg_rating_nonstudent=close_reviews[close_reviews['student_or_not']==False]['avg_rating'].mean()
print(avg_rating_nonstudent,' is the average Yelp rating among the reviews from non students')
3.829945269741986 is the average Yelp rating among the reviews from students
3.861827059768836 is the average Yelp rating among the reviews from non students
According to the previous calculation, the average rating among reviews from non-students is slightly higher than the average rating among reviews from students (the difference is about 0.03 stars). To compare the mean ratings of student and non-student reviews further, we decided to perform a permutation test to assess the statistical significance of the difference. The test statistic in this permutation test is the difference in average ratings between non-students and students in our dataset.
test_statistic = review_businesses[review_businesses['student_or_not'] == False]['avg_rating'].mean() - review_businesses[review_businesses['student_or_not'] == True]['avg_rating'].mean()
test_statistic
0.02512900573164778
By comparing the original test statistic to the distribution of the test statistics from the permuted samples, we can determine how extreme the original difference is.
def permutation_tests():
    diff_array = list()
    shuffled_df = review_businesses.copy()
    for i in range(1000):
        shuffled_df['student_or_not'] = np.random.permutation(shuffled_df['student_or_not'])
        student = shuffled_df[shuffled_df['student_or_not'] == True]['avg_rating'].mean()
        nonstudent = shuffled_df[shuffled_df['student_or_not'] == False]['avg_rating'].mean()
        diff_array.append(nonstudent - student)
    return np.array(diff_array)
results = permutation_tests()
plt.hist(results)
plt.xlabel('Difference in Means')
plt.ylabel('Count')
results.mean()
np.mean(test_statistic <= results)
0.0
None of the 1,000 permuted differences was as large as the observed one, so the empirical p-value is 0.0 (that is, below 0.001 at this number of permutations), which is well below the 0.05 and 0.01 significance levels. Therefore, we can conclude that the difference in means between average ratings from student and non-student reviews is statistically significant, meaning that the observed difference is unlikely to have occurred by random chance. However, a difference can be statistically significant and still be unimportant in practical terms, and here the observed difference between the two groups is very small (almost negligible).
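An empirical p-value of exactly 0 is an artifact of the finite number of permutations. One common convention, sketched below, counts the observed statistic as one of the permutations, so the smallest reportable value with 1,000 permutations is 1/1001. The helper name `permutation_p_value` and the synthetic null distribution are assumptions for illustration; in the notebook, the `results` array and `test_statistic` would be passed in instead.

```python
import numpy as np

def permutation_p_value(observed, permuted_diffs):
    """One-sided p-value with the +1 correction: (k + 1) / (n + 1),
    where k is the number of permuted statistics >= the observed one."""
    permuted_diffs = np.asarray(permuted_diffs, dtype=float)
    n_extreme = int(np.sum(permuted_diffs >= observed))
    return (n_extreme + 1) / (len(permuted_diffs) + 1)

# Hypothetical null distribution standing in for the notebook's `results`.
rng = np.random.default_rng(42)
null_diffs = rng.normal(loc=0.0, scale=0.004, size=1000)

observed_diff = 0.0251  # approximately the observed test statistic above
p_value = permutation_p_value(observed_diff, null_diffs)
print(p_value)
```

With this convention the reported value never reaches 0, which avoids overstating the evidence when no permuted statistic exceeds the observed one.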
We can again split the analysis of reviews up by university to see if some universities have different results from others.
city_to_university = universities['University Name'].to_dict()
review_businesses['university_name'] = review_businesses['city'].map(city_to_university)
review_businesses
| city | restaurant_name | rating | student_or_not | avg_rating | review_count | close_to_university | university_name |
---|---|---|---|---|---|---|---|---|
0 | Tucson | Kettle Restaurant | 3 | True | 3.5 | 47 | True | University of Arizona Tuscon |
1 | Tucson | Kettle Restaurant | 2 | True | 3.5 | 47 | True | University of Arizona Tuscon |
2 | Tucson | Kettle Restaurant | 5 | False | 3.5 | 47 | True | University of Arizona Tuscon |
3 | Tucson | Kettle Restaurant | 5 | False | 3.5 | 47 | True | University of Arizona Tuscon |
4 | Tucson | Kettle Restaurant | 3 | False | 3.5 | 47 | True | University of Arizona Tuscon |
... | ... | ... | ... | ... | ... | ... | ... | ... |
44313 | Edmonton | Dairy Queen Grill & Chill | 1 | True | 2.0 | 6 | True | University of Alberta |
44314 | Tampa | Grand China | 5 | False | 3.5 | 19 | False | University of South Florida |
44315 | Philadelphia | Dough Boy Pizza | 5 | False | 4.5 | 11 | False | University of Pennsylvania |
44316 | Tucson | Burger King | 3 | False | 1.5 | 21 | True | University of Arizona Tuscon |
44317 | Edmonton | Versato's Pizza | 5 | False | 4.5 | 24 | False | University of Alberta |
44318 rows × 8 columns
university_city_groups = review_businesses.groupby(['university_name', 'city'])
university_city_averages = []
for (university, city), group in university_city_groups:
    avg_rating_students = group[group['student_or_not'] == True]['rating'].mean()
    avg_rating_nonstudents = group[group['student_or_not'] == False]['rating'].mean()
    university_city_averages.append({
        'University Name': university,
        'City': city,
        'Average Rating from Students': avg_rating_students,
        'Average Rating from Non-Students': avg_rating_nonstudents
    })
summary_table = pd.DataFrame(university_city_averages)
summary_table
| University Name | City | Average Rating from Students | Average Rating from Non-Students |
---|---|---|---|---|
0 | Purdue University | Indianapolis | 3.831654 | 3.903778 |
1 | Tulane University | New Orleans | 3.855110 | 4.061557 |
2 | UC Santa Barbara | Santa Barbara | 3.801944 | 4.063650 |
3 | University of Alberta | Edmonton | 3.608130 | 3.672917 |
4 | University of Arizona Tuscon | Tucson | 3.844771 | 3.876812 |
5 | University of Nevada | Reno | 3.737705 | 3.984840 |
6 | University of Pennsylvania | Philadelphia | 3.751565 | 3.905575 |
7 | University of South Florida | Tampa | 3.674469 | 3.886010 |
8 | Vanderbilt University | Nashville | 3.602155 | 3.892362 |
9 | Washington University in St. Louis | St. Louis | 3.705263 | 3.635897 |
Looking at the overall average ratings from students and non-students, we can observe that in 9 out of the 10 cities, the average rating submitted by students was lower than the average rating submitted by non-students (St. Louis is the exception). We also want to emphasize that the difference is not large: the gaps range from about 0.03 to 0.29 stars. This shows that students may have an effect, but the effect isn't big.
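The per-city gaps can be recomputed directly from the summary table. The sketch below copies the table's values into a plain dictionary (the variable names are illustrative) and reports the non-student minus student gap for each city.

```python
# (student average, non-student average) per city, copied from the summary table.
city_averages = {
    "Indianapolis": (3.831654, 3.903778),
    "New Orleans": (3.855110, 4.061557),
    "Santa Barbara": (3.801944, 4.063650),
    "Edmonton": (3.608130, 3.672917),
    "Tucson": (3.844771, 3.876812),
    "Reno": (3.737705, 3.984840),
    "Philadelphia": (3.751565, 3.905575),
    "Tampa": (3.674469, 3.886010),
    "Nashville": (3.602155, 3.892362),
    "St. Louis": (3.705263, 3.635897),
}

# Positive gap means students rated lower than non-students.
gaps = {city: non - stu for city, (stu, non) in city_averages.items()}

n_students_lower = sum(gap > 0 for gap in gaps.values())
largest_gap_city = max(gaps, key=gaps.get)

for city, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{city:15s} {gap:+.3f}")
print(n_students_lower, "of", len(gaps), "cities have lower student averages")
```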
Ethics & Privacy¶
We need inclusive data representation: the dataset should adequately represent different types of restaurants and should not exclude any significant group. We must ensure that the data sources we use comply with ethical standards and privacy laws. The data should be publicly available and should not include any personal information about the individuals who submitted the ratings. We must also review the data and results to monitor for biases.
The data on Yelp and Google Maps is self-uploaded, so some restaurants may not appear. Because the listings are self-uploaded, we believe the restaurants would be open to having other people see them. After a restaurant is uploaded, its listing cannot be taken down unless the restaurant has closed. Since closed restaurants do not show up, the data could be biased. For instance, restaurants that have closed could potentially have had lower ratings, but these lower ratings are no longer part of the data. Another bias could come from online reviews in general. Restaurants could encourage positive reviews by offering a discount or a free dessert to customers. Another concern is that people who have a negative experience might feel frustrated and post negative reviews, while people who have a great experience have no problems to report and so don’t feel the need to write a review. As a result, negative reviews may be overrepresented.
We must develop a methodology that addresses these issues, such as deciding how we are going to include each level of budget, restaurant type, etc. We have to be sure to exclude personal identifiers (such as names) for those who submitted reviews. We will include restaurants in the main price categories and of all cuisines to avoid bias.
Discussion and Conclusion¶
Our data analysis has given us insight into the relationship between a restaurant’s distance from university/college campuses in different cities and its Yelp ratings, as well as the influence of student reviewers on these ratings. After wrangling the raw dataset, we consolidated it into a dataframe containing the columns we needed, such as whether the restaurant is close to campus, whether the reviewer is a student, and the Yelp ratings of these restaurants. Our team then conducted a series of analyses on this dataframe using statistical techniques including permutation tests, t-tests, and linear regression. Based on these results, we conclude that the distance of restaurants from campuses has a very weak correlation with their Yelp ratings, suggesting that distance from campus is not necessarily important in how customers perceive or rate these restaurants. In addition, the analysis indicates that ratings submitted by students tend to lower the average ratings of these restaurants. This is a general trend across the 10 universities in our analysis, but the influence is too small to matter in practical terms.
However, we believe there are several limitations in the scope and outcomes of our analysis. One of the primary constraints was the selection of only 10 cities and universities. This choice was dictated by time constraints and the fact that only 49% of the top 29 cities in our dataset met our research criteria. Consequently, this limitation raises questions about the generalizability of our findings to other cities and university contexts worldwide. Additionally, the method we used to distinguish between student and non-student reviews faced challenges, with a notable presence of false positives and false negatives. This issue potentially affected the accuracy of our interpretations. Expanding the dataset to include a more diverse range of cities and improving the accuracy of identifying student reviews would benefit future research. These enhancements would not only address the limitations of our current research but also provide a more comprehensive understanding of how location and customer demographics influence restaurant ratings on platforms like Yelp.
We believe our work contributes to a broader understanding of what influences students’ patronage of restaurants. While previous studies have focused on factors like food quality, service, decor, and price, our research adds a new dimension by exploring the spatial aspect of dining preferences. Future research can build on our work to investigate other factors that might influence consumer behavior and to provide inspiration for restaurant owners on how to attract more customers.
Team Contributions¶
- Carmen Truong - Finding dataset, Reviews dataset wrangling, Close/Far average business rating eda, conclusion of video
- Zifan Luo - Finding dataset, Reviews dataset wrangling, Student vs Non-student reviews eda, conclusion
- Rabih Siddiqui - Compiled universities dataset, Student vs Non-student reviews eda, editing video, ethics and privacy
- Lewis Weng - Businesses dataset wrangling, close/far average business rating eda, ethics and privacy, abstract
- Jacob Lin - Businesses dataset wrangling, businesses linear regression, intro of video, compiling work and uploading notebooks