The Analysis of Restaurant Ratings and Their Locations¶

Abstract¶

As students, we often find ourselves deciding where to dine out. When looking for new restaurants, one of the first things we do is peruse yelp ratings/reviews to see nearby restaurants and others' dining experiences at those restaurants. Knowing this, we were curious as to whether businesses in proximity to universities have better/worse ratings and whether whether students patrons rated restaurants lower/higher relative to other customers.

To focus our study on venues frequented by students, we first filtered yelp's dataset to include businesses under the "Restaurant" and "Coffee & Tea". Then, we selected several universities in the US and applied haversine's formula to calculate the distance between each university and the restaurants within the same city. For our project, we defined "close to a university" as under 5 kilometers, with anything above being defined as "not close to a university." This distance was chosen due to its accessibility via public transportation and walking, aligning with typical student mobility. Our analysis revealed a trend: restaurants proximate to universities tend to have marginally higher ratings than those farther away. While the discrepancy is subtle, it's discernible and statistically significant, suggesting a non-random association.

Transitioning to the next section, we wanted to answer the question: "Does distance from university affect a restaurants's rating?" Perhaps restaurants that are near universities have better food beacuse the lease is more expensive and they need more customers to sustain their business. For this question, we answer it using linear regression. We found that while distance from university affects rating, the effect is basically negligible.

Finally, we wanted to see whether students tend to give better/worse ratings than other customers, so we used keywords to classify whether a review was written by a student or not. For this question, we also answer it using permutation tests to see whether any differences are caused by chance. Our conclusion was that students give lower ratings and the differences between students and non-students are not caused by chance.

Research Question¶

Do restaurants further from college/university campuses have higher or lower ratings than those that are closer? Can we determine if the reviewers of these restaurants are students and if so, do students increase or lower Yelp ratings at these restaurants?

Background and Prior Work¶

Colleges and universities host many students every year and as the number of students grow business around these places also grow, some even forming college towns where businesses’ livelihoods depend on students like at UC Davis. As students of UCSD, we were curious as to specifically restaurants in the vicinity of a college or university. As a restaurant’s success or profits are hard to measure and obtain, we settled on trying to see if restaurants’ ratings are affected by their vicinity to the university in their city.

One article published by QSR Magazine seems to indicate so, citing data from “College & University Keynote Report” from Datassential saying that 58% of students eat off campus and 49% of students consider themselves foodies and are more conscious about what they eat.1 This means that more than half of the student body is regularly eating at restaurants around their campus and reviewing and recommending those restaurants. Another study that more closely looks at customer satisfaction with food service, points out several important factors that can contribute to a restaurant’s rating such as their food quality, service quality, decor quality, and most importantly price.2 The conclusions they found is that good service and then good food were the best indicators of a high customer satisfaction. These point out significant attributes that aren’t considered in our research question, and possibly could be much greater of a factor of a restaurant’s rating than distance.

  1. Baltazar, Amanda. “Restaurants Would Be Wise to Court College Students.” QSR Magazine, 7 July 2023, https://www.qsrmagazine.com/operations/business-advice/restaurants-would-be-wise-court-college-students/
  2. Serhan, Mireille, and Carole Serhan. “The Impact of Food Service Attributes on Customer Satisfaction in a Rural University Campus Environment.” International Journal of Food Science, Hindawi, 31 Dec. 2019, https://www.hindawi.com/journals/ijfs/2019/2154548/

Hypothesis¶

We believe that restaurants further from the university/college campuses will have similar ratings as those closer to university/college campuses. While an influx of students in the area can have an impact on nearby restaurants, we believe that this impact will be minimal and negligible. This is because students don’t make up a majority of a restaurants’ clientele, especially in more metropolitan areas. There are many other customers that either live in the area or are traveling that can give ratings to restaurants.

Data¶

  • Dataset Name: Yelp Academic Dataset
  • Link to the dataset: https://www.yelp.com/dataset
  • Number of observations: 6990280 reviews, 150346 businesses
  • Number of variables: 9, 14

This dataset contains information on a selection of businesses, reviews, and user data centered around different metropolitan areas from the app Yelp. It is separated into 5 different json files, businesses, reviews, checkin, tip, user. We only particularly care about the businesses and reviews. For businesses, the variables we care about are the business_id, city, longitude, latitude, stars, review_count. For reviews, the variables we care about are the business_id, stars and text.

Yelp Academic Dataset¶

In [1]:
# import necessary libraries

import numpy as np
import pandas as pd

First let's load the business data into a dataframe.

In [2]:
# load business data
business = pd.read_json('https://drive.usercontent.google.com/download?id=1HGtRB3g1Hx1t1j2vPqCdTEfG-WJtTFVN&confirm=xxx', lines=True)

Let's drop all the observations with missing values in the important columns

In [3]:
business = business.dropna(subset=['latitude', 'longitude', 'stars', 'review_count', 'categories']) # this changes nothing though
In [4]:
business.head()
Out[4]:
business_id name address city state postal_code latitude longitude stars review_count is_open attributes categories hours
0 Pns2l4eNsfO8kk83dixA6A Abby Rappoport, LAC, CMQ 1616 Chapala St, Ste 2 Santa Barbara CA 93101 34.426679 -119.711197 5.0 7 0 {'ByAppointmentOnly': 'True'} Doctors, Traditional Chinese Medicine, Naturop... None
1 mpf3x-BjTdTEA3yCZrAYPw The UPS Store 87 Grasso Plaza Shopping Center Affton MO 63123 38.551126 -90.335695 3.0 15 1 {'BusinessAcceptsCreditCards': 'True'} Shipping Centers, Local Services, Notaries, Ma... {'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ...
2 tUFrWirKiKi_TAnsVWINQQ Target 5255 E Broadway Blvd Tucson AZ 85711 32.223236 -110.880452 3.5 22 0 {'BikeParking': 'True', 'BusinessAcceptsCredit... Department Stores, Shopping, Fashion, Home & G... {'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ...
3 MTSW4McQd7CbVtyjqoe9mw St Honore Pastries 935 Race St Philadelphia PA 19107 39.955505 -75.155564 4.0 80 1 {'RestaurantsDelivery': 'False', 'OutdoorSeati... Restaurants, Food, Bubble Tea, Coffee & Tea, B... {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ...
4 mWMc6_wTdE0EUBKIGXDVfA Perkiomen Valley Brewery 101 Walnut St Green Lane PA 18054 40.338183 -75.471659 4.5 13 1 {'BusinessAcceptsCreditCards': 'True', 'Wheelc... Brewpubs, Breweries, Food {'Wednesday': '14:0-22:0', 'Thursday': '16:0-2...

Restaurants and cafes are the businesses that we care about, so let's filter our business dataframe by category.

In [5]:
def identify_restaurants(data, keywords):
    keywords = [keyword.lower() for keyword in keywords]
    def check_categories(category):
        return any(keyword in category.lower() for keyword in keywords)
    return data[data['categories'].apply(check_categories)]
In [6]:
keywords = ['Restaurants', 'Coffee & Tea']
business = identify_restaurants(business, keywords)
business
Out[6]:
business_id name address city state postal_code latitude longitude stars review_count is_open attributes categories hours
3 MTSW4McQd7CbVtyjqoe9mw St Honore Pastries 935 Race St Philadelphia PA 19107 39.955505 -75.155564 4.0 80 1 {'RestaurantsDelivery': 'False', 'OutdoorSeati... Restaurants, Food, Bubble Tea, Coffee & Tea, B... {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ...
5 CF33F8-E6oudUQ46HnavjQ Sonic Drive-In 615 S Main St Ashland City TN 37015 36.269593 -87.058943 2.0 6 1 {'BusinessParking': 'None', 'BusinessAcceptsCr... Burgers, Fast Food, Sandwiches, Food, Ice Crea... {'Monday': '0:0-0:0', 'Tuesday': '6:0-22:0', '...
8 k0hlBqXX-Bt0vf1op7Jr1w Tsevi's Pub And Grill 8025 Mackenzie Rd Affton MO 63123 38.565165 -90.321087 3.0 19 0 {'Caters': 'True', 'Alcohol': 'u'full_bar'', '... Pubs, Restaurants, Italian, Bars, American (Tr... None
9 bBDDEgkFA1Otx9Lfe7BZUQ Sonic Drive-In 2312 Dickerson Pike Nashville TN 37207 36.208102 -86.768170 1.5 10 1 {'RestaurantsAttire': ''casual'', 'Restaurants... Ice Cream & Frozen Yogurt, Fast Food, Burgers,... {'Monday': '0:0-0:0', 'Tuesday': '6:0-21:0', '...
11 eEOYSgkmpB90uNA7lDOMRA Vietnamese Food Truck Tampa Bay FL 33602 27.955269 -82.456320 4.0 10 1 {'Alcohol': ''none'', 'OutdoorSeating': 'None'... Vietnamese, Food, Restaurants, Food Trucks {'Monday': '11:0-14:0', 'Tuesday': '11:0-14:0'...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
150327 cM6V90ExQD6KMSU3rRB5ZA Dutch Bros Coffee 1181 N Milwaukee St Boise ID 83704 43.615401 -116.284689 4.0 33 1 {'WiFi': ''free'', 'RestaurantsGoodForGroups':... Cafes, Juice Bars & Smoothies, Coffee & Tea, R... {'Monday': '0:0-0:0', 'Tuesday': '0:0-17:0', '...
150328 1jx1sfgjgVg0nM6n3p0xWA Savaya Coffee Market 11177 N Oracle Rd Oro Valley AZ 85737 32.409552 -110.943073 4.5 41 1 {'BusinessParking': '{'garage': False, 'street... Specialty Food, Food, Coffee & Tea, Coffee Roa... {'Monday': '0:0-0:0', 'Tuesday': '6:0-14:0', '...
150336 WnT9NIzQgLlILjPT0kEcsQ Adelita Taqueria & Restaurant 1108 S 9th St Philadelphia PA 19147 39.935982 -75.158665 4.5 35 1 {'WheelchairAccessible': 'False', 'Restaurants... Restaurants, Mexican {'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'...
150339 2O2K6SXPWv56amqxCECd4w The Plum Pit 4405 Pennell Rd Aston DE 19014 39.856185 -75.427725 4.5 14 1 {'RestaurantsDelivery': 'False', 'BusinessAcce... Restaurants, Comfort Food, Food, Food Trucks, ... {'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W...
150340 hn9Toz3s-Ei3uZPt7esExA West Side Kebab House 2470 Guardian Road NW Edmonton AB T5T 1K8 53.509649 -113.675999 4.5 18 0 {'Ambience': '{'touristy': False, 'hipster': F... Middle Eastern, Restaurants {'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'...

54918 rows × 14 columns

In [7]:
business['city'].value_counts()[:30]
Out[7]:
city
Philadelphia        6171
Tampa               3119
Indianapolis        3004
Tucson              2623
Nashville           2612
New Orleans         2369
Edmonton            2321
Saint Louis         1836
Reno                1396
Boise                912
Santa Barbara        812
Clearwater           712
Wilmington           638
St. Louis            573
Metairie             543
Saint Petersburg     523
Franklin             463
St. Petersburg       427
Sparks               363
Brandon              342
Meridian             338
Largo                327
Carmel               314
Cherry Hill          313
West Chester         292
New Port Richey      238
Kenner               236
Goleta               232
Greenwood            221
Fishers              218
Name: count, dtype: int64

From above we can see that the dataset has data for restaurants from many different cities. We will pick a few and find universities/colleges within to base our research around. Our initial selection was based on the top cities (with most reviews) in our dataset. However, upon further analysis, we identified that 51% of the cities in the top 29 in our dataset didn't have universities that fit our research criteria. We then refined our list to only include cities with universities in the city limits.

In [8]:
universities = pd.read_csv('./yelp_dataset/universities.csv')
In [9]:
universities = universities.set_index('City')
In [10]:
universities
Out[10]:
Country State University Name Latitude Longitude
City
Philadelphia USA Pennsylvania University of Pennsylvania 39.952583 -75.191975
Tucson USA Arizona University of Arizona Tuscon 32.221664 -110.948922
Tampa USA Florida University of South Florida 28.051836 -82.400005
Indianapolis USA Indiana Purdue University 39.776709 -86.170811
Nashville USA Tennessee Vanderbilt University 36.145532 -86.804060
New Orleans USA Louisiana Tulane University 29.958586 -90.064997
Reno USA Nevada University of Nevada 39.534642 -119.812831
Edmonton Canada Alberta University of Alberta 53.523219 -113.523219
St. Louis USA Missouri Washington University in St. Louis 38.628900 -90.307200
Santa Barbara USA California UC Santa Barbara 34.413953 -119.848956

Let's also filter our business dataframe to only include businesses inside these cities.

In [11]:
business = business[business['city'].isin(universities.index)]
business['city'].value_counts()
Out[11]:
city
Philadelphia     6171
Tampa            3119
Indianapolis     3004
Tucson           2623
Nashville        2612
New Orleans      2369
Edmonton         2321
Reno             1396
Santa Barbara     812
St. Louis         573
Name: count, dtype: int64

We can then use this information on universities to calculate whether a business is close or far from a university using their latitude and longitude positions.

In [12]:
# this is a lat long distance calculator from https://community.esri.com/t5/coordinate-reference-systems-blog/distance-on-a-sphere-the-haversine-formula/ba-p/902128#:~:text=All%20of%20these%20can%20be,longitude%20of%20the%20two%20points

def haversine(coord1, coord2):
    import math

    # Coordinates in decimal degrees (e.g. 2.89078, 12.79797)
    lon1, lat1 = coord1
    lon2, lat2 = coord2

    R = 6371000  # radius of Earth in meters
    phi_1 = math.radians(lat1)
    phi_2 = math.radians(lat2)

    delta_phi = math.radians(lat2 - lat1)
    delta_lambda = math.radians(lon2 - lon1)

    a = math.sin(delta_phi / 2.0)**2 + math.cos(phi_1) * math.cos(phi_2) * math.sin(delta_lambda / 2.0)**2

    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

    meters = R * c  # output distance in meters
    km = meters / 1000.0  # output distance in kilometers

    meters = round(meters, 3)
    km = round(km, 3)


#     print(f"Distance: {meters} m")
#     print(f"Distance: {km} km‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍")
    return km
In [13]:
def calc_distance(df):
    new_df = df.copy()
    lat, long = universities.loc[df['city'], ['Latitude', 'Longitude']]
    lat1, long1 = df[['latitude', 'longitude']]
    # threshold 5 km
    new_df['close_to_university'] = haversine((lat, long), (lat1, long1)) < 5
    return new_df

We decided to use a threshold value of 10 km. A business would be considered close to an university/college if it is within 10 km of one and far otherwise.

In [14]:
business = business.apply(calc_distance, axis=1)

Now let's create a dataframe for our reviews. There are simply too many reviews (6990280!) and the kernel cannot handle loading that many observations into memory, thus we have cut down the file using python to just the first 100000 reviews.

In [15]:
review = pd.read_json('https://drive.usercontent.google.com/download?id=1xE5dbDWd1Mp8kFQwoMmtLVj5xPq9tpuG&confirm=xxx', lines=True)
In [16]:
review.head()
Out[16]:
review_id user_id business_id stars useful funny cool text date
0 KU_O5udG6zpxOg-VcAEodg mh_-eMZ6K5RLWhZyISBhwA XQfwVwDr-v0ZS3_CbbE5Xw 3 0 0 0 If you decide to eat here, just be aware it is... 2018-07-07 22:09:11
1 BiTunyQ73aT9WBnpR9DZGw OyoGAe7OKpv6SyGZT5g77Q 7ATYjTIgM3jUlt4UM3IypQ 5 1 0 1 I've taken a lot of spin classes over the year... 2012-01-03 15:28:18
2 saUsX_uimxRlCVr67Z4Jig 8g_iMtfSiwikVnbP2etR0A YjUWPpI6HXG530lwP-fb2A 3 0 0 0 Family diner. Had the buffet. Eclectic assortm... 2014-02-05 20:30:30
3 AqPFMleE6RsU23_auESxiA _7bHUi9Uuf5__HHc_Q8guQ kxX2SOes4o-D3ZQBkiMRfA 5 1 0 1 Wow! Yummy, different, delicious. Our favo... 2015-01-04 00:01:03
4 Sx8TMOWLNuJBWer-0pcmoA bcjbaE6dDog4jkNY91ncLQ e4Vwtrqf-wpJfwesgvdgxQ 4 1 0 1 Cute interior and owner (?) gave us tour of up... 2017-01-14 20:54:15

Again we drop observations with missing values in the important variables.

In [17]:
review.dropna(subset=['stars', 'text'])
Out[17]:
review_id user_id business_id stars useful funny cool text date
0 KU_O5udG6zpxOg-VcAEodg mh_-eMZ6K5RLWhZyISBhwA XQfwVwDr-v0ZS3_CbbE5Xw 3 0 0 0 If you decide to eat here, just be aware it is... 2018-07-07 22:09:11
1 BiTunyQ73aT9WBnpR9DZGw OyoGAe7OKpv6SyGZT5g77Q 7ATYjTIgM3jUlt4UM3IypQ 5 1 0 1 I've taken a lot of spin classes over the year... 2012-01-03 15:28:18
2 saUsX_uimxRlCVr67Z4Jig 8g_iMtfSiwikVnbP2etR0A YjUWPpI6HXG530lwP-fb2A 3 0 0 0 Family diner. Had the buffet. Eclectic assortm... 2014-02-05 20:30:30
3 AqPFMleE6RsU23_auESxiA _7bHUi9Uuf5__HHc_Q8guQ kxX2SOes4o-D3ZQBkiMRfA 5 1 0 1 Wow! Yummy, different, delicious. Our favo... 2015-01-04 00:01:03
4 Sx8TMOWLNuJBWer-0pcmoA bcjbaE6dDog4jkNY91ncLQ e4Vwtrqf-wpJfwesgvdgxQ 4 1 0 1 Cute interior and owner (?) gave us tour of up... 2017-01-14 20:54:15
... ... ... ... ... ... ... ... ... ...
99995 pAEbIxvr6ebx2bHc1XvguA SMH5CeiLvKx61lKwtLZ_PA lV0k3BnslFRkuWD_kbKd0Q 4 0 0 0 Came here for lunch with a group. They were bu... 2018-05-30 22:28:56
99996 xH1AoE-4nf2ECGQJRjO4_g 2clTdtp-BjphxLjN83CpUA G0xz3kyRhRi6oZl7KfR0pA 1 1 0 0 The equipment is so old and so felty! I just u... 2015-04-05 23:31:52
99997 GatIbXTz-WDru5emONUSIg MRrN6DH3QGCFcDv5RENYVg C4lZdhasjZVQyDlOiXY1sA 4 0 0 0 This is one of my favorite Mexican restaurants... 2016-06-04 00:59:15
99998 6NfkodAdhvI89xONXuBC3A rnNQzeKJbvqVCsYsL10mkQ dChRGpit9fM_kZK5pafNyA 2 0 0 0 Came here for brunch - had an omlette ($19 + t... 2018-06-11 12:45:08
99999 sJ1BMq7lkKgOWEFx3n6ZRw _BcWyKQL16ndpBdggh2kNA hMcgO98QaOFmQVTfCUeGzw 5 0 0 0 Came in for my 5-6 month prophy and saw Kara -... 2013-06-06 10:10:33

100000 rows × 9 columns

To figure out which reviews are written by college students, we queried reviews that have any mention of words related to them. To do this, we chose keywords such as student, students, college, colleges, university, universities, and uni. These keywords were chosen in the hopes that the review was written by a person who is mentioning their status as a student to be used as their way of reliability to other reviewees.

We also want to disclose concerns of using such a method. By querying reviews written by students this way, we could be counting false positives and missing false negatives. In this case, a false positive is when a review incorrectly classifies a non-student's review as a student, while a false negative is when a review incorrectly classifies a student's review as a non-student. Here are a few examples:

  1. False Positive: A non-student brought up a "college" nearby.
  2. False Positive: The "students" mentioned in a review could be students in highschool, not college.
  3. False Positive: A non-student talks about how a restaurant is often visited by many "students"
  4. False Negative: The review could have been written by a student, but the person did not mention that they were a student in their review.
In [18]:
def identify_student_reviews(data, keywords):

    keywords = [keyword.lower() for keyword in keywords]


    def check_keywords(review):

        return any(keyword in review.lower() for keyword in keywords)

    data['student_or_not'] = data['text'].apply(check_keywords)
    return data
In [19]:
keywords = ['student', 'students', 'college', 'colleges', 'university', 'universities', 'uni', "univ", "penn", "upenn", "ua", "uarizona", "usf", "purdue", "vanderbilt", "vandy", "vu", "unr", "u of a", "ualberta", "washu", "wustl", "ucsb", "uc"]
review_student = identify_student_reviews(review, keywords)

We have included some common abbreviations of the universities/colleges we are focusing our analysis around. Below are the translations:

  • University of Pennsylvania
    • Penn
    • UPenn
  • University of Arizona Tuscon
    • UA
    • UArizona
  • University of South Florida
    • USF
  • Purdue University
    • Purdue
  • Vanderbilt University
    • Vanderbilt
    • Vandy
    • VU
  • Tulane University
    • We could not find any informal names for Tulane University.
  • University of Nevada
    • UNR
  • University of Alberta
    • U of A
    • UAlberta
  • Washington University in St. Louis
    • WashU
    • WUSTL
  • UC Santa Barbara
    • UCSB
In [20]:
review_student.head()
Out[20]:
review_id user_id business_id stars useful funny cool text date student_or_not
0 KU_O5udG6zpxOg-VcAEodg mh_-eMZ6K5RLWhZyISBhwA XQfwVwDr-v0ZS3_CbbE5Xw 3 0 0 0 If you decide to eat here, just be aware it is... 2018-07-07 22:09:11 True
1 BiTunyQ73aT9WBnpR9DZGw OyoGAe7OKpv6SyGZT5g77Q 7ATYjTIgM3jUlt4UM3IypQ 5 1 0 1 I've taken a lot of spin classes over the year... 2012-01-03 15:28:18 True
2 saUsX_uimxRlCVr67Z4Jig 8g_iMtfSiwikVnbP2etR0A YjUWPpI6HXG530lwP-fb2A 3 0 0 0 Family diner. Had the buffet. Eclectic assortm... 2014-02-05 20:30:30 True
3 AqPFMleE6RsU23_auESxiA _7bHUi9Uuf5__HHc_Q8guQ kxX2SOes4o-D3ZQBkiMRfA 5 1 0 1 Wow! Yummy, different, delicious. Our favo... 2015-01-04 00:01:03 False
4 Sx8TMOWLNuJBWer-0pcmoA bcjbaE6dDog4jkNY91ncLQ e4Vwtrqf-wpJfwesgvdgxQ 4 1 0 1 Cute interior and owner (?) gave us tour of up... 2017-01-14 20:54:15 True

We can then inner join these two dataframe's on business_id to get a dataframe where the observations are reviews with information on if the review was written by a student and if the business the review was written for is close or far from a university/college.

In [21]:
review_businesses = pd.merge(review_student, business, how='inner', on='business_id')

Let's reduce this dataframe to the columns that we care about, namely the review rating, whether its written by a student, business' average rating, the amount of reviews that business has, and whether that business is close or far from a university/college.

In [22]:
review_businesses = review_businesses[['city', 'name', 'stars_x', 'student_or_not', 'stars_y', 'review_count', 'close_to_university']]
review_businesses = review_businesses.rename(columns={'name': 'restaurant_name', 'stars_x': 'rating', 'stars_y': 'avg_rating'})
In [23]:
review_businesses
Out[23]:
city restaurant_name rating student_or_not avg_rating review_count close_to_university
0 Tucson Kettle Restaurant 3 True 3.5 47 True
1 Tucson Kettle Restaurant 2 True 3.5 47 True
2 Tucson Kettle Restaurant 5 False 3.5 47 True
3 Tucson Kettle Restaurant 5 False 3.5 47 True
4 Tucson Kettle Restaurant 3 False 3.5 47 True
... ... ... ... ... ... ... ...
44313 Edmonton Dairy Queen Grill & Chill 1 True 2.0 6 True
44314 Tampa Grand China 5 False 3.5 19 False
44315 Philadelphia Dough Boy Pizza 5 False 4.5 11 False
44316 Tucson Burger King 3 False 1.5 21 True
44317 Edmonton Versato's Pizza 5 False 4.5 24 False

44318 rows × 7 columns

Note there are many restaurants with the same name, however as these are different restaurants of the same chain, we have decided to keep them as the information about close or far could be different would be useful to study there. The other reason is that there are many reviews for the same restaurant so in many cases it is still referring to the same restaurant.

Results¶

Exploratory Data Analysis¶

Now we can use the combined reviews and businesses dataframe to answer our research question: Do restaurants further from college/university campuses have higher or lower ratings than those that are closer? Can we determine if the reviewers of these restaurants are students and if so, do students increase or lower Yelp ratings at these restaurants?

In [24]:
# import all the packages we need

import seaborn as sns
import matplotlib.pyplot as plt
import patsy
import statsmodels.api as sm

Average Ratings of Close vs Far Businesses¶

First lets look at the counts of restuarants that are close and restaurants that are far from an university/college.

How many businesses are close to a university?

In [25]:
review_businesses[review_businesses['close_to_university'] == True].shape[0]
Out[25]:
30248

How many businesses are not close to a university?

In [26]:
review_businesses[review_businesses['close_to_university'] == False].shape[0]
Out[26]:
14070

Average rating for business close to the university

In [27]:
review_businesses[review_businesses['close_to_university'] == True]['avg_rating'].mean()
Out[27]:
3.8469981486379266

Average rating for businesses not close to the university

In [28]:
review_businesses[review_businesses['close_to_university'] == False]['avg_rating'].mean()
Out[28]:
3.737775408670931
In [29]:
test_statistic = review_businesses[review_businesses['close_to_university'] == True]['avg_rating'].mean() - review_businesses[review_businesses['close_to_university'] == False]['avg_rating'].mean()
test_statistic
Out[29]:
0.10922273996699561

When we compare the average ratings of restaurants close to and far from university campuses, the average ratings closer to the university are higher, but barely as their difference is less than 1. Because this difference is less than 1, we will be considering the average ratings to be similar. This similarity means that location and distance from a university does not have much impact on the average ratings of restaurants.

We'll be using this value as the test statistic to see whether this difference is observable by chance.

In [30]:
def permutation_tests():
    diff_array = list()
    shuffled_df = review_businesses.copy()
    for i in range (1000):
        shuffled_df['close_to_university'] = np.random.permutation(shuffled_df['close_to_university'])
        close = shuffled_df[shuffled_df['close_to_university'] == True]['avg_rating'].mean()
        not_close = shuffled_df[shuffled_df['close_to_university'] == False]['avg_rating'].mean()
        diff_array.append((close-not_close))
    return np.array(diff_array)

results = permutation_tests()
plt.hist(results)
plt.xlabel('Difference in Means')
plt.ylabel('Count')
results.mean()
np.mean(test_statistic < results)
Out[30]:
0.0
No description has been provided for this image

Since the test value is much less than 0.05, this is a statistically significant value, meaning that it is very unlikely that this relation is due to chance. However, even though it is statiscally significant, the difference between the average ratings is still very small and that means that distance doesn't have much of an impact on the overall average rating.

Lets check out the average ratings of these restaurants per university to see if we can find a difference from the above result. First lets see the number of close/far restaurants we have per university. (Note that since we only chose one university per city, the city name works as a groupby value.)

In [31]:
rb_counts = pd.DataFrame(review_businesses.groupby('city')['close_to_university'].value_counts())
rb_counts
Out[31]:
count
city close_to_university
Edmonton True 786
False 309
Indianapolis True 2382
False 1290
Nashville True 3874
False 1549
New Orleans True 7911
False 508
Philadelphia True 9820
False 2797
Reno True 2271
False 664
Santa Barbara False 2125
True 2
St. Louis False 306
True 79
Tampa False 3385
True 768
Tucson True 2355
False 1137

For most universities/colleges in our above list, there is a big disparity between the number of restaurants that are close versus restaurants that are far.

In [32]:
rb_avgs = pd.DataFrame(review_businesses.groupby(['city', 'close_to_university'])['avg_rating'].mean())
rb_avgs
Out[32]:
avg_rating
city close_to_university
Edmonton False 3.500000
True 3.666667
Indianapolis False 3.724031
True 3.930940
Nashville False 3.559716
True 3.809112
New Orleans False 3.820866
True 3.938693
Philadelphia False 3.747408
True 3.790733
Reno False 3.572289
True 3.854690
Santa Barbara False 3.900000
True 3.500000
St. Louis False 3.767974
True 2.955696
Tampa False 3.771935
True 3.769531
Tucson False 3.683377
True 3.859236

For most universities/colleges above, it seems that restuarants that are further from the university/college are recieving lower ratings on average than restaurants that are close.

Linear Regression on Distance vs Ratings¶

We can perform linear regression to see if distance from a chosen university is a predictor for Yelp ratings.

First let's create a new column for our dataframe that contains the distance from a university/college. We can adapt our calc_distance function from before to calculate distance by removing the threshold check.

In [33]:
def calc_distance(df):
    new_df = df.copy()
    lat, long = universities.loc[df['city'], ['Latitude', 'Longitude']]
    lat1, long1 = df[['latitude', 'longitude']]
    new_df['distance'] = haversine((lat, long), (lat1, long1))
    return new_df
In [34]:
business_lg = business.apply(calc_distance, axis=1)
In [35]:
business_lg
Out[35]:
business_id name address city state postal_code latitude longitude stars review_count is_open attributes categories hours close_to_university distance
3 MTSW4McQd7CbVtyjqoe9mw St Honore Pastries 935 Race St Philadelphia PA 19107 39.955505 -75.155564 4.0 80 1 {'RestaurantsDelivery': 'False', 'OutdoorSeati... Restaurants, Food, Bubble Tea, Coffee & Tea, B... {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ... True 4.050
9 bBDDEgkFA1Otx9Lfe7BZUQ Sonic Drive-In 2312 Dickerson Pike Nashville TN 37207 36.208102 -86.768170 1.5 10 1 {'RestaurantsAttire': ''casual'', 'Restaurants... Ice Cream & Frozen Yogurt, Fast Food, Burgers,... {'Monday': '0:0-0:0', 'Tuesday': '6:0-21:0', '... True 4.010
12 il_Ro8jwPlHresjw9EGmBg Denny's 8901 US 31 S Indianapolis IN 46227 39.637133 -86.127217 2.5 28 1 {'RestaurantsReservations': 'False', 'Restaura... American (Traditional), Restaurants, Diners, B... {'Monday': '6:0-22:0', 'Tuesday': '6:0-22:0', ... True 4.958
15 MUTTqe8uqyMdBl186RmNeA Tuna Bar 205 Race St Philadelphia PA 19106 39.953949 -75.143226 4.0 245 1 {'RestaurantsReservations': 'True', 'Restauran... Sushi Bars, Restaurants, Japanese {'Tuesday': '13:30-22:0', 'Wednesday': '13:30-... False 5.421
19 ROeacJQwBeh05Rqg7F6TCg BAP 1224 South St Philadelphia PA 19147 39.943223 -75.162568 4.5 205 1 {'NoiseLevel': 'u'quiet'', 'GoodForMeal': '{'d... Korean, Restaurants {'Monday': '11:30-20:30', 'Tuesday': '11:30-20... True 3.281
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
150319 8n93L-ilMAsvwUatarykSg Kitchen Gia 3716 Spruce St Philadelphia PA 19104 39.951018 -75.198240 3.0 22 0 {'RestaurantsGoodForGroups': 'True', 'BikePark... Coffee & Tea, Food, Sandwiches, American (Trad... {'Monday': '9:0-19:30', 'Tuesday': '9:0-19:30'... True 0.698
150321 AM7O0cwkxm6w_e0Q7-f9FQ Starbucks 8817 S US-31 Indianapolis IN 46227 39.638245 -86.128069 4.0 29 1 {'RestaurantsPriceRange2': '1', 'Caters': 'Fal... Food, Coffee & Tea {'Monday': '6:0-21:0', 'Tuesday': '6:0-21:0', ... True 4.864
150322 2MAQeAqmD8enCT2ZYqUgIQ The Melting Pot - Nashville 166 2nd Ave N, Ste A Nashville TN 37201 36.163875 -86.776311 4.0 204 0 {'RestaurantsDelivery': 'False', 'RestaurantsR... Fondue, Beer, Wine & Spirits, Food, Restaurants {'Monday': '0:0-0:0', 'Tuesday': '16:0-21:0', ... True 3.088
150336 WnT9NIzQgLlILjPT0kEcsQ Adelita Taqueria & Restaurant 1108 S 9th St Philadelphia PA 19147 39.935982 -75.158665 4.5 35 1 {'WheelchairAccessible': 'False', 'Restaurants... Restaurants, Mexican {'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'... True 3.734
150340 hn9Toz3s-Ei3uZPt7esExA West Side Kebab House 2470 Guardian Road NW Edmonton AB T5T 1K8 53.509649 -113.675999 4.5 18 0 {'Ambience': '{'touristy': False, 'hipster': F... Middle Eastern, Restaurants {'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'... False 16.999

25000 rows × 16 columns

In [36]:
review_business_lg = pd.merge(review_student, business_lg, how='inner', on='business_id')
review_business_lg = review_business_lg[['city', 'name', 'stars_x', 'student_or_not', 'stars_y', 'review_count', 'close_to_university', 'distance']]
review_business_lg = review_business_lg.rename(columns={'name': 'restaurant_name', 'stars_x': 'rating', 'stars_y': 'avg_rating'})

Let's check the distribution on distances.

In [37]:
sns.histplot(data=business_lg['distance'])

f1 = plt.gcf()
No description has been provided for this image

It looks like there's an outlier, lets remove it and check the distribution again.

In [38]:
business_lg = business_lg.drop(business_lg['distance'].idxmax())
In [39]:
sns.histplot(data=business_lg['distance'])
f1 = plt.gcf()
No description has been provided for this image

Let's check the scatterplot to see if we can spot a linear relation between the Yelp ratings of restaurants and the distance they are from a university/college.

In [40]:
sns.scatterplot(data=review_business_lg, y='rating', x='distance')
Out[40]:
<AxesSubplot:xlabel='distance', ylabel='rating'>
No description has been provided for this image
In [41]:
sns.scatterplot(data=review_business_lg, x='distance', y='avg_rating')
Out[41]:
<AxesSubplot:xlabel='distance', ylabel='avg_rating'>
No description has been provided for this image

From the scatterplot there seems to be no relation between Yelp ratings and distance as there is about the same amount of low and high ratings for restaurants of all distances from an university/college. (Note that the scatterplot looks like this because the Yelp ratings are at hard intervals) If we were to draw a line between these points it should look flat. Lets check using linear regression.

In [42]:
outcome, predictors = patsy.dmatrices('rating ~ distance', review_business_lg)
mod = sm.OLS(outcome, predictors)
res_1 = mod.fit()
In [43]:
print(res_1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 rating   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     41.19
Date:                Tue, 19 Mar 2024   Prob (F-statistic):           1.40e-10
Time:                        21:53:49   Log-Likelihood:                -74091.
No. Observations:               44318   AIC:                         1.482e+05
Df Residuals:                   44316   BIC:                         1.482e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.8933      0.008    459.921      0.000       3.877       3.910
distance      -0.0078      0.001     -6.418      0.000      -0.010      -0.005
==============================================================================
Omnibus:                     5044.150   Durbin-Watson:                   1.701
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             6916.979
Skew:                          -0.962   Prob(JB):                         0.00
Kurtosis:                       2.795   Cond. No.                         9.68
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [44]:
outcome_2, predictors_2 = patsy.dmatrices('avg_rating ~ distance', review_business_lg)
mod_2 = sm.OLS(outcome_2, predictors_2)
res_2 = mod_2.fit()
In [45]:
print(res_2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             avg_rating   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     351.9
Date:                Tue, 19 Mar 2024   Prob (F-statistic):           3.34e-78
Time:                        21:53:50   Log-Likelihood:                -36320.
No. Observations:               44318   AIC:                         7.264e+04
Df Residuals:                   44316   BIC:                         7.266e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.8591      0.004   1069.028      0.000       3.852       3.866
distance      -0.0098      0.001    -18.758      0.000      -0.011      -0.009
==============================================================================
Omnibus:                     7562.557   Durbin-Watson:                   0.181
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            14489.829
Skew:                          -1.058   Prob(JB):                         0.00
Kurtosis:                       4.835   Cond. No.                         9.68
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In both linear regressions, the first trying to predict review rating, and the second trying to predict a restaurant's average rating, both using distance, we see a low pvalue in the distance row meaning that it is statistically significant and likely nonzero. However the coefficient for distance is still very small, despite the value being statistically significant so distance doesn't have much of an effect on reveiw rating or average rating.

However, it might be the case for the average ratings, that because there are many repeated values, it could affect the model. Lets revert back to just the business_lg dataframe and run linear regression using that.

In [46]:
outcome_3, predictors_3 = patsy.dmatrices('stars ~ distance', business_lg)
mod_3 = sm.OLS(outcome_3, predictors_3)
res_3 = mod_3.fit()
In [47]:
print(res_3.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  stars   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     193.8
Date:                Tue, 19 Mar 2024   Prob (F-statistic):           6.83e-44
Time:                        21:53:50   Log-Likelihood:                -30688.
No. Observations:               24999   AIC:                         6.138e+04
Df Residuals:                   24997   BIC:                         6.140e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.6597      0.008    450.330      0.000       3.644       3.676
distance      -0.0135      0.001    -13.922      0.000      -0.015      -0.012
==============================================================================
Omnibus:                     1452.109   Durbin-Watson:                   2.011
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1721.998
Skew:                          -0.643   Prob(JB):                         0.00
Kurtosis:                       2.995   Cond. No.                         13.1
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Again we see a pvalue of 0! This means that we are pretty confidant in our coefficient. However, the coefficient for distance is -0.0014, a value that is pretty much 0. So even in the case that we are confident, distance barely affects the Yelp rating of a restaurant (when we work in a scale of 0.5, changes on a scale of 0.0014 don't really matter).

Lets check these coefficients per university to see if this is different for some specific universities.

In [48]:
# do linear regression per city, since we only picked one university/college per city
for city in review_business_lg['city'].unique():
    out, pred = patsy.dmatrices('avg_rating ~ distance', review_business_lg[review_business_lg['city']==city])
    mod = sm.OLS(out, pred)
    res = mod.fit()
    print(city, 'distance coef, pvalue:', res.params[1], res.pvalues[1])
Tucson distance coef, pvalue: -0.02348014710517908 1.973349041034627e-29
Philadelphia distance coef, pvalue: 0.002771896479638284 0.060048907543371485
New Orleans distance coef, pvalue: -0.019801151691254182 4.4198834660608146e-13
Santa Barbara distance coef, pvalue: -0.01908444339580161 6.537725936687052e-10
Indianapolis distance coef, pvalue: -0.028078788478805405 2.5607005144326463e-33
Tampa distance coef, pvalue: 0.000914553749047525 0.5695756118487593
Nashville distance coef, pvalue: -0.03489970549495028 8.052083656934554e-101
Reno distance coef, pvalue: -0.06509373125880094 5.080044002626258e-59
Edmonton distance coef, pvalue: -0.044720669841808114 2.2169385782843615e-14
St. Louis distance coef, pvalue: 0.05239540816671164 0.0006505199988430329
In [49]:
for city in business_lg['city'].unique():
    out, pred = patsy.dmatrices('stars ~ distance', business_lg[business_lg['city']==city])
    mod = sm.OLS(out, pred)
    res = mod.fit()
    print(city, 'distance coef, pvalue:', res.params[1], res.pvalues[1])
Philadelphia distance coef, pvalue: -0.021001644673470512 6.938766368582747e-23
Nashville distance coef, pvalue: -0.02282392993556582 6.193023588559855e-12
Indianapolis distance coef, pvalue: -0.027594791297863783 7.873703222892081e-21
Edmonton distance coef, pvalue: -0.0226297324370053 1.2314762213914875e-08
Reno distance coef, pvalue: -0.03741402462439515 4.5181943588653566e-08
Tucson distance coef, pvalue: -0.017036831481580986 1.9837803869667264e-08
Tampa distance coef, pvalue: -0.005786451965124915 0.03325085579968091
Santa Barbara distance coef, pvalue: 0.027875758165070933 3.2117868091776005e-05
New Orleans distance coef, pvalue: -0.02132569106730503 3.8296498449916786e-05
St. Louis distance coef, pvalue: 0.013749277877485347 0.1674120455876262

Even separated among specific universities/colleges, we see mostly the same results. It is dominated by the fact that the coefficient for distance is still extremely small, thus distance doesn't have much of an effect on rating.

Student vs Non-student Reviews¶

Ultimately, we want to determine how students from these campuses impact these restaurants, i.e. do they increase or decrease restaurant's Yelp ratings? Let's start by checking the reviews of the restaurants that are close. How many are of these reviews are from students and how many are from non students?

In [50]:
close_reviews=review_businesses[review_businesses['close_to_university'] == True]
is_student=close_reviews[close_reviews['student_or_not']==True].shape[0]
print(is_student,'of the reviews of restaurants that are close to campus are from students')
not_student=close_reviews[close_reviews['student_or_not']==False].shape[0]
print(not_student, 'of the reviews of restaurants that are close to campus are not from students')
14069 of the reviews of restaurants that are close to campus are from students
16179 of the reviews of restaurants that are close to campus are not from students

What is the average Yelp rating among the reviews from students? What is the average Yelp rating among the reviews from non students?

In [51]:
avg_rating_student=close_reviews[close_reviews['student_or_not']==True]['avg_rating'].mean()
print(avg_rating_student,'is the average Yelp rating among the reviews from students')
avg_rating_nonstudent=close_reviews[close_reviews['student_or_not']==False]['avg_rating'].mean()
print(avg_rating_nonstudent,' is the average Yelp rating among the reviews from non students')
3.829945269741986 is the average Yelp rating among the reviews from students
3.861827059768836  is the average Yelp rating among the reviews from non students

According to the previous calculation, we can see that the average ratings from non-students is slightly higher than the average ratings from non-students(the difference is smaller than 1). In order to further compare between the mean ratings of reviews from students and non-students, we decided to perform a permutation test to assess the statistical significance of the difference. The test statistic in this permutation test would be the difference in the average ratings between students and non-students from our dataset.

In [52]:
test_statistic = review_businesses[review_businesses['student_or_not'] == False]['avg_rating'].mean() - review_businesses[review_businesses['student_or_not'] == True]['avg_rating'].mean()
test_statistic
Out[52]:
0.02512900573164778

By comparing the original test statistic to the distribution of the test statistics from the permuted samples, we can determine how extreme the original difference is.

In [53]:
def permutation_tests():
    diff_array = list()
    shuffled_df = review_businesses.copy()
    for i in range (1000):
        shuffled_df['student_or_not'] = np.random.permutation(shuffled_df['student_or_not'])
        student = shuffled_df[shuffled_df['student_or_not'] == True]['avg_rating'].mean()
        nonstudent = shuffled_df[shuffled_df['student_or_not'] == False]['avg_rating'].mean()
        diff_array.append((nonstudent-student))
    return np.array(diff_array)

results = permutation_tests()
plt.hist(results)
plt.xlabel('Difference in Means')
plt.ylabel('Count')
results.mean()
np.mean(test_statistic <= results)
Out[53]:
0.0
No description has been provided for this image

The p-value from the permutation test is 0.001 which is way below the 0.05/0.01 significance level. Therefore, we can conclude that the difference in means between average ratings from student and non-student reviews is statistically significant, meaning that it's unlikely the observed difference occurred by random chance. However, the difference could be statistically significant but still small or unimportant in practical terms given the difference of observed ratings between the two groups is very small (or almost negligible).

We can again split the analysis of reviews up by university to see if some universities have different results from others.

In [54]:
city_to_university = universities['University Name'].to_dict()

review_businesses['university_name'] = review_businesses['city'].map(city_to_university)

review_businesses
Out[54]:
city restaurant_name rating student_or_not avg_rating review_count close_to_university university_name
0 Tucson Kettle Restaurant 3 True 3.5 47 True University of Arizona Tuscon
1 Tucson Kettle Restaurant 2 True 3.5 47 True University of Arizona Tuscon
2 Tucson Kettle Restaurant 5 False 3.5 47 True University of Arizona Tuscon
3 Tucson Kettle Restaurant 5 False 3.5 47 True University of Arizona Tuscon
4 Tucson Kettle Restaurant 3 False 3.5 47 True University of Arizona Tuscon
... ... ... ... ... ... ... ... ...
44313 Edmonton Dairy Queen Grill & Chill 1 True 2.0 6 True University of Alberta
44314 Tampa Grand China 5 False 3.5 19 False University of South Florida
44315 Philadelphia Dough Boy Pizza 5 False 4.5 11 False University of Pennsylvania
44316 Tucson Burger King 3 False 1.5 21 True University of Arizona Tuscon
44317 Edmonton Versato's Pizza 5 False 4.5 24 False University of Alberta

44318 rows × 8 columns

In [55]:
university_city_groups = review_businesses.groupby(['university_name', 'city'])

university_city_averages = []

for (university, city), group in university_city_groups:
    avg_rating_students = group[group['student_or_not'] == True]['rating'].mean()
    avg_rating_nonstudents = group[group['student_or_not'] == False]['rating'].mean()
    
    university_city_averages.append({
        'University Name': university,
        'City': city,
        'Average Rating from Students': avg_rating_students,
        'Average Rating from Non-Students': avg_rating_nonstudents
    })

summary_table = pd.DataFrame(university_city_averages)

summary_table
Out[55]:
University Name City Average Rating from Students Average Rating from Non-Students
0 Purdue University Indianapolis 3.831654 3.903778
1 Tulane University New Orleans 3.855110 4.061557
2 UC Santa Barbara Santa Barbara 3.801944 4.063650
3 University of Alberta Edmonton 3.608130 3.672917
4 University of Arizona Tuscon Tucson 3.844771 3.876812
5 University of Nevada Reno 3.737705 3.984840
6 University of Pennsylvania Philadelphia 3.751565 3.905575
7 University of South Florida Tampa 3.674469 3.886010
8 Vanderbilt University Nashville 3.602155 3.892362
9 Washington University in St. Louis St. Louis 3.705263 3.635897

Looking at the overall average ratings from students and non-students, we can observe that in 9 out of the 10 cities, the average ratings were lower when they were submitted from students compared to the ratings submitted from non-students. We also want to emphasize how the different is not significantly lower, so this shows that students may have an effect, but the effect isn't big.

Ethics & Privacy¶

We need inclusive data representation, by ensuring the dataset adequately represents different types of restaurants and doesn’t exclude any significant group. We must ensure that the data sources that we use comply with ethical standards and privacy laws. The data should be publicly available and not include any personal information about individuals who submitted the ratings. We must review the data and results to monitor for biases.

The data on Yelp and Google Maps is self-uploaded, so some restaurants may not appear. Since the data is self-uploaded, we believe the restaurants would be open to having other people see it. The restaurants are self-uploaded. After a restaurant is uploaded, they cannot be taken down unless the restaurant has closed down. Since closed restaurants do not show up, the data could be biased. For instance, restaurants that have closed down could potentially have lower ratings, but these lower ratings are no longer part of the data. Another bias could be with online reviews in general. Restaurants could encourage positive reviews by offering a discount or a free dessert to customers. Another concern is that people who have a negative experience might feel frustrated and post negative reviews while a person who has a great experience has no problems, so they also don’t feel the need to make a review. Negative reviews may be overrepresented.

We must develop a methodology that addresses these issues, such as figuring out how we are going to include each level of budget, restaurant type, etc. We have to be sure to exclude personal identifiers (such as names) for those who submitted reviews. We also have to make sure to remove the reviewer’s names. We will include restaurants in the main price categories and all cuisines that are ordered to avoid bias.

Discussion and Conclusion¶

Our comprehensive data analysis so far has provided us with insightful revelations about the relationship between a restaurant’s distance to universities/colleges campuses in different cities and its Yelp ratings, as well as the influence of student reviewers on these ratings. After wrangling the raw dataset via various methods, we finalized it into a comprehensive dataframe that contains the important columns of information that we wanted like ‘is the restaurant close to campus or not’, ‘is the reviewers a student or not’ and ‘yelp ratings of these restaurants’ etc. And then our team conducted a series of data analysis on this dataframe using relevant statistical techniques and methods including performing permutation test,t-test and linear regression. Based on what we had so far, we can conclude that the distance of restaurants from campuses bears very minimal correlation with their Yelp ratings, suggesting that distance of restaurants from campuses is not necessarily important in how customers perceive or rate these restaurants. Besides, the analysis indicates that ratings submitted by students would lower the average ratings of these restaurants and that’s a general trend across the 10 universities that we had for our analysis, but the influence is too small to be actually taken into consideration in practical terms.

However, we do believe that there are several limitations in regards to the scope and outcomes of our analysis. One of the primary constraints was the selection of only 10 cities and universities for our analysis. This choice was dictated by time constraints and the fact that only 49% of the top 29 cities in our dataset met our research criteria. Consequently, this limitation raises questions about the generalizability of our findings to other cities and university contexts worldwide. Additionally, the method we used to distinguish between student and non-student reviews faced challenges, with a notable presence of false positives and negatives. This issue potentially affected the accuracy of our interpretations. Expanding the dataset to include a more diverse range of cities and improving the accuracy of identifying student reviews are helpful for better future research. These enhancements would not only address the limitations of our current research but also provide a more comprehensive understanding of how location and customer demographics influence restaurant ratings on platforms like Yelp.

We believe our work contributes to a broader understanding of what influences students’ patronage at restaurants. While other previous studies have focused on factors like food quality, service,decor and price, our research added a new dimension by exploring the spatial aspect of dining preferences. Future research can be done based on our work to investigate more about the factors that might influence consumer behaviors and to provide inspiration for restaurant owners on how to attract more customers.

Team Contributions¶

  • Carmen Truong - Finding dataset, Reviews dataset wrangling, Close/Far average business rating eda, conclusion of video
  • Zifan Luo - Finding dataset, Reviews dataset wrangling, Student vs Non-student reviews eda, conclusion
  • Rabih Siddiqui - Compiled universities dataset, Student vs Non-student reviews eda, editing video, ethics and privacy
  • Lewis Weng - Businesses dataset wrangling, close/far average business rating eda, ethics and privacy, abstract
  • Jacob Lin - Businesses dataset wrangling, businesses linear regression, intro of video, compiling work and uploading notebooks