Introduction

Dataset and Scraping Information

The data used in this analysis was scraped from Pitchfork.com using a Python program built on the BeautifulSoup package. For each review, the important features (such as score, date, and genre) were gathered along with the text of the review itself. The scrape covers the entire corpus of the website through the date of scraping (December 6th, 2017): nearly 20,000 reviews (19,555 rows) dating back to January 5th, 1999. The data can be viewed and downloaded on Kaggle.

library(tidyverse)
library(plotly)
p4k <- read.csv("C:/Users/Evan/Documents/p4kdata/p4kreviews.csv")
glimpse(p4k)
## Observations: 19,555
## Variables: 8
## $ X      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ album  <fctr> A.M./Being There, No Shame, Material Control, Weighing...
## $ artist <fctr> Wilco, Hopsin, Glassjaw, Nabihah Iqbal, Neil Young / P...
## $ best   <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1...
## $ date   <fctr> December 6 2017, December 6 2017, December 6 2017, Dec...
## $ genre  <fctr> Rock, Rap, Rock, Pop/R&B, Rock, Pop/R&B, Pop/R&B, Rap,...
## $ review <fctr> Best new reissue 1 / 2 Albums Newly reissued and remas...
## $ score  <dbl> 7.0, 3.5, 6.6, 7.7, 6.7, 9.0, 5.8, 6.2, 5.3, 8.1, 4.1, ...

Time Series Processing

Initially, each date is available only as a text string such as "December 6 2017" (read in as a factor). We can do some feature engineering with the lubridate package to turn this into individual year, month, and day variables, as well as some more sophisticated features like day of the week and week of the month for use in a later plot.

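# Parse the dates (month, day, year) and split out the components
library(lubridate)
p4k$date <- gsub(" ", "-", p4k$date) # "December 6 2017" -> "December-6-2017"
p4k$date <- mdy(p4k$date)
p4k <- p4k %>% mutate(year = year(date), month = month(date), day = day(date))
# Advanced date manipulation
library(scales)
library(zoo)
p4k$yearmonth <- as.yearmon(p4k$date)
p4k$yearmonthf <- factor(p4k$yearmonth)
p4k$week <- week(p4k$date) # week of the year
# Week of the month, counted from the earliest review week within each month
p4k <- plyr::ddply(p4k, plyr::.(yearmonthf), transform, monthweek = 1 + week - min(week))
p4k$weekday <- factor(weekdays(p4k$date))
# Turn month numbers into abbreviations, keeping calendar order rather than alphabetical
p4k$month <- factor(month.abb[p4k$month], levels = month.abb)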
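To make the week-of-month step concrete, here is a minimal base-R sketch on three toy dates written in the same format as the date column. It is not the code used above: it swaps mdy() for as.Date() (which assumes an English locale for %B) and mirrors what week() computes with a day-of-year calculation. The toy_dates data frame and its column values are introduced here for illustration only.

```r
# Three toy dates in the same "Month D YYYY" format as the date column
raw <- c("December 6 2017", "December 1 2017", "January 5 1999")

# Base-R parse; %B matches full month names and assumes an English locale
toy_dates <- data.frame(date = as.Date(raw, format = "%B %d %Y"))

# Year-month key used for grouping, e.g. "2017-12"
toy_dates$yearmonth <- format(toy_dates$date, "%Y-%m")

# Week of the year: complete 7-day periods since January 1st, plus one
# (POSIXlt stores yday as a 0-based day of the year)
toy_dates$week <- as.POSIXlt(toy_dates$date)$yday %/% 7 + 1

# Week of the month, relative to the earliest week seen in that month
toy_dates$monthweek <- 1 + toy_dates$week -
  ave(toy_dates$week, toy_dates$yearmonth, FUN = min)

print(toy_dates)
```

Note that monthweek is relative to the earliest week that actually appears in the data for a given month, so a month whose first review falls late would still start at 1.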

Exploratory Analysis

Score Analysis

We can start off by visualizing the score distribution over all reviews. The vertical line marks the mean score, which sits right around 7.

ggplotly(p4k %>% 
  ggplot(aes(x=score)) + 
  geom_histogram(bins=50) + 
  geom_vline(xintercept=mean(p4k$score)) + 
  theme_bw() + 
  scale_x_continuous(breaks = c(1:10)) + 
  labs(x = "Score",y="Count"))
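As a quick numeric companion to the histogram, the mean and median can be computed directly. The sketch below uses only the first eleven scores visible in the glimpse() output above, so its numbers describe that small sample and not the full dataset; sample_scores is a name introduced here for illustration.

```r
# First eleven scores as printed in the glimpse() output
sample_scores <- c(7.0, 3.5, 6.6, 7.7, 6.7, 9.0, 5.8, 6.2, 5.3, 8.1, 4.1)

# The same statistic that geom_vline() draws in the plot
sample_mean <- mean(sample_scores)

# The median is less sensitive to a few extreme scores
sample_median <- median(sample_scores)

# Counts per whole-point bin, similar in spirit to histogram bars
bin_counts <- table(cut(sample_scores, breaks = 0:10))

print(sample_mean)
print(sample_median)
print(bin_counts)
```

On the full data, replacing sample_scores with p4k$score would give the values behind the plot.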

We can also observe how the scores change over time. The month and day plots aren’t especially illuminating beyond highlighting the inherent variation in scoring over long periods of time, but the years plot shows a general upward trend in review scores since around 2009.

ggplotly(p4k %>% 
  group_by(year) %>% 
  summarise(avg_score = mean(score)) %>% # collapse to one row per year
  ggplot(aes(x = year, y = avg_score)) + 
  geom_line() + 
  scale_x_continuous(breaks=seq(1999, 2017, by = 2)))
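A line plot of yearly averages needs only one value per year. The base-R sketch below shows that collapse on a few invented rows, using aggregate() in place of the dplyr pipeline. The toy_scores values are made up purely for illustration and are not Pitchfork data.

```r
# Invented rows for illustration only (not real review scores)
toy_scores <- data.frame(
  year  = c(1999, 1999, 2000, 2000, 2000),
  score = c(6.0, 8.0, 7.0, 7.5, 8.0)
)

# Collapse to one row per year, holding that year's mean score
yearly <- aggregate(score ~ year, data = toy_scores, FUN = mean)
names(yearly)[2] <- "avg_score"

print(yearly)
```

The result has one row per distinct year, which is the shape geom_line() expects when drawing a single trend line.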