Sys.setenv(SPOTIFY_CLIENT_ID = SPOTIFY_CLIENT_ID)
Sys.setenv(SPOTIFY_CLIENT_SECRET = SPOTIFY_CLIENT_SECRET)
authorization_code <- get_spotify_authorization_code(scope = scopes()[c(1:19)]) #sets an authorization code that you'll need to provide for certain get_ functions, e.g. my_tracks <- get_my_saved_tracks(authorization = authorization_code)

access_token <- get_spotify_access_token() #takes ID and SECRET, sends them to Spotify, and receives an access token
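The two Sys.setenv() calls above assume SPOTIFY_CLIENT_ID and SPOTIFY_CLIENT_SECRET objects already exist in the session. A minimal sketch of an alternative (not part of the original workflow) is to store the credentials in ~/.Renviron, which R reads at startup, so they never appear in the script:

# sketch: add these two lines (with your real values) to ~/.Renviron instead of the script
#   SPOTIFY_CLIENT_ID=your-client-id
#   SPOTIFY_CLIENT_SECRET=your-client-secret
# after restarting R, the spotifyr helpers can read them from the environment
nchar(Sys.getenv("SPOTIFY_CLIENT_ID")) > 0  # quick check that the variable is set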
Project Overview
In this project, I am interested in building my machine learning skills through a fun exercise with Spotify!
Using my liked songs and my friend's liked songs, can I build a machine learning model to predict whose playlist a song belongs to?
Data & Methods
For this project, I will be requesting data from the Spotify API using the {spotifyr}
package and using the data to build k-nearest neighbor, decision tree, bagged tree, and random forest models.
Data Preparation
Using get_my_saved_tracks(), the Spotify API returns a dataframe of tracks and associated attributes. However, it will only return up to 50 tracks at a time, so I will need to make multiple requests. To do this, I will use a loop to combine all of the requests into one dataframe.
offsets <- seq(from = 0, to = 150, by = 50)

#initializing an empty df
my_tracks <- data.frame(matrix(nrow = 0, ncol = 30))

# loop to get my 200 most recently liked tracks (4 requests of 50)
for (i in seq_along(offsets)) {
  liked_tracks <- get_my_saved_tracks(authorization = authorization_code, limit = 50,
                                      offset = offsets[i])
  df_temp <- as.data.frame(liked_tracks) # creating a temporary data frame
  my_tracks <- rbind(my_tracks, df_temp) # binding the temporary data frame to my tracks data frame
}
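For reference, the same paging can be written more compactly with purrr instead of an explicit loop; a sketch, assuming the same authorization_code as above:

# sketch: equivalent paging with purrr; map_dfr() row-binds the four 50-track pages
library(purrr)
my_tracks <- map_dfr(seq(from = 0, to = 150, by = 50), function(off) {
  get_my_saved_tracks(authorization = authorization_code, limit = 50, offset = off)
})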
Additionally, by giving the API a list of track IDs using get_track_audio_features(), I can get a dataframe of audio features for each track. This endpoint only accepts up to 100 IDs per request, so I split my 200 track IDs across two calls.
audio1 <- get_track_audio_features(my_tracks$track.id[1:100])
audio2 <- get_track_audio_features(my_tracks$track.id[101:200])

audio_features <- rbind(audio1, audio2)
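If the number of tracks changes, the batching can be done automatically; a sketch that gives the same result as the two manual calls above, assuming my_tracks holds the saved tracks:

# sketch: split the track IDs into batches of 100 and request audio features for each batch
id_batches <- split(my_tracks$track.id, ceiling(seq_along(my_tracks$track.id) / 100))
audio_features <- purrr::map_dfr(id_batches, get_track_audio_features)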
These track audio features are the predictors we are interested in, but this dataframe doesn’t have the actual names of the tracks, so I need to append the ‘track.name’ column from my favorite tracks dataframe.
sam_audio <- my_tracks %>%
  select(track.name) %>%
  bind_cols(audio_features) %>%
  mutate(name = "sam")
One of my friends followed these same steps, and she sent me a .csv of her track audio features, which I will bind with my audio features dataframe.
# read in anna audio
anna_audio <- read_csv(here::here("projects/2024-04-13-spotify-predictions/named_audio_anna.csv")) %>%
  mutate(name = "anna")

# bind mine and anna's data
both_audio <- rbind(sam_audio, anna_audio)
Data Exploration
Now that the data is ready to go, let’s do a little data exploration. There are a lot of cool audio features to explore.
ggplot(both_audio, aes(x = danceability, fill = name)) +
geom_density(alpha=0.5) +
scale_fill_manual(values=c("magenta4", "seagreen"))+
labs(x="Danceability", y="Density", title = "Distribution of Music Danceability Data") +
guides(fill=guide_legend(title="Listener")) +
theme_minimal()
ggplot(both_audio, aes(x = energy, fill = name)) +
geom_density(alpha=0.6) +
scale_fill_manual(values=c("magenta4", "seagreen"))+
labs(x="Energy", y="Density", title = "Distribution of Music Energy Data") +
guides(fill=guide_legend(title="Listener")) +
theme_minimal()
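The density plots give a visual comparison; a quick numeric summary of a few features by listener complements them. A sketch using the both_audio dataframe built above:

# sketch: mean of a few audio features for each listener
both_audio %>%
  group_by(name) %>%
  summarise(across(c(danceability, energy, valence, tempo), \(x) mean(x, na.rm = TRUE)))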
rbind(anna_audio[which.max(anna_audio$danceability),],
      sam_audio[which.max(sam_audio$danceability),]) %>%
  select(track.name, danceability, name) %>%
  gt::gt(caption = "Liked tracks with the highest danceability")
| track.name               | danceability | name |
|--------------------------|--------------|------|
| Chilling At Nemu's Place | 0.943        | anna |
| Als ich ein Kind war     | 0.948        | sam  |
Building Machine Learning Models
Modeling Prep
Further preparation is needed before running the models. First, I need to remove unnecessary columns, including the track URLs and the track name. Next, I'll split the data into training and testing sets, using a 75:25 split. Finally, I'll create my recipe, dummy-coding the nominal predictors and normalizing the numeric predictors, and prep it.
# prepare data ----
all_tracks_modeling <- both_audio %>%
  mutate_if(is.ordered, .funs = factor, ordered = F) %>%
  select(-track.name, -type, -id, -uri, -track_href, -analysis_url) %>%
  mutate(name = as.factor(name))

# splitting the data ----
set.seed(123)
tracks_split <- initial_split(all_tracks_modeling, prop = 0.75)
tracks_train <- training(tracks_split)
tracks_test <- testing(tracks_split)

# create a recipe ----
tracks_recipe <- recipe(name ~ ., data = tracks_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>% #normalize numeric predictors to make sure scale is okay
  prep()
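Before moving on to the models, it can be worth peeking at what comes out of the prepped recipe and confirming the two listeners are roughly balanced in the training split. A quick sketch:

# sketch: inspect the processed training data and the class balance
baked_train <- bake(tracks_recipe, new_data = NULL) # for a prepped recipe, new_data = NULL returns the training set
dplyr::glimpse(baked_train)
baked_train %>% count(name)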
K-Nearest Neighbor
The first model I want to build uses k-nearest neighbor. This is a classification problem (classifying a track as belonging to either my liked songs or my friend's liked songs), so I will set_mode() to "classification" and set_engine() to "kknn".
# Define our KNN model with tuning ----
knn_spec_tune <- nearest_neighbor(neighbors = tune()) %>%
  set_mode("classification") %>%
  set_engine("kknn")

# Check the model
#knn_spec_tune
Next I need to define my workflow by adding the model and recipe I defined above.
# Define a new workflow ----
wf_knn_tune <- workflow() %>%
  add_model(knn_spec_tune) %>%
  add_recipe(tracks_recipe)
To hopefully increase the accuracy of my model, I will use 10-fold cross-validation to further split and train on my data.
# 10-fold CV on the training dataset ----
set.seed(123)
cv_folds <- tracks_train %>%
  vfold_cv(v = 10)
Now that everything is ready to go, I can fit the workflow on the folds and check out the parameter tuning.
# Fit the workflow on our predefined folds and hyperparameters ----
fit_knn_cv <- wf_knn_tune %>%
  tune_grid(resamples = cv_folds,
            grid = 10)

# Check the performance with collect_metrics()
#collect_metrics(fit_knn_cv)
# plot cv results for parameter tuning ----
autoplot(fit_knn_cv) +
theme_bw()
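Before finalizing, show_best() gives a quick numeric view of the top candidate values of neighbors; a sketch (the plot above shows the same information):

# sketch: top 5 candidate values of `neighbors`, ranked by ROC AUC
show_best(fit_knn_cv, metric = "roc_auc", n = 5)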
Using the fit workflow, I can finalize and select the best iteration based on the ROC AUC and predict on the training and testing data.
# The final workflow for our KNN model ----
final_knn_wf <- wf_knn_tune %>%
  finalize_workflow(select_best(fit_knn_cv, metric = "roc_auc"))

train_knn_fit <- fit(final_knn_wf, tracks_train)

train_predict <- predict(object = train_knn_fit, new_data = tracks_train) %>% #predict on the training set
  bind_cols(tracks_train) #bind the training set columns to the predictions

test_knn_predict <- predict(train_knn_fit, tracks_test) %>% #get class predictions for the test set
  bind_cols(tracks_test) %>% #bind to the testing columns
  mutate(name = as.factor(name))
With my model now finalized, we can look at the accuracy of the training predictions compared to the testing predictions.
# report the accuracy for the training and testing ----
accuracy(train_predict, truth = name, estimate = .pred_class) #get training accuracy
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.823
accuracy(test_knn_predict, truth = name, estimate = .pred_class) #get accuracy of testing prediction
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.63
The training accuracy (0.82) is noticeably higher than the testing accuracy (0.63), which suggests the KNN model is overfitting to the training data. Here is the optimized workflow for the KNN model along with the resulting metrics.
# Optimizing workflow ----
final_knn_fit <- final_knn_wf %>%
  last_fit(tracks_split)

# last_fit() fits on the training data and then also evaluates on the testing data
final_knn_result <- last_fit(final_knn_wf, tracks_split)

# testing predictions and metrics ----
#final_knn_result$.predictions
knn_predict_data <- as.data.frame(final_knn_result$.predictions) %>%
  bind_cols(tracks_test)

final_knn_result$.metrics
[[1]]
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.63 Preprocessor1_Model1
2 roc_auc binary 0.661 Preprocessor1_Model1
Decision Tree
Let’s build another model! This time I am going to be building a decision tree and tuning model hyperparameters. The hyperparameters I have chosen to tune are cost complexity, tree depth, and the minimum number of data points in a node.
#new spec, tell the model that we are tuning hyperparams ----
tree_spec_tune <- decision_tree(
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_grid <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 5)
#head(tree_grid)
Now I can define a new workflow and set up the k-folds.
# setting up the workflow ----
wf_tree_tune <- workflow() %>%
  add_recipe(tracks_recipe) %>%
  add_model(tree_spec_tune)

# set up k-fold cv ----
tracks_cv <- tracks_train %>%
  vfold_cv(v = 10)
#tracks_cv
Since tuning the decision tree is more computationally expensive, I am going to use parallel processing to speed up the grid search.
# build trees ----
doParallel::registerDoParallel() #build trees in parallel

tree_rs <- tune_grid(
  wf_tree_tune,
  resamples = tracks_cv,
  grid = tree_grid,
  metrics = metric_set(accuracy)
)
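registerDoParallel() with no arguments uses a default number of workers. A sketch of a more explicit setup (and cleanup afterwards), using the base parallel package:

# sketch: register an explicit cluster for the grid search and stop it when done
cl <- parallel::makeCluster(parallel::detectCores() - 1)
doParallel::registerDoParallel(cl)
# ... run tune_grid() as above ...
parallel::stopCluster(cl)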
# Use autoplot() to examine how different parameter configurations relate to accuracy ----
autoplot(tree_rs) +
theme_bw()
With the grid tuning finished I can select the best hyperparameters, finalize the workflow, fit the data, and make predictions on the testing data.
# select hyperparameter ----
#show_best(tree_rs) # showing
select_best(tree_rs) # what we'll input
# A tibble: 1 × 4
cost_complexity tree_depth min_n .config
<dbl> <int> <int> <chr>
1 0.0000000001 4 30 Preprocessor1_Model081
# finalize the model specification where we have replaced the tune functions with optimized values ----
final_tree <- finalize_workflow(wf_tree_tune, select_best(tree_rs))
#final_tree

# final fitting ----
final_tree_fit <- fit(final_tree, data = tracks_train)

# last_fit() fits on the training data and then also evaluates on the testing data
final_tree_result <- last_fit(final_tree, tracks_split)

# testing predictions and metrics ----
#final_tree_result$.predictions
final_tree_result$.metrics
[[1]]
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.65 Preprocessor1_Model1
2 roc_auc binary 0.626 Preprocessor1_Model1
predict_tree_data <- predict(final_tree_fit, tracks_test) %>% #get class predictions for the test set
  bind_cols(tracks_test) %>% #bind to the testing columns
  mutate(name = as.factor(name))

#Visualize variable importance ----
final_tree_fit %>%
  vip(geom = "point") +
  theme_bw()
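Beyond variable importance, the fitted tree itself can be drawn; a sketch, assuming the {rpart.plot} package is installed:

# sketch: pull the underlying rpart object out of the fitted workflow and plot the tree
final_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot::rpart.plot(roundint = FALSE) # roundint = FALSE avoids warnings when rpart.plot cannot access the original training data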
Bagged Tree
The third model I am going to build is a bagged tree model where I tune the cost complexity and the minimum number of data points in a node.
# set up the bagged tree model ----
bagged_tree <- bag_tree(
  cost_complexity = tune(),
  min_n = tune()
) %>%
  set_engine("rpart", times = 50) %>%
  set_mode("classification")
Setting up the workflow and grid tuning.
# workflow ----
wf_bag_tune <- workflow() %>%
  add_recipe(tracks_recipe) %>%
  add_model(bagged_tree)

# set up the tuning ----
bag_grid <- grid_regular(cost_complexity(), min_n(), levels = 5)

bag_rs <- tune_grid(
  wf_bag_tune,
  resamples = tracks_cv,
  grid = bag_grid,
  metrics = metric_set(accuracy))
Select the best hyperparameters, finalize the workflow, fit, and predict.
# select hyperparameter ----
#show_best(bag_rs) # showing
select_best(bag_rs) # what we'll input
# A tibble: 1 × 3
cost_complexity min_n .config
<dbl> <int> <chr>
1 0.00000316 11 Preprocessor1_Model08
# finalize the model specification ----
final_bag <- finalize_workflow(wf_bag_tune, select_best(bag_rs))
#final_bag

# final fitting ----
final_bag_fit <- fit(final_bag, data = tracks_train)

# last_fit() fits on the training data and then also evaluates on the testing data
final_bag_result <- last_fit(final_bag, tracks_split)

# testing predictions and metrics ----
bag_data <- predict(final_bag_fit, tracks_test) %>% #get class predictions for the test set
  bind_cols(tracks_test) %>% #bind to the testing columns
  mutate(name = as.factor(name))
#final_bag_result$.predictions

# report final metrics ----
final_bag_result$.metrics
[[1]]
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.7 Preprocessor1_Model1
2 roc_auc binary 0.756 Preprocessor1_Model1
Random Forest
Finally, I am going to be building a random forest model where I tune the number of trees and the number of predictors that will be randomly sampled at each split.
# random forest ----
rf_model <- rand_forest(mtry = tune(),
                        trees = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")
Next, I contain the model and recipe in a workflow and set up the grid tuning.
# workflow ----
rf_workflow <- workflow() %>%
  add_model(rf_model) %>%
  add_recipe(tracks_recipe)

# parameter tuning ----
rf_cv_tune <- rf_workflow %>%
  tune_grid(resamples = cv_folds, grid = 10) #use cross validation to tune mtry and trees parameters
#get metrics from tuning cv to pick best model ----
#collect_metrics(rf_cv_tune)
#plot cv results for parameter tuning ----
autoplot(rf_cv_tune) +
theme_bw()
Finalize the workflow, fit, and predict.
# finalize workflow ----
rf_best <- show_best(rf_cv_tune, n = 1, metric = "roc_auc") #get metrics for best random forest model
rf_best
# A tibble: 1 × 8
mtry trees .metric .estimator mean n std_err .config
<int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 4 1889 roc_auc binary 0.732 10 0.0332 Preprocessor1_Model04
rf_final <- finalize_workflow(rf_workflow,
                              select_best(rf_cv_tune, metric = "roc_auc"))

# model fitting ----
train_fit_rf <- fit(rf_final, tracks_train) #fit the random forest model to the training set
#train_fit_rf

# prediction probabilities ----
test_predict_rf <- predict(train_fit_rf, tracks_test) %>% #get class predictions for the test set
  bind_cols(tracks_test) %>% #bind to the testing columns
  mutate(name = as.factor(name))

test_predict2_rf <- predict(train_fit_rf, tracks_test, type = "prob") %>% #get prediction probabilities for the test set
  bind_cols(tracks_test) %>% #bind to the testing columns
  mutate(name = as.factor(name))
# model metrics and evaluation ----
#accuracy(test_predict_rf, truth = name, estimate = .pred_class) #get accuracy of testing prediction

rf_final_result <- last_fit(rf_final, tracks_split)

# testing predictions and metrics ----
#rf_final_result$.predictions
rf_predict_data <- as.data.frame(rf_final_result$.predictions) %>%
  bind_cols(tracks_test)

rf_final_result$.metrics
[[1]]
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.68 Preprocessor1_Model1
2 roc_auc binary 0.769 Preprocessor1_Model1
# roc auc curve
test_roc_auc_rf <- roc_curve(test_predict2_rf, name, .pred_anna)
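The roc_curve object above is computed but not drawn; a quick sketch of how to visualize it:

# sketch: plot the random forest ROC curve computed above
autoplot(test_roc_auc_rf) +
  theme_bw() +
  labs(title = "Random Forest ROC Curve")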
Comparison of Model Performance
Now that all of the models are built, we can compare their performance. Let’s start by pulling out the metrics from the model results and compare the accuracy of the four models.
# store the final result metrics
knn_metrics <- final_knn_result$.metrics[[1]]
tree_metrics <- final_tree_result$.metrics[[1]]
bag_metrics <- final_bag_result$.metrics[[1]]
rf_metrics <- rf_final_result$.metrics[[1]]

# combine metrics into df for table
comparison_data <- data.frame(
  Model = c("KNN", "Decision Tree", "Bagged Tree", "Random Forest"),
  Accuracy = c(knn_metrics$.estimate[1], tree_metrics$.estimate[1], bag_metrics$.estimate[1], rf_metrics$.estimate[1]),
  ROC_AUC = c(knn_metrics$.estimate[2], tree_metrics$.estimate[2], bag_metrics$.estimate[2], rf_metrics$.estimate[2])
)
# make table
comparison_data %>%
  gt() %>%
  tab_header(
    title = "Model Comparison",
    subtitle = "Accuracy and ROC AUC on Testing Data"
  ) %>%
  fmt_number(
    columns = c(Accuracy, ROC_AUC),
    decimals = 3
  )
Model Comparison
Accuracy and ROC AUC on Testing Data

| Model         | Accuracy | ROC_AUC |
|---------------|----------|---------|
| KNN           | 0.630    | 0.661   |
| Decision Tree | 0.650    | 0.626   |
| Bagged Tree   | 0.700    | 0.756   |
| Random Forest | 0.680    | 0.769   |
Additionally, we can visualize our model comparison and look at the confusion matrices for each model.
# pivot data for plotting
comparison_data_long <- comparison_data %>%
  pivot_longer(cols = c(Accuracy, ROC_AUC), names_to = "Metric", values_to = "Value")
# plotting
ggplot(comparison_data_long, aes(x = Model, y = Value, fill = Metric)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.8), width = 0.7, color = "black") +
labs(title = "Model Comparison",
y = "Metric Value") +
theme_minimal() +
scale_fill_manual(values = c("lightblue", "seagreen"))
# confusion matrix
test_knn_predict %>%
  conf_mat(truth = name, estimate = .pred_class) %>%
  autoplot(type = 'heatmap') +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
  labs(title = "KNN")

test_predict_rf %>%
  conf_mat(truth = name, estimate = .pred_class) %>% #create confusion matrix
  autoplot(type = "heatmap") + #plot confusion matrix with heatmap
  theme_bw() + #change theme
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) + #rotate axis labels
  labs(title = "Random Forest")

bag_data %>%
  conf_mat(truth = name, estimate = .pred_class) %>%
  autoplot(type = 'heatmap') +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
  labs(title = "Bagged Tree")

predict_tree_data %>%
  conf_mat(truth = name, estimate = .pred_class) %>%
  autoplot(type = 'heatmap') +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
  labs(title = "Decision Tree")

# bag_conf + knn_conf + rf_conf + tree_conf
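The commented-out line above hints at combining the four heatmaps into one figure; here is a sketch of how that could look with {patchwork}, using a small hypothetical helper (plot_conf) to avoid repeating the plotting code:

# sketch: combine the four confusion matrices with {patchwork}
library(patchwork)

# hypothetical helper wrapping the repeated conf_mat + autoplot code
plot_conf <- function(pred_df, model_title) {
  pred_df %>%
    conf_mat(truth = name, estimate = .pred_class) %>%
    autoplot(type = "heatmap") +
    theme_bw() +
    labs(title = model_title)
}

knn_conf  <- plot_conf(test_knn_predict, "KNN")
rf_conf   <- plot_conf(test_predict_rf, "Random Forest")
bag_conf  <- plot_conf(bag_data, "Bagged Tree")
tree_conf <- plot_conf(predict_tree_data, "Decision Tree")

bag_conf + knn_conf + rf_conf + tree_conf # patchwork arranges the four plots in a grid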
Thoughts and Conclusions
This was a fun project that got me more familiar with machine learning packages in R and with different modeling methods. Of the four models, the bagged tree had the highest testing accuracy (0.700) and the random forest had the highest ROC AUC (0.769), while the KNN and decision tree models lagged behind. I want to explore the Spotify data more, since there are so many more functions and attributes I haven't explored. Happy listening!
Citation
@online{muir2024,
author = {Muir, Sam},
title = {Who’s on {AUX?} {Using} Machine Learning to Predict Whose
{Spotify} Playlist a Song Belongs To},
date = {2024-04-13},
url = {https://shmuir.github.io/projects/2024-04-13-spotify-predictions/},
langid = {en}
}