# Spo-tfidf-y

Vector embeddings are a really powerful method in data analytics and “machine learning”. If you can convert some non-quantitative data, like words, into vectors, you can use traditional data analytics methods to extract meaning from your input.

One of the most notable vector embeddings is Google’s word2vec. By mapping words 2 vectors, the smart folks at Google allowed us lowly non-Google employees to do really cool stuff with the English language! If you have some word $w_i$ and want to find a word with a similar meaning $w_s$, you can just look at your vectors and find the word whose vector is closest to $w_i$’s – that word is your similar word, $w_s$! Who needs thesaurus.com when you have word2vec?

Another neat feature is the “analogy” solving feature. A common analogy (I have no idea if that’s actually the correct term for this) is “man is to woman as king is to ___”. If you didn’t get an A in English, you can use word2vec to figure this out for you! The answer $w_{answer}$ is calculated by computing $w_{king} - w_{man} + w_{woman}$. The closest vector to this combination of word vectors turns out to be the vector for queen!
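The analogy arithmetic is easy to sketch with a few made-up vectors. These toy 3-d embeddings are purely illustrative (real word2vec vectors are learned and have hundreds of dimensions), but the nearest-neighbor lookup is the same idea:

```python
import numpy as np

# Toy 3-d "embeddings" -- real word2vec vectors are learned and ~300-d;
# these are made up purely to illustrate the arithmetic.
vecs = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king":  np.array([1.0, 0.0, 0.9]),
    "queen": np.array([1.0, 1.0, 0.9]),
}

def closest(target, skip):
    """Return the word whose vector is nearest to `target` by cosine similarity."""
    best, best_sim = None, -2.0
    for word, v in vecs.items():
        if word in skip:
            continue
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# king - man + woman lands closest to queen
answer = closest(vecs["king"] - vecs["man"] + vecs["woman"],
                 skip={"king", "man", "woman"})
print(answer)  # queen
```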

## What’s with the Title

Ok it was a pretty bad joke. A common “classical” vector embedding method is called tf-idf, which stands for “term frequency - inverse document frequency”. I’ll be analyzing Spotify data in this blog post, so I thought the pun would be cute. Let’s talk through what tf-idf is before we get to the music part.

With tf-idf, we have some set of documents with words in them, and we make the assumption that words in the same document are similar, or at least somewhat related to one another. There are two parts to tf-idf: tf, which stands for term frequency, and idf, which stands for inverse document frequency. The first is the easier of the two. In a document $D_i$, for a given word $w_j$, the term frequency of $w_j$ with respect to $D_i$ is the number of occurrences of $w_j$ in $D_i$, divided by the total number of words in $D_i$. Note: tf is computed for a word and a document. The same word may have a different tf score for different documents.

If our document is “I really like Matt’s blog, it’s really cool”, the tf of “really” is $2 / 8$, because “really” occurs twice, and there are eight total words!

With idf, you want to lower the score of common words. Every document probably has words like “a” and “the” in it, so their term frequency is probably going to be super high compared to words that might be more representative of the document but occur less frequently. If we have $N$ documents and the word $w_j$ appears in $N_{w_j}$ documents, the formula for idf is $log(N / N_{w_j})$. As we can see, if the word appears in a lot of the documents, the denominator will be larger, resulting in a smaller fraction; we take the log of this number, and voila! Note, unlike tf, the idf for a given word is the same across all documents.

If our first document, $D_1$ is “I want a hat” and our second document $D_2$ is “Matt would like a snowboard”, the idf of “a” is $log(2 / 2)$, while the idf of “Matt” is $log(2 / 1)$.

Now, as the name implies, we compute $tf * idf$ to get our tf-idf score for the current word!

If we look at the previous example, where $D_1$ is “I want a hat” and $D_2$ is “Matt would like a snowboard”, we can easily compute the tf-idf of “want”. In the first document, its tf is $1/4$, while in the second, its tf is $0$. The idf of “want” in the context of all of our documents is $log(2 / 1)$. Since the tf-idf score is different for each document, we can represent the tf-idf of the word “want” as a vector [$\text{tf-idf}$(“want”, $D_1$), $\text{tf-idf}$(“want”, $D_2$)] = [$log(2) / 4$, $0$]. Nice! We just found a vector encoding of “want”!
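The whole computation fits in a few lines of Python. Here's a minimal sketch using the two documents above – the helper names are mine, not from any library:

```python
import math

docs = {
    "D1": "I want a hat".lower().split(),
    "D2": "Matt would like a snowboard".lower().split(),
}

def tf(word, doc):
    # term frequency: occurrences of `word` / total words in the document
    return doc.count(word) / len(doc)

def idf(word, docs):
    # inverse document frequency: log(N / number of docs containing `word`)
    n_containing = sum(word in doc for doc in docs.values())
    return math.log(len(docs) / n_containing)

def tfidf_vector(word, docs):
    # one tf-idf entry per document -> the word's vector embedding
    return [tf(word, doc) * idf(word, docs) for doc in docs.values()]

print(tfidf_vector("want", docs))  # [log(2)/4, 0.0], i.e. roughly [0.1733, 0.0]
```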

## Spo-tfidf-y, Revisited

Ok, now that we’re tf-idf experts, we can apply this technique to Spotify songs! Let’s define “words” to be songs, and “documents” to be playlists. We make the assumption here that songs in the same playlist are similar.

I listen to a lot of Kanye, and am pretty familiar with his discography. I wanted to vectorize his songs and then cluster them. This clustering would tell me how Spotify users group his songs, and one could define populist pseudo-albums based on these clusters.

To scrape the playlists, I used Spotify’s super generous API to query for things I felt would be related to Kanye’s music. I ended up scraping $1263$ playlists that contained the words “rap”, “hip hop”, or “pop” in the title; I only considered playlists from these queries that had at least two Kanye songs in them. I figured I could get better results with more specific keywords, but those would bias the playlists they pull from – a query like “early 2010s hip hop” is Bound 2 lean towards Yeezus – so I tried to keep it as vague as possible.

Ignore this paragraph if you don’t care about my technical difficulties. One issue I ran into was re-released albums. There are four instances of The College Dropout on Spotify, along with many more duplicate albums, so I had to manually collect these albums and merge the duplicate songs. Another issue was that the endpoint for retrieving albums by Kanye, which is how I defined “Kanye songs”, would not return all of the duplicate albums. For instance, there are five releases of Kanye West Presents Good Music Cruel Summer, but only one where “Kanye West” is listed as the main artist. Because of this, I just tossed all results for this album – there might be playlists that do have songs from KWPGMCS but are not technically “Kanye songs”, and I had already finished the super painful scraping process. Anyways, onto the cool stuff!

## The Cool Stuff

So finally, I got my really big tf-idf matrix – its dimensions were 126x1263, because there were 126 Kanye songs and 1263 playlists I looked at. Each row was the vector of a song! Let’s call this whole matrix $M_{playlist}$. I also computed a sibling matrix of size 126x9. Spotify has an endpoint to compute “audio features” of a given song, which includes information about the song’s acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, valence, and tempo. You can read more about what these features mean here – there are even pretty little charts on how these features are distributed. We’ll call this sibling matrix $M_{features}$.

I used a couple of things for my data vis. One of the tools I used is called PCA, which uses some fancy math to find lower-dimensional linear combinations of your data. PCA is optimal in terms of explained variance, which means that if you reduce your $n$-dimensional vectors down to $m$ dimensions, the $m$ components PCA picks encode your data while retaining the most information!

I also used a pretty simple clustering method called KMeans that groups your data into $k$ distinct groups, or clusters. Here, I set $k$ to be 9, as Kanye has 9 albums (give or take). KMeans takes its $k$ cluster centers, finds the points that are closer to a given center than to any other center, and moves that center to the average of those points. This repeats until the centers converge and stop moving.
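Both steps are one-liners with scikit-learn. This sketch uses a random stand-in for the real tf-idf matrix, and assumes row-wise L2 normalization as the normalization step (the post doesn't specify which normalization was used):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Stand-in for the real 126x1263 song-by-playlist tf-idf matrix;
# random data here just to show the pipeline shape.
rng = np.random.default_rng(0)
M_playlist = rng.random((126, 1263))

M_norm = normalize(M_playlist)                      # row-normalize each song vector
M_pca = PCA(n_components=9).fit_transform(M_norm)   # reduce 1263 dims -> 9
labels = KMeans(n_clusters=9, n_init=10, random_state=0).fit_predict(M_pca)

print(M_pca.shape, labels.shape)  # (126, 9) (126,)
```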

I’ll explain each of the following graphs below. Before clustering anything, I normalized the input matrix.

1. For my first experiment, I simply clustered the rows of my $M_{playlist}$ matrix and plotted the results in this cool d3 thing I Frankensteined
2. For my second experiment, I appended $M_{features}$ to the end of $M_{playlist}$, clustered that, and plotted it again
3. I computed a matrix $M_{PCA playlist}$, which is the result of running PCA on $M_{playlist}$, keeping 9 components. Then I clustered $M_{PCA playlist}$ and plotted it
4. I appended $M_{features}$ to the end of $M_{PCA playlist}$, clustered that, and plotted it one last time
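Experiments 2 and 4 both boil down to gluing the audio-feature columns onto the playlist (or PCA’d playlist) vectors before clustering. With numpy, that’s just a horizontal stack – placeholder matrices here, with the shapes from the post:

```python
import numpy as np

# Placeholder matrices with the shapes from the post:
# 126 songs x 1263 playlists, and 126 songs x 9 audio features.
M_playlist = np.zeros((126, 1263))
M_features = np.zeros((126, 9))

# Append the audio features to the end of each song's playlist vector.
M_combined = np.hstack([M_playlist, M_features])
print(M_combined.shape)  # (126, 1272)
```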

I actually had two other experiments where I ran sklearn’s feature agglomeration instead of PCA, but the results, specifically after clustering, were consistently similar.

Please take an intermission from this long af blog post by looking at these Gorgeous graphs.

## Pretty Graphs

To reiterate, these are clusterings of Kanye’s albums based, more or less, on how often songs are grouped together in the same playlist!

## Notes

One interesting but expected grouping is the clustering of skits. Cluster #5 from the first graph is made up of a lot of skits, along with some lesser known tracks, and cluster #5 in the second graph is purely skits, with no actual songs. The reason the second graph might have been better at finding the skit pseudo-album is that it considers factors like “speechiness” and “instrumentalness” – skits typically score high on the former and low on the latter, which would help separate them from their non-skit counterparts. The PCA results also follow the trend of having a skit-based grouping.

Another common trend is clusterings of more mainstream Kanye songs – in the first and last graphs, Stronger, POWER, Heartless, and Gold Digger end up in the same clusters. This makes sense intuitively – people that listen to more mainstream Kanye probably have playlists with the mainstream songs, which leads to closer tf-idf vectors.

There’s a really heavy grouping based on album in all of the graphs. This is expected; people make playlists in the moment and don’t always update them years later when new albums come out.

In general, I think it’s worth noting that the album clusters are pretty sequential – usually The College Dropout isn’t getting clustered with Kids See Ghosts, but rather with Late Registration, while MBDTF and Yeezus are often grouped together. Poetically, 808s & Heartbreak finds itself alone a lot of the time, which is pretty Bad News.

Scroll through the graphs and see what connections you come up with!

## More Results

As a side note, I also sorted songs by their tf-idf vector magnitude – theoretically, songs with higher tf-idf magnitudes are more “distinctive”. If we cared about popularity, we could just look at how many playlists each song is in, but that’s less interesting. Here are Kanye’s top 10 “distinctive” songs!
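The ranking itself is just a norm and a sort. Here's a sketch with a placeholder matrix and hypothetical song names:

```python
import numpy as np

# Placeholder tf-idf matrix: one row per song (126 songs x 1263 playlists);
# the song names are hypothetical stand-ins.
rng = np.random.default_rng(0)
M_playlist = rng.random((126, 1263))
song_names = [f"song_{i}" for i in range(126)]

# "Distinctiveness" = L2 norm of a song's tf-idf vector; sort descending.
magnitudes = np.linalg.norm(M_playlist, axis=1)
ranking = np.argsort(magnitudes)[::-1]
top10 = [song_names[i] for i in ranking[:10]]
print(top10)
```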

Interestingly, 4/10 of the songs are from MBDTF, which left me So Appalled. Old Kanye was mostly absent – The College Dropout had one song, while Late Registration and Graduation were missing. There was no Yeezus or TLOP either, which is strange since those albums were pretty Famous.

If we only rank songs based on how many playlists they’re in, we get the following results.

1. Stronger - 336 occurrences
2. Gold Digger - 333 occurrences
3. POWER - 296 occurrences
4. All Of The Lights - 189 occurrences
5. Can’t Tell Me Nothing - 186 occurrences
6. Good Life - 160 occurrences
7. I Love It (& Lil Pump) - 159 occurrences
8. Black Skinhead - 159 occurrences
9. Father Stretch My Hands Pt. 1 - 157 occurrences
10. Flashing Lights - 143 occurrences

Unlike the previous list, we have a lot of Graduation in this one, and Late Registration, Yeezus, and TLOP are present as well. However, there is no 808s & Heartbreak, which is Amazing.

Many people consider Stronger, Gold Digger, and POWER to be a lot more mainstream, and this data confirms it. These songs each occur in around 300 playlists, while the 4th place song, All Of The Lights, appears in only 189 – a big drop.

## “Transorthogonal Linguistics”

I used to go to these meetups in DC called Hack and Tell where people showed off cool things they made. One person I met, Travis, worked on a project called Transorthogonal Linguistics, which takes two unit-vector embeddings of words in n-dimensional space, draws a “line” between them, traverses that line a fixed distance at a time, and projects each point on that line back onto the n-dimensional unit sphere. This approximation of traversing the surface of the sphere is a lot easier than actually tracing it – while pretty easy in 2D, tracing the path between two points on a sphere gets nontrivial pretty quickly as the dimension increases.

So what? What’s the use here? In a typical word embedding, we can use this technique to find a “path” between words, where the path represents words that slowly morph in meaning from the start word to the end word. Travis made a repo to let you experiment with this. Using sun as a start and moon as an end, you might get something like sun -> sunlight -> glow -> skies -> shadows -> horizon -> earth -> constellations -> Jupiter -> moons -> moon
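Here's a toy re-implementation of the idea (not Travis's actual code): step along the chord between two unit vectors, re-project each point back onto the unit sphere, and report the nearest embedding at each step. The random "song" embeddings are hypothetical:

```python
import numpy as np

def sphere_walk(start, end, vecs, names, steps=10):
    """Walk from `start` to `end` along the chord between two unit vectors,
    re-projecting each step onto the unit sphere and reporting the nearest
    embedding. Toy sketch of the transorthogonal-linguistics idea."""
    start = start / np.linalg.norm(start)
    end = end / np.linalg.norm(end)
    path = []
    for t in np.linspace(0.0, 1.0, steps):
        point = (1 - t) * start + t * end
        point = point / np.linalg.norm(point)          # back onto the sphere
        nearest = names[int(np.argmax(vecs @ point))]  # cosine-nearest, since
        if not path or path[-1] != nearest:            # rows of vecs are unit
            path.append(nearest)                       # (skip consecutive dupes)
    return path

# Hypothetical unit embeddings for 20 "songs".
rng = np.random.default_rng(1)
vecs = rng.normal(size=(20, 5))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
names = [f"song_{i}" for i in range(20)]
print(sphere_walk(vecs[0], vecs[7], vecs, names))
```

The path always begins at the start song and ends at the end song, with whatever embeddings lie nearest the chord in between.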

I tried applying the same technique to Kanye songs! When using the original vectors alone, the results were pretty bad, mostly due to how sparse the matrices were. When I put in a start song and an end song, the projections were often closest to the start and end vectors themselves, giving us a trivial path that simply contained our start and end songs. However, after applying some PCA to make the feature matrix a lot denser, there were some interesting results!

When doing Love Lockdown to Lost In The World, the two intermediate songs were See You In My Nightmares and Paranoid, which showed a big skew to 808s.

I also wanted to know what happened if I took one of Kanye’s oldest songs, All Falls Down, and walked to one of his most recent singles, I Love It. This path went through The Glory, Dark Fantasy, Black Skinhead, I Thought About Killing You, and No Mistakes in that order. I think it’s neat that the walk from old Kanye to new Kanye went chronologically through his discography.

## Correlating Playlist Presence and Audio Features

I had one last experiment I wanted to try, which was to see if these vectors, constructed from user playlists, correlated to Spotify’s audio features. I actually found some high correlations! I reduced the dimensionality of $M_{playlist}$ using PCA and feature agglomeration, with reduced dimensionalities of 81, 27, 9, 3, and 1. Feature agglomeration produced higher maximum correlations, except in the case where the data was reduced to a single column.

I looked at the correlations between a given column and a given audio feature. That means that, in the case of a single-column reduction, the correlations for the audio features will be unique – there will be only one correlation for speechiness. With higher dimensional reductions, however, we can have repeats: one column might have a high correlation for speechiness, and another column might also have a high speechiness correlation.

Also note: finding these correlations for the original matrix is not particularly interesting, as each of its columns represents a single playlist. A high correlation between a playlist column and an audio feature column doesn’t tell us much about the trends of Spotify users as a whole – it just tells us that there’s some playlist out there that correlates with high-tempo music, which happens all the time. Also, before being reduced in dimensionality, the playlist columns are pretty sparse.

Last note: the absolute value of the correlation represents how predictive the feature is, and the sign of the number encodes whether the correlation is positive or negative.
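A sketch of how numbers like these could be computed, using placeholder data in the shapes from the post and `np.corrcoef` for Pearson's r (with random placeholder data, the printed values will of course differ from the real results):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data in the shapes from the post; the real inputs would be
# the tf-idf matrix and Spotify's audio-feature matrix.
rng = np.random.default_rng(0)
M_playlist = rng.random((126, 1263))
M_features = rng.random((126, 9))
feature_names = ["acousticness", "danceability", "energy", "instrumentalness",
                 "liveness", "loudness", "speechiness", "valence", "tempo"]

reduced = PCA(n_components=3).fit_transform(M_playlist)

# Pearson r between every reduced column and every audio feature;
# |r| is the strength, the sign is the direction.
correlations = []
for i in range(reduced.shape[1]):
    for j, name in enumerate(feature_names):
        r = np.corrcoef(reduced[:, i], M_features[:, j])[0, 1]
        correlations.append((abs(r), r, i, name))

# Report the five strongest correlations, strongest first.
for _, r, i, name in sorted(correlations, reverse=True)[:5]:
    print(f"component {i}: r = {r:.2f} for {name}")
```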

Evaluating dimensionality reduction of size 1 using PCA

1. r = 0.69 for speechiness
2. r = -0.56 for loudness
3. r = 0.43 for valence
4. r = 0.4 for acousticness
5. r = -0.38 for energy

Evaluating dimensionality reduction of size 1 using Feature Agglomeration

1. r = 0.41 for energy
2. r = 0.4 for loudness
3. r = -0.39 for acousticness
4. r = -0.29 for speechiness
5. r = -0.14 for valence

Evaluating dimensionality reduction of size 3 using PCA

1. r = 0.69 for speechiness
2. r = -0.56 for loudness
3. r = 0.43 for valence
4. r = 0.4 for acousticness
5. r = 0.39 for energy

Evaluating dimensionality reduction of size 3 using Feature Agglomeration

1. r = 0.72 for speechiness
2. r = -0.57 for loudness
3. r = 0.41 for valence
4. r = -0.41 for energy
5. r = 0.4 for loudness

Evaluating dimensionality reduction of size 9 using PCA

1. r = 0.69 for speechiness
2. r = -0.56 for loudness
3. r = 0.43 for valence
4. r = 0.4 for acousticness
5. r = 0.39 for energy

Evaluating dimensionality reduction of size 9 using Feature Agglomeration

1. r = 0.72 for speechiness
2. r = -0.57 for loudness
3. r = 0.42 for energy
4. r = 0.41 for valence
5. r = -0.41 for energy

Evaluating dimensionality reduction of size 27 using PCA

1. r = 0.69 for speechiness
2. r = -0.56 for loudness
3. r = 0.43 for valence
4. r = 0.4 for acousticness
5. r = 0.39 for energy

Evaluating dimensionality reduction of size 27 using Feature Agglomeration

1. r = 0.72 for speechiness
2. r = -0.57 for loudness
3. r = 0.41 for valence
4. r = -0.41 for energy
5. r = 0.4 for energy

Evaluating dimensionality reduction of size 81 using PCA

1. r = 0.69 for speechiness
2. r = -0.56 for loudness
3. r = 0.43 for valence
4. r = 0.4 for acousticness
5. r = 0.39 for energy

Evaluating dimensionality reduction of size 81 using Feature Agglomeration

1. r = 0.72 for speechiness
2. r = 0.58 for instrumentalness
3. r = -0.57 for loudness
4. r = 0.52 for instrumentalness
5. r = 0.41 for instrumentalness

I think this is neat! We can see in some of these scores that humans choose songs in their playlists – at least, Kanye songs – based on speechiness, loudness, and instrumentalness, among other things.

As a side note, I also tried calculating these same correlations for the pure frequency matrix, instead of the tf-idf matrix, and it turns out that the tf-idf matrix consistently had higher correlation values, although they were only marginally larger. That means that, at least for estimating audio features of songs given the song’s vectorized form, the tf-idf vector is marginally more predictive.

Hopefully this has been a little enlightening! I have the code for these experiments in a very messy jupyter notebook if you wanna dig through it. Hopefully there were No Mistakes