(This is part 1 of a series of posts on iTunes randomness. The introduction and a link to all parts can be found here.)
Before jumping into the math, let's clearly define the terms and notations that will be used throughout the post(s).
Let N be the number of songs (all different) in our iTunes playlist.
Let p be the number of songs played in a sequence.
Let k=p-N. k will have an interesting interpretation later on and will be our parameter of interest.
A song order is any sequence of songs.
A song order of length p is any sequence of p songs.
A successful song order is a song order that satisfies our requirement that each song has been played at least once, and which stops as soon as this criteria has been met.
A successful song order of length p is a song order of length p that satisfies our requirement that each song has been played at least once, and the criteria is met when the p-th song is played.
P(p|N) is the probability that with N songs a successful song order will have length p. As mentioned previously, we will later switch to the notation P(k|N) = P(p-N|N) for ease of interpretation.
Example: Suppose we have N=4 songs.
A possible song order of length 7 is: 3-2-2-3-4-1-1. This particular song order is not a successful song order of length 7 as the criteria that each song has been played at least once is met when the 6-th song is played, the elusive song number 1.
A successful song order of length 7 would be: 4-1-1-4-2-1-3.
An obvious but important fact to notice is that in a successful song order of length p, we necessarily have p ≥ N. If we want to listen to N songs each at least once, at least N songs will need to be listened to! Recall that we defined k=p-N, so in successful song orders, k ≥ 0. k can also be interpreted as the number of times a song already listened to is repeated. Taking our previous example: 4-1-1-4-2-1-3, N=4, p=7, and so k=3, which accounts for the two times track 1 and the one time track 4 were re-listened to.
We are assuming that each pull is completely random, that is to say that each ball has an equal probability of 1/N to be pulled, even the one just pulled. Now this assumption is not exactly true in the iTunes world as discussed here, here and here, but is nonetheless a reasonable assumption for our problem.
Some simple R code can generate music-listening simulations, random sampling songs until all have been heard at least once. For each number of song in the playlist ranging from 1 to 1000, 2000 simulations were performed to estimate the average number of total songs played to make the order successful. Results are presented in the following curve with red dots representing simulation results and the blue curve a fitted smooth line.
The curve is extremely smooth, and it very roughly seems like you need to hear about six times as many songs as you have in your playlist to get to listen to each song at least once! The curve hints at some convex curvature which indicates taht as the size of the playlist increases the factor might also increase.
Taking a closer look at the ratio of total songs played over total songs in the playlist, we observe a highly interesting pattern as shown in the next plot.
The curvature observed in the first curve is confirmed with the plot in the second curve, but what is most interesting is whether the ratio is bounded and will reach a final upper bound or whether it will continue to increase to infinity.
Another plot that we can look at but the analysis of which will not be covered in this paper is the distribution of plays for all the songs. In other words, once every song has been heard at least once, how many times has each song been played?
The distribution in the previous plot has some interesting characteristics. For any number of songs N, the minimum is always 1. This is because our song ordering ends when the last elusive song has been heard, so the last song played will have be heard once and only once. The mean and median are very close with the mean slightly higher. This observation makes sense given the distribution is lower-bounded by 1 but has no upper-bound (one song theoretically could be played any number of times before we obtain a successful song ordering). As for the median, the curve is actually the ratio we had observed in our second curve, always increasing but at slower rates and seeming to hover at around 6-7 when we reach 1000 songs in our playlist.
Reverting back to our original question of number of song plays required, the problem seems like a rather straightforward one at first (we have all done our share of blue and red ball pulling from urns at some point in a stat classe). If that is your feeling as well, we strongly urge you to attack the problem with the usual techniques, it will probably help better understand the approaches and mindset detailed in the rest of this paper.