The Statisticator: July 2012

Saturday, July 28, 2012

Improving odometer granularity

I used to have a very short commute by car but never gave it much thought. However, when my insurance company asked me to estimate my yearly commute, I was backed into a corner. It was only a few miles, but was it 2 or 3? Both numbers are small, but relative to each other that's a 50% difference!

And when multiplying by two trips each day and about 250 commuting days, that's quite a difference in total miles: roughly 1000 versus 1500 miles per year. I doubt that the rebate would have been much larger for 1000 commuting miles than for 1500, but it was too late: I needed to know.

The problem was not as simple as checking the odometer at the beginning and end of the trip, end of story. The issue was the odometer's granularity: it only showed whole miles. Watching the odometer more carefully over a few days showed that on some days the mile wheel would turn twice, on others three times. So what was my commute?

Intuitively it seemed like if I kept track of how many times the mile wheel turned each time and then averaged that out I could get an estimate with as much precision as I wanted. But was this intuition correct?

Well, let's say my odometer is at O + x miles, where O is the truncated mileage, and x is the fraction of mile started (x in [0, 1[). Looking at my odometer I only see O, but after (1-x) miles I will see O+1.
Let us also say that my commute is C + y where C is the truncated mileage and y the extra fraction mileage (y in [0, 1[).

So I originally see O miles on my odometer, and after my commute I will be at O + x + C + y miles, and the odometer will show:

O + C if (x + y) < 1
O + C + 1 if (x + y) >= 1

We can assume x to be a uniform random variable on [0, 1[. y, however, is fixed and part of the commute.

So my odometer will indicate a commute of C + 1 when x >= 1 - y, which happens with probability y, and C otherwise, with probability 1 - y.

Therefore, by averaging my estimated daily commutes I will get an expected value of:

y * (C + 1) + (1 - y) * C = C + y, which is the true value I wanted to get to.

So averaging the bi-modal values of C and C + 1 will provide me an unbiased estimate of my true daily commute.
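
The claim that the average of the whole-mile readings converges to C + y can be checked with a quick simulation. This is only a sketch under the same assumption as above, namely that the starting fraction x is drawn independently and uniformly on [0, 1[ for every trip. The function names and the 2.4-mile commute are made up for illustration.

```python
import math
import random


def observed_commute(start, commute):
    """Whole miles the odometer display advances over one trip.

    The display only shows truncated miles, so the reading is the
    difference of the truncated values after and before the trip.
    """
    return math.floor(start + commute) - math.floor(start)


def estimate_commute(commute, trips, rng):
    """Average the observed whole-mile readings over many trips."""
    total = 0
    for _ in range(trips):
        x = rng.random()  # fraction of a mile already started, uniform on [0, 1)
        total += observed_commute(x, commute)
    return total / trips


rng = random.Random(2012)
true_commute = 2.4  # hypothetical commute: C = 2, y = 0.4
estimate = estimate_commute(true_commute, 100_000, rng)
print(round(estimate, 2))
```

Each individual reading is either 2 or 3, yet the average settles close to 2.4. With only a handful of trips the estimate is still rough, so the precision really does come from the number of days tracked.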

Of course, I could also have driven back and forth for an entire weekend and divided the total mileage by the number of trips...

But it's not like I have time to lose ;-)

Friday, July 20, 2012

Can the lottery code be cracked?

I just came across this article about a Canadian who broke the Scratch Lottery Code.

The article is about Mohan Srivastava, who came across two unscratched Tic-Tac-Toe tickets near his desk. Not a big lottery fan but with nothing else to do, he scratched them: he lost on the first and won $3 on the second. He went to cash in his winnings at the nearby gas station, and all the while he was thinking about how these tickets are created. Because the lottery corporation needs to keep careful track of how many winning tickets get printed, the computers can't just print random numbers onto each card. Winning and losing tickets need to be carefully constructed while giving the illusion of being completely random.

The more he thought about it, the more he became convinced there could be a way to determine whether a ticket was a winning ticket or not without having to scratch it. Here's what a Tic-Tac-Toe ticket looks like:

You have eight 3-by-3 grids with visible numbers. You then scratch the 24 numbers on the left. If any 3 of the 24 numbers are lined up in any direction in any of the 8 grids, you've won!

Well, Mohan bought many tickets, scratched them all, and ultimately found a relatively simple rule to determine ahead of time whether a lottery ticket is a winning ticket or not.

The trick? On your lottery ticket you have 72 numbers visible in the grids (8 * 3 * 3). Because the numbers only go up to 39, some will have to occur multiple times. Mohan kept track, for each number, of how many times it occurred on the ticket:

He was especially interested in singletons that appeared only once. His finding was that if three singletons lined up, then you had a winning ticket!
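
The singleton rule as described here can be sketched in a few lines. This is an illustration of the rule as stated in this post, not Mohan's actual procedure, and the two small grids in the example are invented (a real ticket has eight grids).

```python
from collections import Counter


def lines_of(grid):
    """Yield the 8 lines (3 rows, 3 columns, 2 diagonals) of a 3x3 grid."""
    for i in range(3):
        yield [grid[i][0], grid[i][1], grid[i][2]]  # row i
        yield [grid[0][i], grid[1][i], grid[2][i]]  # column i
    yield [grid[0][0], grid[1][1], grid[2][2]]  # main diagonal
    yield [grid[0][2], grid[1][1], grid[2][0]]  # anti-diagonal


def has_singleton_line(grids):
    """True if some line holds three numbers that each appear once on the ticket."""
    counts = Counter(n for grid in grids for row in grid for n in row)
    return any(
        all(counts[n] == 1 for n in line)
        for grid in grids
        for line in lines_of(grid)
    )


# Invented example: 1, 2 and 3 each appear exactly once and share a row.
grid_a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
grid_b = [[4, 5, 6], [7, 8, 9], [4, 5, 6]]
print(has_singleton_line([grid_a, grid_b]))
```

Everything here uses only the visible numbers, which is the whole point: nothing needs to be scratched to run the check.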

I am not 100% sure why that is or how Mohan came to this conclusion, but intuitively I think the algorithm works something like this (at a very high level):

randomly separate numbers from 1 to 39 into two sets, 24 singletons and the 15 duplicates
fill in the grids with duplicate numbers, and with at most 2 singleton numbers
list the 24 singleton numbers on the left hand side

This guarantees that the ticket is a losing one. You can turn it into a winning one by placing three singletons in a row, column or diagonal of any one grid. This guarantees that the ticket is a winning one with exactly one prize.
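
The guessed procedure above can be sketched as follows. This is only my reading of it, not the lottery's real algorithm: in particular, "at most 2 singleton numbers" is interpreted as at most two per grid, and the names make_ticket and winning_lines are made up for this sketch.

```python
import random

# Index triples for the 8 lines of a 3x3 grid stored as a flat list of 9 cells.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def make_ticket(rng, winning):
    """Build one ticket following the guessed procedure (not the real algorithm)."""
    numbers = list(range(1, 40))
    rng.shuffle(numbers)
    singletons, duplicates = numbers[:24], numbers[24:]
    unused = list(singletons)  # each singleton is placed at most once
    win_grid = rng.randrange(8) if winning else None
    grids = []
    for g in range(8):
        cells = [rng.choice(duplicates) for _ in range(9)]  # background noise
        if g == win_grid:
            positions = rng.choice(LINES)  # one full line of singletons
        else:
            # Assumption: at most 2 singletons per grid, so no line can be completed.
            positions = rng.sample(range(9), rng.randint(0, 2))
        for pos in positions:
            cells[pos] = unused.pop()
        grids.append(cells)
    return sorted(singletons), grids  # the 24 numbers to reveal, and the 8 grids


def winning_lines(revealed, grids):
    """Count the lines made up entirely of revealed numbers."""
    revealed = set(revealed)
    return sum(all(cells[i] in revealed for i in line)
               for cells in grids for line in LINES)


rng = random.Random(39)
revealed, grids = make_ticket(rng, winning=True)
print(winning_lines(revealed, grids))
```

By construction a losing ticket has no complete line and a winning ticket has exactly one, so no ticket ever needs to be tested after it is generated.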

This is overly simplistic, but it does allow a simple way to generate the tickets. Creating random grids and then testing whether they are winners or losers before printing them would take too much time; the above process creates them in one shot without testing.

In the example given in the article we can see that the 24 numbers listed on the left appear a total of 35 times in the grids (average 1.46), whereas the other 15 numbers not listed appear 37 times (average 2.47). This hints that the duplicates are indeed something like background noise, and that the 24 numbers selected on the left are actually much less likely to occur in the grids. And this is where we are fooled: if we reveal 24 numbers, we should find a lot more of them in grids made out of 39 different numbers, shouldn't we?

The article also talks about the lottery industry in general and how the mob uses the lottery for money laundering.

And as to why Mohan revealed the trick instead of making a fortune? He simply calculated that if he were to spend all his time identifying winning tickets in each store across town, scratching them, and redeeming them, he would make less than at his current full-time job!

All in all, a rather interesting read, although I already gave away the ending...

Tuesday, July 17, 2012

Media Noise Unit

There was a recent French article in Les Echos on a new concept aimed at measuring media intensity on news events: Media Noise Unit (unité de bruit médiatique in French).

The article is in French so I'll only summarize the gist of it. Three doctoral students devised this new metric, which measures the intensity of an event across all media types (TV, print, radio) and aggregates it into a single value based on audience reach. So on a given day the French elections might have an overall intensity of 412, for instance. The number in itself isn't important; what matters is how it compares to other events that same day and over past periods.

The main finding from a wider analysis based on this new metric is easily summarized: modern media tend to spend more and more time on fewer and fewer topics, which the researchers labeled "media craze". On a given day, only a handful of events will have a very disproportionate intensity compared to the other topics, and the trend has sharpened over the years. The top "noise-makers" today make twice as much noise as the top ones from 5 years ago.

Noise units went through the roof in 2011

The article goes on to describe the paradox behind this: in a world where access to information is getting wider and richer, we are being saturated with just a few events. However, the article notes, humans have been shown to remember only up to three news events in a day, two of which they will forget within the next 24 hours...

I found the article very interesting from an analytical point of view: I am especially fond of techniques that bridge the qualitative-quantitative gap. Being able to measure all the activity around a given topic seems like a very difficult task, but it enables many interesting insights to be derived afterwards. Comparing the intensity of the hottest topics over time is pioneering work!

But I am also a little skeptical about this technique and metric, which seem almost too good to be true. Even a human would have a hard time labeling an article as being part of a certain topic or not, so how well can this task be automated? During the French elections, were all articles in which one of the candidates' names appeared labeled as part of "French elections"? But what about routine actions the government took at that time, or visits abroad by the president? And certain events will trigger other, more general analyses and essays that apparently have little to do with the original event. Are these counted or not?

I also saw no mention of online media, which is a growing source of information. Restricting the analysis to TV, print and radio might introduce some selection bias.

Anyway, even if the metric is not perfect, there are some interesting insights. I would actually be very curious to know whether these insights hold outside of France...