Tuesday, May 21, 2019

Solving Mathematical Challenges with R: #1 Palindromes

This post is the first challenge from what will hopefully be a long series!

You can find more details on where these challenges come from and how/why we will be approaching them in the original posting.

Challenge #1

Let's jump right in with the first challenge! You can find the original presentation of the challenge (in French!) here.

It goes as follows: a number is said to be a palindrome if the digits read left to right are the same as if read right to left. An example would be 151, or 70355307. The first question is simply to determine the number of palindrome numbers that have 351 digits. As a bonus question: what is the minimal difference between two palindrome numbers with 351 digits?

Naturally the first step is to get a better understanding of what we are dealing with, independently of how we want to tackle the problem.

  • N = 1: every single digit 1 through 9 is a palindrome, so 9 palindromes.
  • N = 2: only 11, 22, ..., 99 qualify, so 9 palindromes again.
  • N = 3: the first and last digits must match (1 to 9) and the middle digit is free (0 to 9), so 90 palindromes.

Definitely seems like a small pattern is emerging! If we were to continue with the mathematical approach we would look at N = 4, possibly N = 5 to better understand the dynamics, and start separating the cases of N even and N odd. Let's take the R route for now.

How can we generate/identify all palindromes with N digits?

One first approach is to generate all numbers of N digits, flip them, and keep those where the flip left the number unchanged:

PalindromesSlow <- function(n) {
  # all numbers with exactly n digits
  numbers <- seq(10 ^ (n - 1), 10 ^ n - 1)
  # reverse the digits of each number
  flipped <- as.numeric(sapply(
    lapply(strsplit(as.character(numbers), split = ""), rev),
    paste0, collapse = ""))
  # keep the numbers left unchanged by the flip
  palindromes <- numbers[numbers == flipped]
  return(palindromes)
}

Why the “Slow” in the name? While this function performs exactly as expected, it won’t scale well once we start looking at numbers 7 digits long and larger. The main reason is that we generate every single candidate number and only keep an ever-decreasing fraction of them, those meeting our palindrome criterion. Why ever-decreasing? Let’s look at the trend for our first values of N:
  • N = 1: 9 palindromes, 9 total numbers: 100%
  • N = 2: 9 palindromes, 90 total numbers: 10%
  • N = 3: 90 palindromes, 900 total numbers: 10%
  • N = 4: 90 palindromes, 9000 total numbers: 1%
Because palindromes are fully determined by the first half of their digits, we shouldn’t generate all possible numbers with N digits, but only all possible first halves, then mirror them to generate the second half. For example, if we want to generate all 6-digit palindromes, we first generate all combinations of 3 digits XYZ, then flip and append to get XYZZYX. For an odd number of digits, it’s only slightly trickier: the middle element acts as the mirror and has no constraint. For 7 digits we would generate the first 4 digits, but flip only the first 3:
XYZT --> XYZTZYX.

# n is the number of digits we want (e.g. n <- 6)
length.firsthalf <- ceiling(n / 2)
digit.list <- rep(list(0:9), length.firsthalf)
# remove 0 as option for first digit
digit.list[[1]] <- 1:9
# expand grid: all possible first halves
eg <- expand.grid(digit.list)

A quick word on the expand.grid() function which has saved me on multiple occasions. It essentially generates all possible cross-combinations of arguments provided. Easier to see through a simple example:

expand.grid(c(1, 2, 3), c("a", "b"))

  Var1 Var2
1    1    a
2    2    a
3    3    a
4    1    b
5    2    b
6    3    b

For our PalindromesQuick function it is incredibly valuable as we want something generic enough to adjust to any n. We create a list of length ceiling(n / 2), one element per position in the first half, with all possible digits at each position, stick that into expand.grid(), and voila! No need for complicated nested for loops, especially as we wouldn't know how many for loops to nest in the first place!

We can then wrap that in a function returning all palindromes with n digits, from which we can determine how many there are as well as the minimal distance between consecutive palindromes. The only trick is carefully handling n being even or odd, as discussed previously.

PalindromesQuick <- function(n) {
  length.firsthalf <- ceiling(n / 2)
  digit.list <- rep(list(0:9), length.firsthalf)
  # remove 0 as option for first digit
  digit.list[[1]] <- 1:9
  # all possible first halves
  eg <- expand.grid(digit.list)
  if (n %% 2 == 0) {
    # even n: mirror the entire first half
    flipped.ending <- apply(eg, 1, function(row) {
        paste0(rev(row), collapse = "")
      })
  } else {
    # odd n: the middle digit acts as the mirror and is not repeated
    flipped.ending <- apply(eg, 1, function(row) {
        paste0(rev(row[-length(row)]), collapse = "")
      })
  }
  eg$ending <- flipped.ending
  # stitch the first half and its mirrored ending together
  palindromes <- sort(as.numeric(
                   apply(eg, 1, paste0, collapse = "")))
  return(palindromes)
}
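
For instance, a small helper along the following lines (my own sketch, not part of the original challenge) can summarise the output of PalindromesQuick(): how many palindromes there are and the smallest gap between consecutive ones, which is exactly what the Results table below reports.

SummarisePalindromes <- function(n) {
  # count the palindromes and compute the minimal gap between consecutive ones
  p <- PalindromesQuick(n)
  c(n = n, n.palindromes = length(p), min.difference = min(diff(p)))
}
t(sapply(1:6, SummarisePalindromes))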

Having written both a slow and a quick function, we should pause to think about how much faster the quick one really is, and talk about timing our R functions.

Timing in R

When I started using R, my trick for timing functions was simply to record the absolute time before and after execution and take the difference:

t1 <- Sys.time()
output.slow.7 <- PalindromesSlow(7)
t2 <- Sys.time()
t2 - t1
1.174394 mins

I later discovered it could all be done in one go with the built-in system.time() function:
system.time(output.slow.7 <- PalindromesSlow(7))
   user  system elapsed 
 68.348   0.544  69.016 

They each have small caveats in how they track elapsed time, but for most timing purposes both return exactly what you care about. system.time() has the advantage of being a one-liner, while the Sys.time() approach is easier to use when you want to time several lines at once, and also easier to comment in and out of the code.

From now on we'll focus solely on the Quick version; a 600X improvement is definitely something we'll take!
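
For reference, the same one-liner can be used to time the quick version on your own machine (exact timings, and hence the exact speed-up factor, will vary):

system.time(output.quick.7 <- PalindromesQuick(7))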

A quick comment on efficiency and speed: it's always important to weigh the pros and cons. While a faster function always seems preferable on paper, one must also consider whether the speed-up is needed in the first place, as well as the additional complexity it introduces. Here the quick version was only a few lines longer and simple to understand, and the gain in speed will allow us to explore 14-digit numbers in roughly the time the slow version took to explore 7-digit ones. This is not always the case: we'll see in challenge #3 that trying to be efficient is not always straightforward to implement and scale, and the gain may still be insufficient to explore the higher dimensions.

Results

Using our PalindromesQuick function, let's look at the number of palindromes and minimal distance between consecutive palindromes:

 N   Number of Palindromes   Minimum Difference
 1                       9                    1
 2                       9                   11
 3                      90                   10
 4                      90                   11
 5                     900                   11
 6                     900                   11
 7                    9000                   11
 8                    9000                   11
 9                   90000                   11
10                   90000                   11
11                  900000                   11
12                  900000                   11
13                 9000000                   11
14                 9000000                   11

Although we haven’t provided a formal mathematical proof, I think we at least have a good idea of the challenge's answers.

If n is even, we have as many palindromes as there are combinations of n / 2 digits with the first one not being a 0, that’s 9 * 10 ^ (n/2 - 1).
If n is odd, we have as many palindromes as there are combinations of (n + 1) / 2 digits with the first one not being a 0, that’s 9 * 10 ^ ((n + 1) / 2 - 1).
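
Both cases can be written compactly as 9 * 10 ^ (ceiling(n / 2) - 1), which we can sanity-check against the enumeration for small n (a quick check assuming PalindromesQuick() from above):

NumPalindromes <- function(n) 9 * 10 ^ (ceiling(n / 2) - 1)
# compare the closed-form count to the brute enumeration for n = 1..8
all(sapply(1:8, function(n) length(PalindromesQuick(n)) == NumPalindromes(n)))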

Minimal difference is somewhat trickier...




Based on the simulations, the minimal difference fluctuates a bit at first, but definitely seems to stabilize quickly at 11 for numbers with 4 or more digits.

It’s an interesting problem: if we want to make a small change to an existing palindrome to reach the next one, we would typically want to make small changes to the right-hand side of the number (units, tens...). But because of the palindromic nature of our numbers, any change on the right forces the same change on the far left, and thus a huge delta between our two numbers. The obvious option is to aim for the middle, but that’s still an absolute change of the order of 10 ^ (n / 2).

If we look at examples with small n, we see we can do better when we have nine(s) in the middle: 393 and 404, or 5995 and 6006. This is basically our initial trick of incrementing the “middle”, except that after 9 we wrap around to 0 and need to increment the digits immediately to its right and left: 37973 increments to 38983 (a delta of 1010), but 39993 increments to 40004 (a delta of 11).
So we've reached the point where we've proven the minimal delta is no bigger than 11, since X99...99X (with X any digit between 1 and 8) and (X+1)00...00(X+1) differ by exactly 11. There will be 8 such instances, no matter the number of digits in the palindrome. Proving that 11 is actually the best we can do is a much tougher question.
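
As a quick numeric illustration of that jump (with an arbitrary choice of n and X):

n <- 7  # number of digits
x <- 3  # any digit between 1 and 8
lower <- as.numeric(paste0(x, strrep("9", n - 2), x))          # 3999993
upper <- as.numeric(paste0(x + 1, strrep("0", n - 2), x + 1))  # 4000004
upper - lower  # 11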

Back to the challenge, we can quite confidently answer the first question: there are 9e175 (that is, 9 * 10 ^ 175) palindromes with 351 digits, and the minimal difference between two of them is 11.

Before we sign off, let’s finish with a small plot of the distribution of differences between consecutive palindromes. Given our previous comments, and at a very high level, we should expect a big spike around 10 ^ (n / 2), then a spike about one-tenth of that at 10 ^ (n / 2 - 1) whenever the middle digit is a 9, then a spike about one-hundredth of the first at 10 ^ (n / 2 - 2) whenever the middle digits are 999, and so on. And, as noted previously, 8 cases with a delta of 11.
Turns out we weren’t too far off!

For N = 13:

For N = 14:

For the graphs, despite the numerical nature of the x-axis, we naturally went for barplots instead of histograms given the very peculiar discrete distribution!
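
For those wanting to reproduce something similar, here is one way such a barplot could be generated (a sketch, not necessarily the exact code behind the figures above; generating all 13-digit palindromes takes a little while):

p13 <- PalindromesQuick(13)
# tabulate the gaps between consecutive palindromes and plot them
gap.counts <- table(diff(p13))
barplot(gap.counts, las = 2,
        xlab = "Difference between consecutive palindromes",
        ylab = "Count")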







Solving Mathematical Challenges with R

In a new series of posts, I will explore how the free statistical tool R can be used to answer mathematical challenges.

Le Monde is a French daily newspaper with one of the highest circulation numbers. There used to be a small section at the very end containing a small intellectual puzzle that varied by day of the week: a crossword, a Sudoku, a math challenge...
I haven't lived in France for quite some time, but checking online recently, it appears they have slightly changed the format, adding mathematical challenges presented by Cedric Villani, a French mathematician and recipient of the 2010 Fields Medal (shameless fun fact: his PhD advisor Pierre-Louis Lions, himself a recipient of the Fields Medal in 1994, was my math teacher during my Master's).

While these weekly challenges are mathematical in nature and can (should!) be solved with pen and paper, I will take a computer-oriented approach, namely with the R programming language for statistical computing. The idea is not necessarily to fully solve each challenge (computationally or not), but rather to use these challenges as R tutorials, dig deeper into the underlying mathematics along the way, and keep a focus on visualisations. Because of the focus on R, we'll also take slight detours to look into R performance and best practices, and to try different packages and approaches.

Think of this as a road trip: we kind of know where we want to go, have no idea if we'll reach our destination, and are even less certain of the path we'll take, but whatever the path, we'll see some really cool stuff along the way!

Challenge #1: Palindromes

Challenge #2: Slicing Cubes

Challenge #3: Sum of all grids

Challenge #4: Match-ing game (coming up soon!)

Challenge #5: Winning ropes (coming up soon!)

Saturday, May 4, 2019

Blazers-Nuggets quadruple OT: Can't help but wonder...

...how last night's quadruple overtime game would have ended had it been played in Denver instead of Portland.



It's pretty common knowledge that home-court advantage is very real in the Mile High City (something I had looked into a few years ago on this very blog), so seeing how tired even some of the Nuggets players looked, I really wonder what a full hour of play at high altitude would have done to the Blazers team.

And because this is primarily a stats-oriented blog, we should definitely point out the reporting issue due to the NBA not being used to players logging over 59 minutes and 59 seconds of play:



Quite an impressive stats line for a single minute of play!


Thursday, May 2, 2019

Extraordinarily ordinary (a.k.a. the "11-year NBA Playoff Upset Cycle")



Every year like clockwork, the NBA playoffs are upon us, historically one of my favorite topics to post about!

As I was comparing my predictions with what has taken place on the courts these past few weeks, I realized two things.

First: had the Spurs won that Game 7 against the Nuggets, I would have had all first round predictions correct.

That in itself isn't really jaw-dropping: there are only 8 match-ups in the first round, and most of them are highly predictable. NCAA March Madness, in comparison, has 32 first-round match-ups.

It wasn't jaw-dropping for another reason. (Let me give you a hint: my young kid who knows nothing about the NBA got all predictions for the first round correct)

That's right, simply predicting that the higher seed would win each match-up was enough to get the entire first round right! Not a single upset so far! We sports lovers have been deprived of that special moment when the young challenger topples the giant. Ah Spurs, you were so close!

And here is where things get interesting: the last time there weren't any upsets in the NBA playoffs was... in 1974!!! The second year the NBA was created.



Now granted, we're only halfway through the playoffs and upsets can still happen. But restricting ourselves to the first round, the last year with no upsets was 2008, over 10 years ago. Historically, the upset rate is just above 25%, so we should already have expected about 2 in this year's first round!

We can also notice incredibly high rates in the 70s and early 80s; that's essentially because the playoffs were considerably shorter then (fewer teams, so no first round, and certain rounds were best-of-3, providing a much better opportunity for challengers).

While we're on the topic, here are some other stats on upsets.

Upsets seem to be more common in the Western conference than in the Eastern conference, but the difference is not significant:

It would also seem upsets are more likely in the later stages, so all hope is not lost! (But keep in mind that there are fewer match-ups in the later stages, so we would still have expected more upsets to have already occurred in the first round.)


And while challengers are rarely able to repeat their herculean effort in the next round, we should definitely give yet another shoutout to the 1999 New York Knicks who, despite being 8th in the East, went all the way to the Finals, losing to the Spurs in 5 games.

Before we end, I wanted to point out to a rather bizarre observation. Looking at the fraction of upsets per year, we notice dips every now and again. If we look at the exact years when there were less than 10% of upsets, things get a little freaky: 1974, 1986, 1987, 2008, 2019. Aside from 1974-1986 being a 12-year gap, all others are perfect 11-years gap! Like clockwork, we were doomed to have little-to-no upsets this year! We're still in the middle of the playoffs so things get still change quite considerably....