Tuesday, December 17, 2019

Can the 2019-2020 Lakers get 70 wins?



I know I know, it's way too early to have this conversation.
If we're already having it with less than a third of the season under our belts, can you imagine what it will be like for the rest of the way? "70" might be a top Google search in 2020!

That being said, the conversation has already started: here's the record for the '71 Lakers, blablabla, very similar, blablabla... So it's difficult to block it out entirely, and as a big fan of visualization, it's even more difficult not to add a couple of visuals to cut through all the number comparisons we're seeing.

Very quick background: the NBA regular season consists of 82 games. 70 was a mythical number of wins no team seemed able to reach, the Lakers getting the closest with 69 in 1972. That all changed in 1996 when Michael Jordan's Bulls went on to win 72. Twenty years later, the Golden State Warriors were able to inch just a bit further with 73 wins which is the record as of today. (People will routinely point out that while the Bulls won the championship that year, the Warriors lost the Finals in 7 games...). The year after they got 72 wins, the Bulls were close to repeating the 70-win feat but lost their last two games and ended at 69. They did however win the championship again that year.

This year, the Los Angeles Lakers are off to a really hot start, and after the Bucks' loss yesterday, lead the NBA with a 24-3 record. This hot start has naturally fueled the 70-win conversation, so how do all these teams stack up at this point of the regular season?

Here's a game-by-game evolution of those four teams' win percentages:



The gray horizontal line marks the 70-win pace (70/82, approximately 85.4%). What really stands out here is the overall downward trend as the season progresses. It took a while for the '71 Lakers to get above the 70-win threshold, and once above it, it seems a very difficult level to maintain. No team sat just under the threshold only to climb above it in the final stretch of the season.
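For reference, the curves in that plot boil down to a cumulative win percentage per game. A minimal sketch, assuming a hypothetical data frame games with a team column and a 0/1 win column ordered by date (not the original code):

games$game.number <- ave(games$win, games$team, FUN = seq_along)
games$win.pct     <- ave(games$win, games$team,
                         FUN = function(w) cumsum(w) / seq_along(w))

plot(win.pct ~ game.number, data = games, type = "n", ylim = c(0, 1))
for (tm in unique(games$team)) {
  lines(win.pct ~ game.number, data = games[games$team == tm, ])
}
abline(h = 70 / 82, lty = 2, col = "gray")  # the 70-win pace (~85.4%)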

While the Lakers are currently tied with the '95 Bulls, '96 Bulls and '71 Lakers, they will need to keep accumulating wins at better than that 85.4% rate to build an acceptable cushion against the downward end-of-year trend.

As for the rationale behind that downward trend, it could be due to fatigue, although one could argue that all teams should be similarly affected; more likely it is due to the final Playoff standings starting to take shape. The 7th, 8th, 9th and 10th seeds are all fighting for the last two Playoff spots and will do whatever it takes to win.

It could very well be that LeBron and company will have to make some late-season decisions as to whether they want to pursue the 70-win mark or maximize their chances for the title. Officially they'll declare the latter as their priority, but we all know that if LeBron can add a 70-win season to his resume he won't turn it down. He also knows first-hand the risk of chasing the wrong priority: his team was the one that beat the 73-win Warriors...

Tuesday, December 3, 2019

Lego: Christmas gift without bricking the bank



Four years ago right around Christmas time, I had posted an analysis of how the prices of Lego sets were related to their number of pieces.

Checking a Lego Christmas catalog last week, I had the feeling that:
  1. prices had gone up in general
  2. yet the 10-pieces-per-dollar ratio I had estimated 4 years ago still appeared to be valid.
The natural conclusion would then be that the number of pieces had also gone up.

I pulled all the prices and number of pieces by Lego theme (over 40 nowadays!), and revisited the analysis.

The data needed some clean-up again, especially regarding sets with abnormal prices for the number of pieces: the Duplo sets (oversized blocks are much more expensive), the Mindstorms sets (the color sensor is over $40) and the Powered-Up sets (the move hub is $80). Similarly to last time, I also split themes into those based on an external partnership (Star Wars, Harry Potter, Marvel, DC...) and those entirely developed by Lego (City, Friends, Creator Expert...).

Here's the plot of pieces versus price for all sets less than $200 (95% of sets), and removing the outliers mentioned in the previous paragraph:


It's quite tempting to look at the range for the pieces-to-price ratio across sets, splitting by theme:


Out of the top 10 themes with the best median ratio, 9 are themes Lego developed in-house, and only one comes from an external partnership (car makers, whose deal might not have been as lucrative as the Marvel or Star Wars rights).

We can fit a linear model of the number of pieces against price, partnership, and their interaction:


The result is that for non-partnership sets, we get a pieces-to-price ratio of approximately 10.7, a ratio that drops to just over 9 for the partnership sets. The difference is highly significant and confirms our previous observations. The intercepts are slightly different, but not significantly so.
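In code, that model amounts to something like the following sketch, assuming a data frame lego with columns pieces, price and a partnership flag (hypothetical names, not the original script):

# Pieces modeled against price, partnership status and their interaction
fit <- lm(pieces ~ price * partnership, data = lego)
summary(fit)
# The price coefficient is the pieces-per-dollar slope for non-partnership sets;
# the price:partnership interaction is the (negative) adjustment for partnership sets.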

A question we might be interested in is whether we observe any diminishing or increasing returns as prices go up. This can be done by looking at the pieces-to-price ratio for various price buckets:

As we might have expected, our pieces-to-price ratio improves as we move into the more expensive sets. This makes sense: the pieces are only one of the costs of a set, and fixed costs such as packaging and marketing make the smaller sets less attractive to the smart buyer.
Non-partnership sets appear to consistently outperform the partnership sets, but one should keep in mind that the sample size is pretty small in the $200+ bucket.
The previous result also throws some shade on our linear model: the slopes (pieces-to-price ratios) are actually not constant. If we wanted to improve the model, we would probably need to include a quadratic term for price.
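A sketch of that bucketed view, under the same assumed lego data frame (the bucket boundaries are my own illustrative choice):

# Median pieces-to-price ratio by price bucket and partnership status
lego$ratio  <- lego$pieces / lego$price
lego$bucket <- cut(lego$price, breaks = c(0, 20, 50, 100, 200, Inf),
                   labels = c("<$20", "$20-50", "$50-100", "$100-200", "$200+"))
aggregate(ratio ~ bucket + partnership, data = lego, FUN = median)

# A quadratic term would relax the constant-slope assumption discussed above:
# lm(pieces ~ (price + I(price^2)) * partnership, data = lego)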

So to recap, your best pieces per dollar will come from non-partnership and bigger sets. Of course it could be your child has a specific theme in mind, so to help out with your Christmas purchases, here's the list of the best sets for each theme:

Theme                   Product Name                        Pieces    Price  Ratio
Architecture            trafalgar-square                      1197    79.99  14.96
Boost                   droid-commander                       1177   199.99   5.89
Brickheadz              thanksgiving-scarecrow                 177     9.99  17.72
City                    pizza-van                              249    19.99  12.46
Classic                 bricks-bricks-bricks                  1500    59.99  25.00
Creator-3-in-1          deep-sea-creatures                     230    14.99  15.34
Creator-Expert          taj-mahal                             5923   369.99  16.01
Dc                      1989-batmobile                        3306   249.99  13.22
Disney                  the-disney-castle                     4080   349.99  11.66
Disney-Frozen-2         elsa-s-magical-ice-palace              701    79.99   8.76
Duplo                   creative-fun                           120    39.99   3.00
Fantastic-Beasts        newt-s-case-of-magical-creatures       694    49.99  13.88
Friends                 underwater-loop                        389    29.99  12.97
Harry-Potter            hogwarts-castle                       6020   399.99  15.05
Hidden-Side             shrimp-shack-attack                    579    49.99  11.58
Ideas                   lego-nasa-apollo-saturn-v             1969   119.99  16.41
Juniors                 city-central-airport                   376    49.99   7.52
Jurassic-World          jurassic-park-t-rex-rampage           3120   249.99  12.48
Lego-Batman-Sets        1989-batmobile                        3306   249.99  13.22
Lego-Spider-Man         spider-mech-vs-venom                   604    49.99  12.08
Marvel                  the-hulkbuster-smash-up                375    29.99  12.50
Mindstorms              lego-mindstorms-ev3                    601   349.99   1.72
Minecraft               the-wool-farm                          260    19.99  13.01
Minifigures             mf-set-ninjago-2019                     59    12.99   4.54
Ninjago                 ninjago-city                          4867   299.99  16.22
Overwatch               dorado-showdown                        419    29.99  13.97
Power-Functions         lego-power-functions-train-motor         7    13.99   0.50
Powered-Up              disney-train-and-station              2925   329.99   8.86
Powerpuff-Girls         mojo-jojo-strikes                      228    29.99   7.60
Serious-Play            window-exploration-bag                4900   484.99  10.10
Speed-Champions         formula-e-panasonic-jaguar-racing      565    29.99  18.84
Star-Wars               yoda                                  1771    99.99  17.71
Stranger-Things         the-upside-down                       2287   199.99  11.44
Technic                 cherry-picker                          155     9.99  15.52
The-Lego-Movie-2        pop-up-party-bus                      1024    79.99  12.80
The-Lego-Ninjago-Movie  ninjago-city                          4867   299.99  16.22
Toy-Story-4             woody-rc                                69     9.99   6.91
Unikitty                unikitty-cloud-car                     126     9.99  12.61
Xtra                    traffic-lights                          46     3.99  11.53

Now one important caveat should be mentioned here. Number of pieces is a great proxy for size and time to build, but it isn't perfect. Going through some sets I've noticed a tendency to add more and more small decorative pieces. This trick can significantly and artificially bump up the piece count. Ideally we would also include weight to control for that aspect, but weight was not available, and probably not at the desired granularity level anyway. Take the new gingerbread house for instance. At $99 for 1477 pieces it seems like a great deal, but take a closer look at what's inside:

Look at all those small pieces! Especially the red and white candy canes around the doors and windows. Just tiny 1-by-1 round pieces stacked together. Almost 100 of them, and I doubt they're worth $10!

Now of course, these were just models, and you can throw them all out the window when it comes to Ole Kirk's house. Who? Ole Kirk as in Ole Kirk Christiansen, the man who founded The Lego Company. His house is an extremely rare Lego set which was gifted only to employees (and a few special visitors). At 912 pieces we would roughly estimate it at $90, but good luck finding it for less than $300; prices of $500 are not uncommon for unopened sets!


Thursday, November 21, 2019

Twitter, Trump & the Economy



Mentioning Twitter and Trump in the same sentence has become a truism (a lapalissade, as the French would say) over the last few years.

It has reached the point where each new tweet is logged, analyzed, correlated. For a comprehensive archive, I strongly recommend the Trump Twitter Archive. They're all there, available for any analysis you can think of!

I recently stumbled across multiple articles on how the number of tweets from Trump was negatively correlated with stock market performance. A few examples:


The articles don't say it explicitly, but they strongly suggest causality, something we'll get back to later.

Reading all this, I decided to take a quick stab and see if:

  1. there were any interesting trends in Trump's Twitter usage that hadn't already been reported
  2. I could replicate some of the stock market findings

Twitter activity

Much has already been said about Trump's Twitter patterns, and using activity as a proxy for determining when he sleeps, here are heatmaps for original tweets, retweets, total tweets and fraction of retweets:


Very dark from midnight to 6am as one would expect. We can also spot some sweet spots for original content on Monday mornings to kick off the week, and that same day right before midnight (lighter color zones in the upper right side graph).

I did want to explore the evolution of the fraction of retweets over time (# retweets / # total tweets). It's very noisy data, as one can imagine, so I slapped a smooth curve on top, and the result is quite striking:


I'm guessing the retweet data wasn't collected prior to 2016, but since the election there has been a clear upward trend.
It could be a sign that less time is available to generate a tweet from scratch and retweeting is a simpler way of manifesting presence, or it could be a way of building support ahead of the upcoming elections. It will definitely be interesting to keep monitoring this metric over the next few months and see if we ever cross the 50% mark.
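As a rough sketch of how such a curve can be produced (assuming a hypothetical data frame tweets with a POSIXct created_at column and a logical is_retweet flag, and using a plain loess smoother rather than whatever was used for the actual plot):

# Daily fraction of retweets, with a smooth curve on top
tweets$day <- as.Date(tweets$created_at)
daily <- aggregate(is_retweet ~ day, data = tweets, FUN = mean)

plot(daily$day, daily$is_retweet, pch = 16, cex = 0.3,
     xlab = "Day", ylab = "Fraction of retweets")
smooth <- loess(is_retweet ~ as.numeric(day), data = daily, span = 0.3)
lines(daily$day, predict(smooth), col = "red", lwd = 2)
abline(h = 0.5, lty = 2)  # the 50% mark mentioned above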


Stock market performance

The original intent was to re-run the analysis mentioned in various articles about the negative correlation between the number of tweets and stock market performance.
The articles were not entirely clear on what data was being used; the best description I could find was:
"since 2016, days with more than 35 tweets (90th percentile) by Trump have seen negative returns (-9bp), whereas days with less than 5 tweets (10th percentile) have seen positive returns (+5bp) — statistically significant."

So the time period will be January 1st, 2016 until today (a little more data than what was originally used). In terms of tweets, it is not entirely clear whether retweets are included or not, so I will test both alternatives. As for the returns, I will look at three market indicators: the Dow Jones, S&P 500 and Nasdaq. Two more alternatives were possible depending on whether we measure market performance in absolute or relative changes.

Similarly to the quote, I identified in each case the 10th and 90th tweet percentiles, and ran a simple t-test comparing market performance on days below the 10th percentile versus days above the 90th percentile. Unfortunately, in none of the 12 cases was I able to identify a significant relationship between Trump's Twitter activity and market performance:



There could be multiple reasons why I wasn't able to replicate the findings, from the data source (a good indication of this is that my 10th and 90th percentiles are lower than the values reported in the articles), to the formatting (I used Eastern time as the reference, but using a different time zone could shift tweets to another day), to the technique used for detecting a difference (though I also ran other non-parametric tests and the results did not change).
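For concreteness, the core of each of the 12 comparisons boils down to something like this sketch, assuming a daily data frame daily with a tweet count n.tweets and a market return ret (hypothetical names):

# Compare returns on low-tweet days vs high-tweet days
lo.cut <- quantile(daily$n.tweets, 0.10)
hi.cut <- quantile(daily$n.tweets, 0.90)

low.days  <- daily$ret[daily$n.tweets < lo.cut]
high.days <- daily$ret[daily$n.tweets > hi.cut]

t.test(low.days, high.days)        # parametric comparison
wilcox.test(low.days, high.days)   # non-parametric check, as mentioned above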

One could be surprised by the lack of significance when seeing that the Nasdaq has an average uplift of 0.2% on low-tweet days but only 0.06% on high-tweet days, but this is easily explained when visualizing the actual distributions and their large standard deviations:



Independently of all these potential causes though, I feel that the interpretation of the result matters even more than the result itself.


Correlation VS Causation

Let us assume we had surfaced an incredibly tight correlation between the two metrics: the number of tweets by Trump and market performance. What would that have meant, if anything?

When correlation between two variables A and B is observed, a lot could be going on behind the scenes:
  • A and B have nothing in common and the observed correlation is purely coincidental. This is called a spurious correlation. One of my favorite examples is the correlation between "the number of people who drowned by falling into a pool" and "the number of movies Nicolas Cage appeared in" (example taken from https://www.tylervigen.com/spurious-correlations which has lots of other great ones!)



  • It might seem that A causes B but it could actually be the reverse. For instance, it could seem that people going more frequently to the gym have lower BMIs, but perhaps those with lower BMIs tend to go more frequently to the gym...
  • The most vicious situation is that of the confounding or hidden variable that has a causal impact on both A and B separately. We might see a link between carrying a lighter and the risk of lung cancer, but neither affects the other; the real hidden variable is whether the person is a smoker or not.

Back to the tweets. Assuming there were a strong, significant correlation between the volume of tweets and market performance, one could argue there are other explanations than sheer volume creating market panic. Perhaps it is the reverse effect: when markets perform poorly, Trump is more likely to tweet and attempt to reassure. Or there could be hidden variables in other economic events: if there are trade tensions with other countries, Trump could be more likely to tweet about them while, in parallel, the market tends to panic and go down...

Anyway, this was just a little tweet for thought...

Tuesday, September 10, 2019

How elusive was 42 as the sum of three cubes?

The big mathematical news on September 9th, 2019 was the final resolution of a decades-old Diophantine problem.

As is often the case for super tough questions, the question itself seems completely trivial.
Can all the numbers from 1 to 100 be written as the sum of the cubes of three integers,
i^3 + j^3 + k^3?

Of course the key part of the question is integers, otherwise 0^3 + 0^3 + (cube root of x)^3 would return x for any x we want!

The other subtlety is that integers can be negative, so each cube can be either positive or negative. So 92 can be written as 9^3 + (-8)^3 + (-5)^3.



This puzzle has been around in its current form for decades, but similar equations with integer solutions have been studied at least since the third century!

For our problem, solutions for almost all numbers 1 through 100 had been found (numbers of the form 9k+4 or 9k+5 are known to have no solution), but two remained elusive: 33 and 42.



And then 33 was found earlier this year:
   (8866128975287528)^3
+ (−8778405442862239)^3
+ (−2736111468807040)^3 = 33

Which left 42, but advances on the theory and on the computing side have cracked the last remaining elusive number:
   (80435758145817515) ^ 3
+ (-80538738812075974) ^ 3
+  (12602123297335631) ^ 3 = 42

Now I will describe an algorithm to generate all numbers 1 through 1000....
Just kidding.

I was actually curious about the elusiveness of these numbers 33 and 42. Mathematicians had to go really far to find the right integers. Is that because sums just aren't very likely to fall in the 1 to 100 range (very likely, given how quickly cubes grow!), or are certain numbers much more likely to get all the attention? In other words, as all combinations of three integers were tested, were all numbers 1 through 100 obtained in roughly equal proportions, or did the frequency concentrate around certain ranges?

I wrote a small script to explore this, simply generating all possible combinations of three integers (up to the -1000 to 1000 range, I didn't feel like going all the way up to 80538738812075974...) and looking at the distribution of i^3 + j^3 + k^3 over the 1 to 100 range.
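Here is a minimal version of such a script for the smallest range (my own sketch; the larger ranges discussed below call for a less brute-force enumeration):

# Which targets in 1..100 are hit by i^3 + j^3 + k^3 for i, j, k in -10..10?
rng    <- -10 : 10
combos <- expand.grid(i = rng, j = rng, k = rng)
sums   <- with(combos, i^3 + j^3 + k^3)

freq <- table(factor(sums[sums >= 1 & sums <= 100], levels = 1 : 100))
sum(freq > 0)                                       # number of solved targets
barplot(pmin(freq, 10), las = 2, cex.names = 0.5)   # truncate the perfect-cube spikes at 10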

With i,j,k in -10 to 10


With integers just in that small range, 61 out of the 100 numbers are already solved!
The distribution is far from uniform though! Two things jump out:

  • the spikes: These are perfectly normal! If a number is a perfect cube (e.g. 27), then there is an infinite number of solutions obtained by setting the two other values to opposites: 27 = 3^3 = 3^3 + 0^3 + 0^3 = 3^3 + (1761)^3 + (-1761)^3... And you can add in all the permutations!
  • the clusters: it seems the solved numbers come in groups. Again, quite reasonable if we think about it. If a number can be generated as the sum of two cubes, then we can easily generate the ones immediately before and after by adding either 1^3 or (-1)^3 as the third cube. We have an extra degree of freedom around the perfect cubes. Being able to generate consecutive solved numbers creates the clusters. There is a further discussion at the end of the post on how unexpected the number of clusters is.

Based on the spike comment, I will truncate at frequencies of 10 to focus on coverage:


With i,j,k in -100 to 100

Expanding the range ten-fold leads to 68 out of the 100 being solved:




With i,j,k in -1000 to 1000

Increasing the range ten-fold once again leads to a single additional number being solved: 51


We now have a better sense of the elusiveness of the solutions. If they aren't found within the range of the smaller integers, they get increasingly difficult to identify later; no wonder this remained an open problem for so many years!

Just for fun, I looked at a new problem, switching the power from 3 to 5: only 13 numbers (compared to 69 for cubes) were solved in the -1000 to 1000 range:


Again the clustering is pretty remarkable! While 42 is still unsolved, I'm glad to report a solution for 33: 0^5 + 1^5 + 2^5 !

And to close out this post, I wanted to return to the topic of the clusters. We saw that when exploring the -1000 to 1000 ranges for the integers we were able to cover 69 out of the 100 target numbers. But it didn't appear as though the 69 numbers were randomly distributed in 1-100 but grouped into clusters, more specifically into 13 clusters. We had a high-level explanation for this as consecutive numbers can easily be generated via k^3 with k = -1, 0 and 1. But the question remains, how non-random is this grouping into 13 clusters?

One approach is to sample without replacement 69 of the first 100 integers, and see how many clusters are obtained. In order to count clusters, I used the rle() function, which stands for run length encoding and basically describes a sequence. E.g. for 1,2,2,2,1,0,0 it will return that there is one 1, three 2s, one 1 and two 0s. From there it is easy to get the number of clusters.
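In code, counting the clusters of one random draw looks roughly like this (a sketch, not the original script):

# Count clusters of consecutive "solved" numbers among a random subset of 1..100
CountClusters <- function(solved, n = 100) {
  indicator <- rep(0, n)
  indicator[solved] <- 1
  sum(rle(indicator)$values == 1)   # number of runs of consecutive 1s
}
CountClusters(sample(1 : 100, 69))  # one random draw of 69 "solved" numbers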

I generated 10,000 samples of "solved" numbers between 1-100 and in each case computed the number of clusters obtained. Here is the associated histogram to which I've added the 13 we obtained in our case as a red vertical line:


While we'd typically expect to get about 20-25 clusters, only in extreme cases do we get a 15 or even a 14. Our 13 was not obtained a single time in all ten thousand simulations. This should settle the question of whether the low number of clusters comes as a surprise or not!



Friday, August 30, 2019

Disney vs Fox: Movie Performance Comparison

A few weeks ago 'Dark Phoenix' was a very hot conversation topic in the movie business, not for the movie itself, but because it symbolized the difficult Fox acquisition by Disney.


While Disney movies have done fairly well in 2019, to put it lightly (5 movies generating over $1 billion so far, and it's only August with "Frozen 2" and "Star Wars 9" still due...), Fox is under very tight scrutiny and 'Dark Phoenix' has been heralded as the barometer for Fox movie success/failure. Especially after Disney CEO Bob Iger declared during a quarterly earnings call that "the Fox studio performance … was well below where it had been and well below where we hoped it would be when we made the acquisition."

While technically profitable thanks to higher-than-expected international results (the latest IMDB numbers have $250M gross worldwide revenue, with analyst forecasts around $280M, for an estimated budget of $200M), the performance of 'Dark Phoenix' will continue to be observed and scrutinized.

But this is just one movie we're talking about! Is it really fair to reduce a movie studio the size of Fox to a single movie? 20th Century Fox has been around for a while, and I was curious to compare its historical performance with Disney's.

First things first, data is required! I pulled all of Disney's and Fox's movies from IMDB, including key metrics in the process: IMDB rating, metascore, duration, genre, estimated budget, gross revenue...

Naturally some filtering was required to focus solely on non-short non-documentary movies released after 2000 with non-missing data.

In addition to Fox and Disney movies, I created a third category for movies produced by both.

The next plot compares the performance of each of the 512 movies that matched our criteria (331 Fox, 181 Disney); the x-axis is the worldwide revenue, the y-axis is the IMDB rating. Bubbles are color-coded by company and the size of the bubble is proportional to the movie's ROI (gross worldwide revenue / estimated budget):


A log-transform of the x-axis provides more insight into the left-hand-side blob, but loses the magnitude of the super high-performing movies:


It would appear that Disney dominates the high-revenue end of the spectrum, with 22 out of the 23 movies generating over a billion dollars coming from Disney, the sole exception being 'Transformers: Age of Extinction'. Things are much murkier under that threshold.

Given the difficulty of separating the two classes, I tried fitting an XGBoost model, which was capable of classifying movies from each company solely based on IMDB rating, metascore, budget, worldwide revenue and movie duration with over 75% accuracy on the test dataset. Pretty remarkable, wouldn't you agree? Although to be fair, a simple logistic regression was capable of achieving a reasonable 65% accuracy.
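As a rough sketch of that setup (my own illustrative code, assuming a data frame movies with columns company, rating, metascore, budget, revenue and duration, which are hypothetical names):

# Can we tell Fox from Disney based on a handful of movie metrics?
library(xgboost)

set.seed(1)
features  <- c("rating", "metascore", "budget", "revenue", "duration")
train.idx <- sample(nrow(movies), 0.8 * nrow(movies))
y <- as.numeric(movies$company == "Disney")

# Baseline: logistic regression
logit <- glm(company == "Disney" ~ rating + metascore + budget + revenue + duration,
             data = movies[train.idx, ], family = binomial)
logit.pred <- predict(logit, newdata = movies[-train.idx, ], type = "response") > 0.5
mean(logit.pred == y[-train.idx])   # test accuracy

# Gradient boosting
fit  <- xgboost(data = as.matrix(movies[train.idx, features]), label = y[train.idx],
                nrounds = 100, objective = "binary:logistic", verbose = 0)
pred <- predict(fit, as.matrix(movies[-train.idx, features])) > 0.5
mean(pred == y[-train.idx])         # test accuracy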

It'll be interesting to see how the next wave of Fox movies perform, perhaps attempt a before/after comparison of the Disney acquisition...




Friday, June 28, 2019

Solving Mathematical Challenges with R: #3 Sum of all grids

Welcome to our third challenge!

You can find more details on where these challenges come from and how/why we will be approaching them in the original posting.

Challenge #3

The original link to the challenge (in French) can be found here. The idea is to consider an empty three-by-three grid, and then place two 1s in any two of the nine cells:


The grid is then completed as follows:
  • pick any empty cell
  • fill it as the sum of all its neighbors (diagonals as well)
Here's a detailed example; the new cell at each step is indicated in red:



The challenge is now to find a way to complete the grid in such a manner as to obtain the greatest value in a cell. In the previous example, the maximum obtained was 30, can we beat it?

Power grid
I strongly encourage you to try out a few grids before continuing to read. Any guesses as to what the greatest obtainable value is?

As you try it out, you might realize that the middle cell is rather crucial, as it can contribute its value to every other cell, thus offering the following dilemma:
  • fill it too soon and it will be filled with a small value
  • fill it too late and it will not be able to distribute its value to a sufficient number of other cells
Gotta try them all!
In the spirit of our previous posts, let's try all possibilities and perform an exhaustive search.
The first thing we need to ask ourselves is the number of possible ways of filling up the square, and then generating all those combinations in R.
Ignoring how the cells are filled, there are 9 possibilities for the first cell (a 1), 8 possibilities for the second (also a 1), 7 for the third (using the rule)... and 1 for the ninth and final cell. This amounts to 9*8*7*...*1 = 9! = 362880 orderings, a piece of cake for R to handle.

Generating permutations
So we want to generate all these possibilities, which are actually permutations of the numbers 1 through 9, where each number corresponds to a cell. {7,1,5,2,4,9,3,8,6} would mean filling up cells 7 and 1 first with a 1, then filling up cell #5, then #2, and so on.
A naive method would be to use expand.grid() once again for all digits 1 through 9, which would also give unacceptable orderings such as {7,1,5,2,1,3,6,8,9}, because cell #1 is filled twice at different times! We could easily filter those out by checking that we have 9 unique values. But this would amount to generating all 9*9*...*9 = 9^9 ≈ 387 million combinations and narrowing down to just the 362K of interest. That seems like a lot of wasted resources and time!

Instead, I approached the problem from a recursive perspective (R does have a package combinat that does this type of work, but here and in general in all the posts I will try to only use basic R):

  • If I only have one value a, there is only a single permutation:{a}
  • If I only have two values a and b, I fix one of the values and look at all permutations I have of the left-over value, then append the value that was fixed:
    - fix a, all permutations of b are {b}, then append the a: {b,a}
    - fix b, all permutations of a are {a}, then append the b: {a,b}
    - the result is {a,b} and {b,a}
  • with three values, repeat the process now that we know all permutations of 2 values:
    - fix a, all permutations of b,c are {b,c} and {c,b}, then append the a: {b,c,a}, {c,b,a}
    - fix b, all permutations of a,c are {a,c} and {c,a}, then append the b: {a,c,b}, {c,a,b}
    - fix c, all permutations of a,b are {a,b} and {b,a}, then append the c: {a,b,c}, {b,a,c}

In R:
library(data.table)  # for as.data.table() and rbindlist()

SmartPermutation <- function(v) {
  if (length(v) == 1) {
    # base case: a single value has a single permutation
    return(as.data.table(v))
  } else {
    # fix each element e in turn, permute the rest, then append e as the last column
    out <- rbindlist(lapply(v, function(e) {
      tmp <- SmartPermutation(setdiff(v, e))
      tmp[[ncol(tmp) + 1]] <- e
      return(tmp)
    }))
    return(out)
  }
}


Computing the score
The next and final step is to fill out the grid for each possible permutation.
This is done by defining a static mapping between cell ID and its neighbors:
We then go through each permutation, fill out a virtual grid (in our case not a 3-by-3 matrix but a vector, which doesn't change anything as long as the neighbor relationships are maintained) and track the maximum value.

Neighbors <- function(i) {
  neighbor.list <-
    list(c(2, 4, 5), c(1, 3, 4, 5, 6), c(2, 5, 6),
         c(1, 2, 5, 7, 8), c(1, 2, 3, 4, 6, 7, 8, 9), c(2, 3, 5, 8, 9),
         c(4, 5, 8), c(4, 5, 6, 7, 9), c(5, 6, 8))
  return(neighbor.list[[i]])
}

CompleteGrid <- function(row) {
  grid <- rep(0, 9)
  grid[row[1 : 2]] <- 1                           # the two starting 1s
  for (i in 3 : 9) {
    grid[row[i]] <- sum(grid[Neighbors(row[i])])  # fill each cell with the sum of its neighbors
  }
  return(max(grid))
}
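Putting the two pieces together, the exhaustive search is then a one-liner over the permutations (a small driver of my own, not shown in the original code):

# Run every permutation through the grid-filling routine and keep the best score
perms  <- SmartPermutation(1 : 9)
scores <- apply(perms, 1, CompleteGrid)
max(scores)   # should match the maximum reported below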

Final answer
Based on our exhaustive search, we were able to identify the following solution yielding a max value of...57!



Little bonus #1: Distribution
The natural question is about the distribution of values. Were you also stuck around a max value in the forties? 57 seems really far off, but the distribution graph indicates that reaching values in the forties is already pretty good!




Little bonus #2: Middle cell
Do you remember our original trade-off dilemma? How early should the middle cell be filled out? In the optimal solution we see it was filled out exactly halfway through the grid.
To explore further, I also kept track, for each permutation, of when the middle cell was filled out and what its value was. The optimal solution is shown in green, the middle cell being filled out in 5th position with a 7.


And a similar plot of overall max value against position in which the middle cell was filled:


It clearly illustrates the dilemma of filling the middle cell neither too late nor too early: some very high scores can be obtained when it is filled as the 4th or 6th cell, but the absolute max of 57 is only achieved when it is filled 5th. Filling it in the first or last three positions limits the max to 40.


Saturday, June 8, 2019

Solving Mathematical Challenges with R: #2 Slicing Cubes

Welcome to the second challenge!
You can find more details on where these challenges come from and how/why we will be approaching them in the original posting.

Challenge #2

The original link to the challenge (in French) can be found here. It's easy to understand but non-trivial to wrap one's head around...
Consider a nice perfect cube. So far so good. Consider slicing this cube with a Samurai sword in one swift straight cut. Now take a look at the 2D shape created by the cut on the cube. What types of shape can it be? We are primarily interested in whether or not it can be a regular polygon (equilateral triangle, square, regular pentagon, regular hexagon...).

Starting simple: Square

Let's look at some of the simpler shapes. Ironically, the simplest solution is not the equilateral triangle but the square. Can the slice be a square? After a few virtual cuts in your mind, you will have noticed that if the cut is parallel to any face, the shape of the slice is a square!


Equilateral triangle

A little trickier, but a bit more virtual slicing and dicing should reveal the answer. Select a corner of the cube (technically, a vertex). Then move away from that corner along the three edges that meet at the corner, by an equal amount. The plane going through those three points (there is one and only one such plane: in 2D there is one and only one line going through two distinct points, in 3D there is one and only one plane going through three non-collinear points) will lead to an equilateral triangle:


The proof is straightforward: because we moved away from the corner by a constant value in all three directions, the three sides of the triangle are all equal to each other (square root of twice the square of the constant value, thanks Pythagoras!).

3-D Plotting in R

Before continuing, one question you might have is how the previous visuals were generated. Let's remind ourselves that these posts are not only about solving the challenge but also about expanding our R skills!
There are multiple functions and packages, but I was aiming for simplicity. I used the lattice package, which has the benefit of containing lots of other useful functions, so it's probably a good idea to have it installed once and for all, not only for the purpose of this challenge. Within that package, wireframe() is the function of interest.
The simplest way to use it is to provide a matrix where the value in the i-th row and j-th column is the altitude at the i-th index on the x-axis and j-th index on the y-axis. This assumes that all points on the x-axis and y-axis are equidistant. Alternatively, you can create a data.frame (say df) with columns for the x, y and z values. Then simply call wireframe(z ~ x + y, data = df) and you're good! There are multiple parameters for the shading, viewpoint angle and so forth...
Let's just run the basic example from the help file using the pre-loaded volcano dataset which is a 87 * 61 matrix:
library(lattice)
wireframe(volcano, shade = TRUE, aspect = c(61/87, 0.4), light.source = c(10, 0, 10))



The z-axis label was slightly truncated, but aside from that it's a pretty nice plot for one line of code!

Back to our cube
We solved the 3- and 4-sided regular polygons, but what about 5, 6, 7...?
We could continue playing around, but visualizing these slices is pretty tricky, you'll have to admit!
So how can R help? Well, how about generating a whole bunch of planes and looking at the type of slices we obtain?

To do that, a tiny little bit of algebra first. The general equation of a 2D plane in 3D is:
z = a + b * x + c * y
[well, the purists might disagree and they would be correct: how about the vertical plane satisfying x = 0? It's not covered by the previous equation]
However, the planes not covered are the vertical ones, and those can only slice the cube into rectangles (squares at best), a case we have already covered!

So now, all we need to do is generate a whole bunch of a, b and c values, and see where the resulting plane intersects the cube.

Step 1: Edge intercepts
Again a little algebra, but there are 12 edges the plane could potentially intercept:

  • the four vertical edges, parallel to the z axis, corresponding to {x=0 and y=0}, {x=0 and y=1}, {x=1 and y =0} and {x=1 and y=1}
  • the four edges parallel to the x-axis
  • the four edges parallel to the y-axis

In each case we can fit two of three values of x, y and z into the equation of the plane and get the third value.
Example: for x=0 and y=1, the equation becomes z = a + c. So the plane intercepts that edge at {x=0, y=1,z=a+c}.

So we end up with 12 points.

Step 2: Valid edge intercepts
For an edge to be valid, the third computed coordinate should be between 0 and 1. In our previous example, {x=0, y=1,z=a+c} is part of the cube if and only if 0 <= a + c <= 1.
We then filter to only keep the valid intercepts.

Step 3: Resulting figure?
Once we know how the cube was intercepted, we need to determine if we luckily stumbled across a special polygon!
To do that, I did the following:
  • compute all distances between intercepts (only once so as not to compute everything twice, needless to have both A --> B and B --> A)
  • keep the smallest value(s)
  • count the number of occurrences of that smallest value
  • if the number of occurrences is equal to the number of intercepts, we have ourselves a winner!
Illustration
Consider the square ABCD of side 1. I would compute all 6 possible distances:
  • A --> B: 1
  • A --> C: sqrt(2)
  • A --> D: 1
  • B --> C: 1
  • B --> D: sqrt(2)
  • C --> D: 1
I keep the minimal values:
  • A --> B: 1
  • A --> D: 1
  • B --> C: 1
  • C --> D: 1
I have four matches of the minimal value, same as number of points, I have a square!

The reason this works is that a regular polygon is convex, so the sides are the smallest distances between points (convex polygons on the left, concave on the right):



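Here is a sketch of steps 1 through 3 for a single plane z = a + b * x + c * y (illustrative code of my own, not the original):

# Steps 1 and 2: intercepts of the plane z = a + b*x + c*y with the unit cube's edges
CubeIntercepts <- function(a, b, c) {
  pts <- list()
  for (x in 0 : 1) for (y in 0 : 1) {               # 4 vertical edges (z free)
    z <- a + b * x + c * y
    if (z >= 0 && z <= 1) pts[[length(pts) + 1]] <- c(x, y, z)
  }
  if (b != 0) for (y in 0 : 1) for (z in 0 : 1) {   # 4 edges parallel to the x-axis
    x <- (z - a - c * y) / b
    if (x >= 0 && x <= 1) pts[[length(pts) + 1]] <- c(x, y, z)
  }
  if (c != 0) for (x in 0 : 1) for (z in 0 : 1) {   # 4 edges parallel to the y-axis
    y <- (z - a - b * x) / c
    if (y >= 0 && y <= 1) pts[[length(pts) + 1]] <- c(x, y, z)
  }
  unique(do.call(rbind, pts))                       # drop duplicates at cube vertices
}

# Step 3: the smallest pairwise distance should occur exactly once per intercept
IsRegular <- function(pts, tol = 1e-9) {
  if (is.null(pts) || nrow(pts) < 3) return(FALSE)
  d <- as.vector(dist(pts))
  sum(d < min(d) + tol) == nrow(pts)
}

IsRegular(CubeIntercepts(a = 0.5, b = -1, c = -1))  # the corner-cut equilateral triangle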
Running through a whole bunch of values for a, b and c (remember expand.grid() from the first challenge?) I got the following results for number of identified regular polygons based on number of intercepts:

#Intercepts   Not Regular   Regular
          0       2965368         0
          1         39049         0
          2           564         0
          3       1680834     54885
          4       1761091      3884
          5       1084221         0
          6        530505       196
          7             4         0


So we get a whole bunch of equilateral triangles, which makes sense given the technique we developed earlier to generate them ourselves.
Quite a few squares too, including some non-trivial ones (non-parallel to a face):

We did find some regular hexagons:

However, not a single pentagon nor heptagon in sight.
That doesn't mean they don't exist, but again, we used R here to give us some leads, not to fully resolve geometric proofs!

Hope you enjoyed!





Tuesday, May 21, 2019

Solving Mathematical Challenges with R: #1 Palindromes

This post is the first challenge from what will hopefully be a long series!

You can find more details on where these challenges come from and how/why we will be approaching them in the original posting.

Challenge #1

Let's jump right in with the first challenge! You can find the original presentation of the challenge (in French!) here.

It goes as follows: a number is said to be a palindrome if the digits read left to right are the same as if read right to left. An example would be 151, or 70355307. The first question is simply to determine the number of palindrome numbers that have 351 digits. As a bonus question: what is the minimal difference between two palindrome numbers with 351 digits?

Naturally the first step is to get a better understanding of what we are dealing with, independently of how we want to tackle the problem.





Definitely seems like a small pattern is emerging! If we were to continue with the mathematical approach we would look at N = 4, possibly N = 5 to better understand the dynamics, and start separating the case of N even with N odd. Let's take the R route for now.

How can we generate/identify all palindromes with N digits?

One first approach is to generate all numbers of N digits, flip them, and keep those where the flip left the number unchanged:

PalindromesSlow <- function(n) {
  # all n-digit numbers
  numbers <- seq(10 ^ (n-1), 10 ^ n-1)
  # reverse the digits of each number
  flipped <- as.numeric(sapply(
    lapply(strsplit(as.character(numbers), split = ""), rev),
    paste0, collapse = ""))
  # keep only the numbers unchanged by the flip
  palindromes <- numbers[numbers == flipped]
  return(palindromes)
}

Why the “Slow” in the name? While this function performs exactly as expected, it won’t scale well if we start looking at numbers 7-digit long and larger. The main reason is that we are generating every single potential number and only selecting an ever-decreasing fraction of those meeting our palindrome criteria. Why ever-decreasing? Let’s look at the trend for our first values of N:
  • N = 1: 9 palindromes, 9 total numbers: 100%
  • N = 2: 9 palindromes, 90 total numbers: 10%
  • N = 3: 90 palindromes, 900 total numbers: 10%
  • N = 4: 90 palindromes, 9000 total numbers: 1%
Because palindromes are determined by the first half of their digits, we shouldn't try generating all possible numbers with N digits, but only all possible first halves, then mirror them to generate the second half. For example, if we want to generate all 6-digit palindromes, let us first generate all combinations of 3 digits XYZ, then flip and append to get XYZZYX. For an odd number of digits, it's only slightly trickier as the middle element acts as the mirror and has no constraints. For 7 digits we would simply generate the first 4 digits, but flip only the first 3:
XYZT --> XYZTZYX.

length.firsthalf <- ceiling(n / 2)
digit.list <- rep(list(0 :9), length.firsthalf)
# remove 0 as option for first digit
digit.list[[1]] <- 1 : 9
# expand grid
eg <- expand.grid(digit.list)

A quick word on the expand.grid() function which has saved me on multiple occasions. It essentially generates all possible cross-combinations of arguments provided. Easier to see through a simple example:

expand.grid(c(1, 2, 3), c("a", "b"))

Var1 Var2
   1    a
   2    a
   3    a
   1    b
   2    b
   3    b

For our PalindromesQuick function it is incredibly valuable, as we want something generic enough to adjust to any n. So we create a list with all possible options at each position of the first half, stick that into expand.grid() and voila! No need for complicated nested for loops, especially as we wouldn't know how many for loops to nest in the first place!

We can then wrap that in a function to determine how many palindromes were found as well as minimal distance between consecutive palindromes. The only trick is carefully handling n being even or odd as discussed previously.

PalindromesQuick <- function(n) {
  length.firsthalf <- ceiling(n / 2)
  digit.list <- rep(list(0 :9), length.firsthalf)
  # remove 0 as option for first digit
  digit.list[[1]] <- 1 : 9
  # expand grid
  eg <- expand.grid(digit.list)
  if (n %% 2 == 0) {
    flipped.ending <- apply(eg, 1, function(row) {
        paste0(rev(row), collapse = "")
      })
  } else {
    flipped.ending <- apply(eg, 1, function(row) {
        paste0(rev(row[- length(row)]), collapse = "")
      })
  }
  eg$ending <- flipped.ending
  palindromes <- sort(as.numeric(
                   apply(eg, 1, paste0, collapse = "")))
  return(palindromes)
}

Having written a slow and quick function, we should pause to think about how much faster quick is, and talk about timing our R functions.

Timing in R

When I started using R, my trick for timing functions was simply to record the absolute time before and after execution and take the difference:

t1 <- Sys.time()
output.slow.7 <- PalindromesSlow(7)
t2 <- Sys.time()
t2 - t1
1.174394 mins

I later discovered it could all be done in one go with the built-in system.time() function:
system.time(output.slow.7 <- PalindromesSlow(7))
   user  system elapsed 
 68.348   0.544  69.016 

They each have small caveats one should be aware of in how they track elapsed time, but for most timing purposes return exactly what you care about. system.time() has the advantage of being a one-liner, but the former solution is easier to use if you have multiple lines you want to time and also makes it easier to comment in/out of code.

From now on we'll solely focus on the Quick version; a 600X improvement is definitely something we'll take!

A quick comment on efficiency and speed: it's always important to weigh the pros and cons. While a function that runs faster always seems preferable on paper, one must also consider whether the speed-up is necessary in the first place, as well as the additional complexity it brings. Here the quick version was just a few lines longer and simple to understand, and the increase in speed allows us to explore numbers that are 14 digits long in about the time the slow version took to explore 7-digit numbers. However, this is not always the case; we'll see in challenge #3 that trying to be efficient is not always straightforward to implement and scale, and the gain might still be insufficient to explore the higher dimensions.

Results

Using our PalindromeQuick function, let's look at the number of palindromes and minimal distance between consecutive palindromes:

 N   Number of Palindromes   Minimum Difference
 1                       9                    1
 2                       9                   11
 3                      90                   10
 4                      90                   11
 5                     900                   11
 6                     900                   11
 7                    9000                   11
 8                    9000                   11
 9                   90000                   11
10                   90000                   11
11                  900000                   11
12                  900000                   11
13                 9000000                   11
14                 9000000                   11
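The table itself can be produced with a small wrapper around PalindromesQuick (my own sketch; the largest values of N take a little while to run):

# Number of palindromes and minimal consecutive difference for each N
SummarizePalindromes <- function(n) {
  p <- PalindromesQuick(n)
  c(n = n, count = length(p), min.diff = min(diff(p)))
}
t(sapply(1 : 10, SummarizePalindromes))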

Although we haven't provided a formal mathematical proof, I think we at least have a good idea of the challenge's answers.

If n is even, we have as many palindromes as there are combinations of n / 2 digits with the first one not being a 0, that's 9 * 10 ^ (n / 2 - 1).
If n is odd, we have as many palindromes as there are combinations of (n + 1) / 2 digits with the first one not being a 0, that's 9 * 10 ^ ((n + 1) / 2 - 1).

Minimal difference is somewhat trickier...




Based on the simulations, the minimal difference fluctuates a bit at first, but definitely seems to stabilize quickly at 11 for numbers with 4 or more digits.

It’s an interesting problem: if we want to make a small change to an existing palindrome to create the next small palindrome, we would typically want to make small changes to the right hand side of the number (units, tens...). But because of the palindromic nature of our numbers, small changes to the right will lead to the same change on the far left and thus a huge delta between our two numbers. The obvious option is to aim for the middle, but that’s still an absolute change of the order of 10 ^ (n / 2).

If we look at examples with small n, we see we can do better when we have nine(s) in the middle: 393 and 404, 5995 and 6004. This is basically our initial trick of incrementing the "middle", except that after 9 we wrap around to 0 and need to increment the digits immediately to the right and left: 37973 increments to 38983 (delta of 1010), but 39993 increments to 40004 (delta of 11).
So we've proven the delta is no bigger than 11, as X99...99X (with X any digit between 1 and 8) and (X+1)00...00(X+1) will have a delta of 11. There will be 8 such instances, no matter the number of digits of the palindrome. Proving that 11 is actually the best we can do is a much tougher question.

Back to the challenge: we can quite confidently answer that there are 9e175 palindromes with 351 digits, and that the minimal difference is 11.

Before we sign off, let's finish with a small plot of the distribution of differences between consecutive palindromes. Given our previous comments, and at a very high level, we should expect a big spike around 10 ^ (n / 2), then a spike about one-tenth of that at 10 ^ (n / 2 - 1) whenever the middle digit is a 9, then a spike about one-hundredth of it at 10 ^ (n / 2 - 2) whenever the middle digits are 999, and so on. And as noted previously, 8 cases with a delta of 11.
Turns out we weren't too far off!

For N = 13:

For N = 14:

For the graphs, despite the numerical nature of the x-axis, we naturally went for barplots instead of histograms given the very peculiar discrete distribution!