PIN analysis
Ian’s messages made me chuckle. Then, later the same day, I read this XKCD cartoon. The merging of these two humorous topics created the seed for this article.
What is the least common PIN number?
If you had to make a prediction about what the least commonly used 4-digit PIN is, what would be your guess? This tangentially relates to the XKCD cartoon. In Randall’s cartoon, the perpetrator’s plan backfired because his selected license plate was so unique that it was very memorable. What is the least memorable license plate? Ask any spy you know (snigger) what the best way to blend into a crowd is. Their answer will be to not stand out, to appear “normal”, and to not be notable in any way.
DISCLAIMER
This article is not intended to be a hacker bible, or to be used as a utility, resource, or tool to help would-be thieves perform nefarious actions. I will only disclose data sufficient to make my points, and will try to avoid giving specific data outside of the obvious examples. I do not want to be an enabler for script-kiddies. Please do not email me asking for the database I used; if you do, you will be wasting your time as I’m not going to respond. I’m not going to sell, donate or release the source data – don’t ask!
Source
Obviously, I don’t have access to a credit card PIN number database. Instead I’m going to use a proxy. I’m going to use data condensed from released/exposed/discovered password tables and security breaches.
Soap Box – Password Database Exposures
I’m not trying to sell my services as a consultant here (though if you are interested, my rates are very reasonable compared to the cost of legal defense, potential FTC sanctions, class action suits, shareholder backlash, fines, loss of reputation and business …). There are plenty of security experts in the industry who can help you (if you need help filtering them and don’t have referrals, someone who has CISSP qualifications is a good place to start).
Bottom line: security strengthens with layers, and the simple application of encryption on your database table can help protect your customers’ data if this table is exposed. It does not defend against all possible attacks, but it does nothing but good things. What possible reason is there to store things in clear-text?
Back to the data
Given that users have a free choice for their password, if users select a four digit password for their online account, it’s not a stretch to use this as a proxy for four digit PIN codes.
The Data
I was able to find almost 3.4 million four digit passwords. Every single one of the 10,000 combinations of digits from 0000 through to 9999 was represented in the dataset.
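For readers who want to run the same kind of tally on their own data, here is a minimal sketch. It is not the code used for this article: the function names (`tally_pins`, `ranked`, `missing_pins`) are made up for illustration, and the sample list is invented toy data.

```python
from collections import Counter
import re

# Exactly four ASCII digits, nothing else.
FOUR_DIGITS = re.compile(r"[0-9]{4}")

def tally_pins(passwords):
    """Count how often each four-digit, all-numeric password occurs."""
    return Counter(p for p in passwords if FOUR_DIGITS.fullmatch(p))

def ranked(counts):
    """Return (pin, share) pairs, most popular first; share is a fraction of all PINs."""
    total = sum(counts.values())
    return [(pin, n / total) for pin, n in counts.most_common()]

def missing_pins(counts):
    """List any of the 10,000 possible codes that never appear in the tally."""
    return [f"{i:04d}" for i in range(10_000) if f"{i:04d}" not in counts]

# Toy data, invented for illustration only (NOT the article's dataset).
sample = ["1234", "1234", "0000", "hunter2", "2580", "12345", "1234"]
counts = tally_pins(sample)
print(ranked(counts)[0])          # ('1234', 0.6)
print(len(missing_pins(counts)))  # 9997
```

On a dataset where every combination is represented, as it was here, `missing_pins` would return an empty list.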
The first “puzzling” password I encountered was 2580 in position #22. What is the significance of these digits? Why should so many people select this code to make it appear so high up the list?
(Another fascinating piece of trivia is that people seem to prefer even numbers over odd, and codes like 2468 occur higher than an odd number equivalent, such as 1357.)
Cumulative Frequency
As noted above, the more popular password selections dominate the frequency tables. The most popular PIN code of 1234 is more popular than the lowest 4,200 codes combined! That's right, you might be able to crack over 10% of all codes with one guess! Expanding this, you could get 20% by using just five numbers! Below is a cumulative frequency graph:
Statistically, one third of all codes can be guessed by trying just 61 distinct combinations! The 50% cumulative chance threshold is passed at just 426 codes (far fewer than the 5,000 that a uniformly random distribution would predict). Paranoid yet?
Bottom of the pile
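Both ends of the ranking, the handful of codes at the top and the long tail at the bottom of the pile, can be measured from the same frequency table. The sketch below assumes a `Counter` mapping each PIN to its number of occurrences; the helper names and the toy counts are invented for illustration and are not the article's data.

```python
from collections import Counter

def sorted_counts(counts):
    """Frequencies sorted from most to least popular."""
    return sorted(counts.values(), reverse=True)

def codes_needed(counts, target):
    """Smallest number of top-ranked codes whose combined share reaches `target` (0..1)."""
    total = sum(counts.values())
    running = 0
    for k, n in enumerate(sorted_counts(counts), start=1):
        running += n
        if running >= target * total:
            return k
    return len(counts)

def bottom_share(counts, k):
    """Combined share of the k least popular codes."""
    if k <= 0:
        return 0.0
    total = sum(counts.values())
    return sum(sorted_counts(counts)[-k:]) / total

# Toy counts, invented for illustration only.
toy = Counter({"1234": 50, "1111": 20, "0000": 10, "1212": 10, "4821": 5, "9073": 5})
print(codes_needed(toy, 0.5))   # 1  (the top code alone covers half of this toy set)
print(codes_needed(toy, 0.75))  # 3
print(bottom_share(toy, 2))     # 0.1
```

With a uniform distribution, `codes_needed(counts, 0.5)` would come out at half the number of distinct codes; the further below that it falls, the more skewed the choices are.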
Many of the high frequency PIN numbers can be interpreted as years, e.g. 1967 1956 1937 … It appears that many people use a year of birth (or possibly an anniversary) as their PIN. This will certainly help them remember their code, but it greatly increases its predictability. Just look at the stats: every single 19?? combination can be found in the top fifth of the dataset! Below is a plot of this in graphical format. In this chart, each yellow line represents a PIN number that starts 19??
If all the passwords were uniformly distributed, there should be no significant difference between the frequency of occurrence of, for instance, 1972 and any other PIN ending in seventy two (??72). However, as we shall see, this is not the case at all. 1972 occurs in ordinal position #76 (with a frequency of 0.099363%). Here’s a histogram for the occurrences of all ??72 PINs. You can clearly see the spike at 1972 (with smaller spikes at 7272 and 1472). If you calculate the ratio of the peak of 1972 to the average of all the other ??72 PINs you get a ratio of 22:1. PINs starting with 19?? are much more likely to occur.
Of course, it’s not just 1972. Here is a plot of the ratio of 19 to non-19 for all hundred combinations. Along the x-axis are all the combinations of last two digits XX, and for each of these the ratio of 19XX to the average of all the other ??XX occurrences has been calculated. Here’s the chart:
It's a pretty good approximation for a demographic chart (suggested by the red-dashed trend line), which would probably allow a fair estimation of the ages (years of birth) of the people using the various websites. (Of course, hackers invert this strategy and use the age of a target to help guess the user's PIN. Looking at this graph, this might give them up to a 40x advantage!) Just about all the ratios are above 1.0.
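The 19XX-to-everything-else ratio described above can be written down in a few lines. This is a sketch under assumptions, not the code behind the charts: `year_ratio` is a hypothetical helper, and the toy counts are deliberately chosen so the result lands on the 22:1 figure quoted for 1972.

```python
from collections import Counter

def year_ratio(counts, suffix, prefix="19"):
    """Ratio of count(prefix + suffix) to the mean count of the other 99 codes ending in suffix.

    Returns None when none of the other ??suffix codes occur, since the ratio is undefined.
    """
    others = [
        counts.get(f"{p:02d}{suffix}", 0)
        for p in range(100)
        if f"{p:02d}" != prefix
    ]
    mean_others = sum(others) / len(others)
    if mean_others == 0:
        return None
    return counts.get(prefix + suffix, 0) / mean_others

# Toy counts, invented for illustration: every ??72 code appears 10 times,
# except 1972, which appears 220 times.
toy = Counter({f"{p:02d}72": 10 for p in range(100)})
toy["1972"] = 220
print(year_ratio(toy, "72"))  # 22.0
```

Calling it once for each of the hundred suffixes 00 through 99 gives the series that the ratio chart plots along its x-axis.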
The notable exceptions are ??34 and ??00 (which are easy to explain, since the massive popularity of 1234 and 0000 dwarfs 1934 and 1900 respectively). Similarly 33 44 55 66 … are lower than expected as the quad codes like 3333 mask out even the 1933 boost. There are also spikes in the graph corresponding to the popular PINs of 1919, 1984 and 1999.
Patterns in data
You could look at this plot all day! The bright line for the leading diagonal shows the repeated couplets that people love to use for their PIN numbers 0000 0101 0202 … 5454 5555 5656 … 9898 9999. Every eleventh dot on the leading diagonal is brighter, corresponding to the quad numbers, e.g. 4444 5555. Here is a larger scale version:
Interesting things
There are so many interesting things to learn from this heatmap. Here are just a couple:
More than four
The purpose of this posting was to investigate patterns and frequency of four digit PIN numbers. However, the database I collected also has all-numeric passwords of different lengths. It's worth taking a quick look at these too. I found close to 7 million all-numeric passwords. Approximately half of these were the four-digit codes we've just examined. Six digit codes are the next most popular length, followed by eight. I hope, hope that the people who have passwords nine digits long are not using their Social Security Numbers! Below are the top 20 passwords for the various lengths, along with their share of their same-size namespace.
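A per-length breakdown of this kind could be produced along the following lines. Again this is only a sketch with made-up names (`by_length`, `top_with_share`) and invented toy data; "share" here means the fraction of all numeric passwords of the same length.

```python
from collections import Counter, defaultdict

def by_length(passwords):
    """Group all-numeric passwords by length: {length: Counter of codes}."""
    groups = defaultdict(Counter)
    for p in passwords:
        if p.isascii() and p.isdigit():  # ASCII digits only, no letters or symbols
            groups[len(p)][p] += 1
    return groups

def top_with_share(groups, length, n=20):
    """Top-n codes of a given length with their share among codes of that length."""
    counter = groups.get(length, Counter())
    total = sum(counter.values())
    return [(code, count / total) for code, count in counter.most_common(n)]

# Toy data, invented for illustration only.
sample = ["1234", "1234", "0000", "12345", "123456", "123456", "111111", "abc123"]
groups = by_length(sample)

# How many numeric passwords there are of each length.
length_counts = {length: sum(c.values()) for length, c in sorted(groups.items())}
print(length_counts)                    # {4: 3, 5: 1, 6: 3}
print(top_with_share(groups, 6, n=1))   # top six-digit code and its share
```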
Some interesting observations (and a little speculation)
For five digit passwords, users appear to have even less imagination in selecting their codes (22.8% select 12345). All the usual suspects occur, but a new entry is the puerile addition in position #20 of the concatenation of 420 and 69.
For six digit passwords, again 696969 appears highly. Also of note is 159753 (an "X" mark over the numeric keypad). James Bond returns with 007007.
For seven digits, the standby of 1234567 is a much lower frequency (though still the top). I speculate that this is because many people may be using their telephone number (without area code) as a seven digit password. Telephone numbers are fairly distinct, and already memorized, so when a seven digit code is needed, they spring to mind easily. The higher frequency of usage of telephone numbers reduces the need to use imagination (or lack thereof) and select something else. Is Jenny there? The fourth most popular seven digit password is 8675309 (it's a popular 80's song).
Eight digit passwords are just as expected. Lots of patterns, and lots of repetition.
Common nine digit passwords also follow patterns and repetition. 789456123 appears as an easy "along the top, middle and bottom of the keypad" pattern, and 147258369 is related in the vertical direction (and other variants appear high up). Again we get a 420 moment with 420420420, and also the shaken, not stirred, but repeated 007007007 returns.
Interestingly, for ten digits 1029384756 appears (alternating ascending/descending digits), as well as the odd/even 1357924680. Hurrah for math! In position #17 of the ten digit password list we get 3141592654 (the first few digits of Pi).
Conclusions
Since publishing this article, it's been brought to my attention that, of course, in addition to anniversary years, many people encapsulate dates in the format MMDD (such as birthdays …) for their PIN codes. This clearly explains the lower left corner where, if you look at the heatmap, there is a huge contrast change at the height of around 30-31 (the number of days in a month), extending to 12 on the x-axis. (Thanks to zero79 for first pointing this out.)
Many people also asked about the significance of 1004 in the four character PIN table. This comes from Korean speakers. When spoken, "1004" is cheonsa (cheon = 1000, sa = 4). "Cheonsa" also happens to be the Korean word for Angel.
Another XKCD cartoon
It only seems appropriate to end with another XKCD cartoon. This one is Password Strength.