Each time you sign in to a password-protected website, your username and password are compared against records stored in databases. If they match, in you go. Password storage isn’t a programmer’s typical concern - similar to firing pistons in a car engine, they’re so deeply embedded in the engine that a car manufacturer doesn’t really have to consider how they work. Programmers sit on top of application frameworks that do a lot of tasks automatically, without having to care about how exact details work. When the user enters a password, it is somehow stored safely and securely, and above all, automagically. This article takes the magic out of the picture, and explores exactly how passwords are securely stored.
Let’s start by defining “secure storage”. Password storage is considered secure when no additional information can be obtained, even if the “book of passwords” is handed over to a malicious user. Even if one is leaked, securely stored passwords do not pose a security threat to the website as a whole.
How can programmers minimize the impact of a security breach? How can open password lists be exploited, and what defenses can be taken? Here are some core elements of securely stored passwords:
Stored Passwords Must be NON-REVEALING
Consider a database of 3 users. Their passwords are stored in plain text as following:
ID |
Username |
Password |
1 |
Alice |
123 |
2 |
Bob |
777 |
3 |
Carol |
123 |
A common technique for concealing a password is using a “hash”, also known as a “digest”. A hash has 2 key properties: 1) one-wayness, and 2) uniqueness.
For example, a common digest algorithm, MD5, hashes the string “123” into 202cb962ac59075b964b07152d234b70.
Looking at the result string, you can’t figure out that it’s derived from “123”. This is the one-wayness of the hash, as the process is irreversible. When a website compares the password entered, it compares the digest of the password against another digest. At no point would the server ever know what the password is.
Consider the hash scheme: add 1 to each number. The passwords from the table above are stored as “234, 888, 234” respectively. It takes some guesswork, but the stored numbers are reversible. So, adding 1 to each number does not satisfy the one-wayness property.
What about using the sum of all digits as a hash? That would return “6, 21, 6”. While these numbers aren’t reversible, by definition, they fail to meet the uniqueness property. There are texts other than “123” that can generate the same digest as that of “123”. While in theory this shouldn’t happen, when it does, it’s called a “hash collision’. A good algorithm should have almost no chance of collision. In this regard, MD5 isn’t very good.
[fig 1. “1234” stored in md5 hash, compared in hash]
Stored Passwords Must be UNIQUE
For clarity purposes, let’s use MD5 as the default hash algorithm for the rest of the article. The MD5 hash of “123” will always be the same, regardless of the database where the hash is stored. Security researchers and hackers alike have figured out ways to identify common hashes.
Direct hash is easy to detect and poses several threats. First, the hash of one database can be reused in another - If Bob uses 777 for all his passwords, they are all leaked when the hash of one database is compromised. Knowing the uniqueness of the hash, it can be derived that the password to the other website is also 777. It doesn’t matter how “secure” a password is set - It takes only one careless user to set his password to a known hash, such as “12345”. Scanning for known hashes is one of the first steps when processing a harvested data dump.
One defence against a security breach is to prefix the password text with a secret phrase. Different websites can use different phrases. For example, Site A uses “private”, and Site B uses “secret”. Again, because of uniqueness, MD5(“private123”) and MD5(“secret123”) give vastly different results. This secret phrase, is called a “salt”. The digest of a string that’s altered by a salt is called a “salted hash”.
[fig 2. “1234” salted differently in separate systems, hash non-comparable]
Stored Passwords Must be NON-DETERMINISTIC
The uniqueness of a hash provides the basis for comparison without knowing the original text. The issue with this process is that it is deterministic - the same string always ends up with the same hash. Why is this a problem? Look at our sample database of 3 users. Alice and Carol have the same password, and therefore the same password digest. Without knowing the salt we can still draw statistical inferences. High occurrence of the same hash means they belong to the same user, or are derived from the same, possibly popularly unsecure, passwords. Ones that are like “12345”, or “password”, for example. Given a large enough sample size, a histogram of the hashes can be built, normalizing their distribution, and opening up the ability to link the hashes to another database, even if they have different salts.
Deterministic hash is also undesirable for a single user. Look at the following hashes of John’s password, taken over the course of 3 months:
January: a02cb962ac59075b964b07152d234b70
February: c59075b202cb234b70962a964b07152d
March: a02cb962ac59075b964b07152d234b70
It’s not clear what John’s password, as it’s stored as salted hash. It’s also not clear how his password relates to other users, or other websites, as the data only shows his individual records. Yet it is clear that John changed his password back to the one he was using in January.
Randomness can be added by encrypting the hashes. Unlike hashes, which are irreversible, encrypted text can be reverted through decryption. Before explaining How encryption benefits hash storage, let’s explore more on Where encryption is used in a password storage scheme.
[fig 3. Passwords stored in different methods]
The passwords in the above diagram may seem indifferent, but they are generated by different methods. Only one of these methods is correct.
Correct: Original Password → Digest ←→ Encrypted Digest
Incorrect #1: Original Password ←→ Encrypted Password
Incorrect #2: Original Password ←→ Encrypted Password → Digested Encrypt
The same text is encrypted to different results each time, and the different encrypted text can come back to the same original text. More on how encryption works soon - First let’s look at what is wrong with storing “Encrypted Password” and “Digested Encrypt”.
“Encrypted Password”: is incorrect because in theory an encrypted text can be decrypted. It’s not as irreversible as, say, a hash that is by definition one-way. Also the owner of the website, or the programmer with all the salts and keys can figure out the original password of a user - a serious violation of security as well as privacy.
“Digested Encrypt”: is incorrect because encrypted text varies, and their digests vary accordingly. A hash can only be used for comparison. When they are different, there is nothing to compare.
An “Encrypted Digest” is correct. Here’s why:
Most encryption algorithms use a user-specified key, much like our salt, to process the target text. Take a look at this oversimplified model where we add a number to each digit. Pick the number 3, and we “encrypt” the string “1234” we get “4567”. Decrypting the string by subtracting 3 from each digit we get “1234” back again. So 3 is the encryption key.
[fig 4. “1234” salted differently first, then hashed, then stored encrypted. hash comparison takes place after decryption]
Encryption is a reversible process. Each time the same text is encrypted, a different result is yielded. Yet this result can be decrypted to the same digest, so it can be determined whether the supplied password matches the record. The stored password does not show the actual password, and is random enough to not reveal other information.
Now that’s good password storage!