Have you ever thought, “huh, I wonder how easily I could DDoS myself despite all the hard work of library and framework authors?” In that case, have I got news for you!
In the vein of Uses and Abuses Of Python’s Excessive Dynamism, I have found a way to DDoS yourself via hash collision in a measly 3-ish lines of Python!
Background
Since I am an unbearable pedant and this blog post would be about 10 lines long if I didn’t write a bunch of unnecessary exposition, I am going to start out with a brief explanation of hash tables/sets!
Alternatively, you could just read Wikipedia’s explanation.
A hash table is a data structure that is commonly used to implement maps, which you learned about last time. It’s also used to implement sets, which are like maps except, instead of mapping keys to values, they just store a bunch of values.
The basic way a hash table works is that we start out with an array of some size, and we consider each index of the array a “bucket.” Then, each time we insert a key (or value, in the case of a set), we take a hash of the key and use that to determine what bucket to insert it in. How a hash is calculated is not important (especially not for this post, as you’ll see!), but it is a deterministic one-way function from any object to an integer. Deterministic means if I ask you for the hash of an object multiple times, it must return the same value, or everything will be ruined for ever.
Once we have the hash, we find the value of that hash modulo the length of the array. For example, if we were working with a backing array of length 10, and our hash was 1238, we would select bucket 8. Then we find what bucket 8 points at, which is generally a linked-list of values of whatever type we’re putting in.
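In code, bucket selection is nothing more than that modulo (same toy numbers as above):

```python
hash_value = 1238   # pretend this is the hash of our key
num_buckets = 10    # the length of our toy backing array
print(hash_value % num_buckets)  # 8 -- so it goes in bucket 8
```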
Okay, back up, for concreteness, let’s assume we’re looking at a hash set (not table) of strings. So our backing array is a length-ten array of linked-lists of strings.
That’s better. So we’re at the linked-list that bucket 8 of our array points to. Now we iterate through this list and make sure that the string we’re putting in isn’t already in there. This means that our string has to be compared with every other string we’ve already put into bucket 8. So every string whose hash ends in an 8 will go in this bucket, and we have to compare against each of those every time we add a new string whose hash ends in 8.
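Here’s a deliberately naive sketch of that design, using plain Python lists to stand in for the linked lists. (Real implementations, including CPython’s sets, use open addressing and resize the backing array; this is just the textbook chaining picture.)

```python
class ToyHashSet:
    def __init__(self, num_buckets=10):
        # each bucket is a plain list standing in for a linked list
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket_for(self, value):
        return self.buckets[hash(value) % len(self.buckets)]

    def add(self, value):
        bucket = self._bucket_for(value)
        # compare against everything already in the bucket; this is
        # the part that gets slow when lots of hashes collide
        if value not in bucket:
            bucket.append(value)

    def __contains__(self, value):
        return value in self._bucket_for(value)

s = ToyHashSet()
s.add("hello")
print("hello" in s)  # True
print("world" in s)  # False
```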
Now this sounds somewhat scary, right? The whole point of hash sets is that they have O(1) (that is, constant time, also known as “screaming fast”) insertion and membership tests (i.e. “is this string already in my set?”). But if I have a gazillion strings that hash to something that is 8 mod 10, I have to do a gazillion string comparisons to verify whether this particular string that goes in this bucket is, in fact, new!
All that is to say: if your hash function has “bad” behaviour (where “bad” means hashing a lot of different objects to the same bucket) and/or your backing array is too small for the size of the set you’re working with, all this stuff you expect to be screaming fast will be super slow instead, which is just the worst if you’re a developer. Serious computer scientists have of course done a lot of work to come up with great hashing algorithms and heuristics for when it’s time to resize your backing array to accommodate more objects (strings, in our case), but nothing is stopping you from defining a hash function that is super terrible.
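You can actually watch CPython’s resizing heuristic fire as a set grows. (This leans on `sys.getsizeof`, and the exact thresholds and byte counts are implementation details that vary by Python version.)

```python
import sys

s = set()
last_size = sys.getsizeof(s)
for i in range(1000):
    s.add(i)
    size = sys.getsizeof(s)
    if size != last_size:
        # the backing array just grew, keeping the buckets sparse
        print(f"resized at {len(s)} elements: {last_size} -> {size} bytes")
        last_size = size
```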
Okay, can we ruin everything now?
Here’s a great way to ruin everything:
```python
class Foo:
    def __hash__(self):
        return 1
```
Now every time you try to add a new `Foo` object to a hash set, you’re gonna get them all in the same bucket (this is what we call a hash collision). So just throw a ton of those into a set every time your website loads and your customers will all hate you.
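If you’d like to watch the fire spread, here’s a rough benchmark (`Foo` is repeated from above so the snippet runs on its own; absolute timings will vary by machine, but the shape won’t):

```python
import time

class Foo:
    def __hash__(self):
        return 1  # every Foo lands in the same bucket

for n in (1_000, 2_000, 4_000):
    start = time.perf_counter()
    s = set()
    for _ in range(n):
        s.add(Foo())  # each insert walks past every Foo already there
    print(f"{n:>5} inserts: {time.perf_counter() - start:.2f}s")
    # doubling n roughly quadruples the time: that's O(n^2) in action
```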
But wait… we can do better than that, right? If we can only make our own objects degenerate, we’ll have to go out of our way to slow our website to a crawl. Couldn’t we improve that somehow? Couldn’t we make our website terrible even in the face of all the hard work computer scientists and framework authors have done to keep things running smoothly?
Of course we can! This is Python! With adequate cleverness, we can ruin anything!
Just stick something like this into your codebase:
```python
for name, thing in globals().items():
    if isinstance(thing, type):
        thing.__hash__ = lambda self: 1
```
Now every user-defined type in your module’s globals will have a horrible hashing algorithm, guaranteeing that anywhere those types are used in sets or as keys in a map, every insertion and membership test will be O(n) instead of O(1)! Take that, helpful, thoughtful library authors!
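To check the damage, run the loop in a module with some classes already defined. (`Bar` and `Baz` here are hypothetical stand-ins for whatever your codebase declares; note that if any built-in or extension types are sitting in `globals()`, e.g. from imports, the assignment will raise a `TypeError`.)

```python
class Bar:
    pass

class Baz:
    pass

# the sabotage loop from above
for name, thing in globals().items():
    if isinstance(thing, type):
        thing.__hash__ = lambda self: 1

print(hash(Bar()), hash(Baz()))  # 1 1 -- everybody collides with everybody
```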
Another terrible option
Try this one on for size:
```python
class Foo:
    def __hash__(self):
        return 1

f = Foo()
s = {f}
print(f in s)  # True

Foo.__hash__ = lambda self: 2
print(f in s)  # False
```
That second `print` says `False` even though `f` is still sitting in `s`. You can turn every dict and set in your codebase into a liar in one swell foop!
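Why the lie? `f` was filed under its old hash of 1, but the lookup now computes hash 2 and probes the wrong bucket. A linear scan, which never consults the hash, still finds it (continuing from the snippet above):

```python
print(f in list(s))  # True -- a plain == scan never consults the hash
```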
Merry self-destruction, everyone.
Unfortunate addendum
You can run something like `__builtins__.hash = lambda *args: 1`, but it does not affect the algorithm that’s used to hash strings inside dicts and sets: those containers call the type’s hash slot directly down in the C internals, which we can’t mess with from Python. And `str.__hash__ = lambda *args: 1` throws a `TypeError`. Sadly, we cannot ruin the most common type of map, string -> x, so easily 😦
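For the record, here’s what both failures look like (this uses `import builtins`, the safer spelling of `__builtins__`; the exact error message varies by Python version):

```python
import builtins

builtins.hash = lambda *args: 1  # rebinding the name works...
print(hash("hello"))             # 1 -- code that looks up `hash` by name is fooled

s = {"hello"}
print("hello" in s)              # True -- set internals never call builtins.hash

try:
    str.__hash__ = lambda *args: 1
except TypeError as e:
    print(e)  # e.g. "cannot set '__hash__' attribute of immutable type 'str'"
```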