Note: I wrote this a couple years ago, and abandoned it because it didn’t have a satisfying conclusion. It’s not a great story narratively but I still think it’s interesting, so here it is!
Some years ago, my coworker Zack wrote a chess engine in Rust, to which I contributed a bit. Recently, my coworkers and I have started a new game of office chess against it, and it turns out that it runs too damn slow. On an early board position, it hadn’t come up with an answer after running on my MacBook Pro for about 8 hours when searching a 9-ply tree. And then the next day I ran it with an 8-ply search, and 8 hours was once again insufficient.

I decided to try to make it fast, and then got sort of obsessed for a couple weeks and knocked out, depending on what task you’re running, 60-80% of the runtime. For the rest of this post, assume I’m running all timings against this position (sure hope it’s representative, lol).
An Extremely Brief Introduction To Chess Engines Using α-β Search
This isn’t what this post is about, so here’s the least you should need to know to read this post: Chess is a two-player perfect-information game, so it permits an α-β search for finding moves. What you do is go through each move from a position (each level of this recursion is a ply), and then recurse; when you bottom out (however deep you wanted to search), you score the position based on some heuristics and go back up the tree. You use some algorithmic magic to stop looking at subtrees that can’t change the outcome once you know a move is really good or really bad.
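If you want slightly more than the hand-wave, here’s a minimal negamax-with-α-β sketch. Everything in it (Position, Move, the helper functions) is a hypothetical stand-in, not the engine’s actual API.

```rust
// Minimal negamax with α-β pruning. All types and helpers here are
// hypothetical stand-ins, not the engine's real code.
struct Position; // stand-in for a board state
struct Move;     // stand-in for a move

fn legal_moves(_pos: &Position) -> Vec<Move> { Vec::new() }        // stub
fn make_move(_pos: &Position, _mv: &Move) -> Position { Position } // stub
fn evaluate(_pos: &Position) -> i32 { 0 }                          // stub heuristic

fn negamax(pos: &Position, depth: u32, mut alpha: i32, beta: i32) -> i32 {
    if depth == 0 {
        return evaluate(pos); // bottomed out: score the position heuristically
    }
    let mut best = i32::MIN + 1; // +1 so negating can't overflow
    for mv in legal_moves(pos) {
        let child = make_move(pos, &mv);
        // Negamax trick: the opponent's best score, negated, is our score.
        let score = -negamax(&child, depth - 1, -beta, -alpha);
        best = best.max(score);
        alpha = alpha.max(score);
        if alpha >= beta {
            break; // the α-β "magic": this line is already too good, the opponent avoids it
        }
    }
    best
}
```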
Chess games can reach the same position via multiple paths (transposition), so you want to cache positions that you’ve already scored, again to avoid looking through big trees of moves, which takes forever because of exponents.
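And here’s a sketch of the caching idea (a transposition table), again with made-up names: real engines key entries on a compact hash of the position (e.g. a Zobrist hash) rather than the position itself, and store more than just a score, but the principle is “look it up before you search it.”

```rust
use std::collections::HashMap;

// Sketch of a transposition table; PositionKey stands in for some hash of the
// board state. Hitting the cache means skipping a whole subtree search.
type PositionKey = u64;

struct TranspositionTable {
    scores: HashMap<(PositionKey, u32), i32>, // (position, searched depth) -> score
}

impl TranspositionTable {
    fn new() -> Self {
        TranspositionTable { scores: HashMap::new() }
    }

    fn lookup(&self, key: PositionKey, depth: u32) -> Option<i32> {
        self.scores.get(&(key, depth)).copied()
    }

    fn store(&mut self, key: PositionKey, depth: u32, score: i32) {
        self.scores.insert((key, depth), score);
    }
}
```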
What Not To Do
I don’t recommend taking this approach, but I had an idea for how to speed up the program before I bothered profiling anything, and it turned out I was really, really right. When we were caching the scores for various board states, we were caching the entire series of moves from the start state to the end of the recursion (the variation). That’s a ton of wasted space, and this is a very memory-hungry application that wants to cache results as aggressively as possible. I didn’t profile exactly where the improvement came from, but I moved to stashing only the remaining part of the variation: we can still stitch the full variation back together higher up in the stack, but we save a ton of memory. This was worth a 30-50% speedup; you can see the implementation in this small patch.
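In sketch form, with hypothetical types (the real cache entries hold more than this, and Commit here stands in for the engine’s move type), the change looks something like:

```rust
// Hypothetical stand-in for the engine's move type.
#[derive(Clone)]
struct Commit { /* move data */ }

// Before: each cache entry carried the whole line of play from the root, so
// deep entries duplicated all of their ancestors' moves.
struct CachedBefore {
    score: i32,
    full_variation: Vec<Commit>, // root..leaf: mostly redundant
}

// After: store only the continuation from the cached position onward...
struct CachedAfter {
    score: i32,
    remaining_variation: Vec<Commit>, // this position..leaf
}

// ...and stitch the full line back together as the recursion unwinds.
fn stitch(commit_made_here: Commit, child: &CachedAfter) -> Vec<Commit> {
    let mut line = Vec::with_capacity(child.remaining_variation.len() + 1);
    line.push(commit_made_here);
    line.extend(child.remaining_variation.iter().cloned());
    line
}
```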
In general, “have ideas” has been by far my least successful technique. Most of the things I’ve changed on a hunch have had no noticeable impact on performance, or even a negative one. Don’t be like me. Profile your application.
On The Right Track
Next up, I made a bunch of random changes and couldn’t tell if any of them helped or not. Eventually I got serious and started using tools. The first thing I tried was cargo-flamegraph, which doesn’t work very well on Macs. (Foreshadowing: this story is about to come to the part where I have a second personal breakthrough). I spent a while doing horrible nonsense like booting in recovery mode to disable security settings, and ultimately found out that the tool just doesn’t follow threads on macOS.
Next up, I found the tool cargo profiler, whose installation instructions start with “NOTE: This subcommand can only be used on Linux machines.” So I realized that if I’m trying to profile code in an immature systems language, I need a Linux box. Fortunately, the cost of 1 Linux VM on EC2 is $0, so from here on out I was mostly developing on an Ubuntu instance.
Tools
I found that cargo flamegraph was extraordinarily valuable. Cargo profiler basically didn’t work for me, and I ended up just running valgrind myself, but the results were less useful than I had hoped. Probably this is a me problem. I mostly leaned on flamegraphs, which are wonderful.
Here’s an interesting portion of a flamegraph from this point.

(this isn’t the bottom of the flamegraph, but it’s a relevant part). The bottom row here, if you squint, says it’s in the α_β_negamax_search function. Everything above that is various function calls that the app spent time in while also inside the lower function; each column is sort of a cross-section of the stack. So a wide column means we spent a lot of time in that function while it was also inside all the functions below it. Here you can see from the second row that while in this depth of the α_β_negamax_search call, we spent a little time in a function (in orange) that I can hover over and see is called WorldState.reckless_lookahead; about 1/4 in order_movements_intuitively; and most of the rest in another recursion into α_β_negamax_search.
This is really valuable information! It was very surprising that we’re spending ~20% of our time in order_movements_intuitively, because it’s just sorting a list with < 50 elements!
```rust
fn order_movements_intuitively(experience: &HashMap<Patch, u32>, commits: &mut Vec<Commit>) {
    commits.sort_by(|a, b| {
        let a_feels = experience.get(&a.patch);
        let b_feels = experience.get(&b.patch);
        b_feels.cmp(&a_feels)
    });
}
```
The first thing I did was switch sort_by to sort_unstable_by, which the docs say is usually faster if you don’t care about stability. The docs are right! But it was still too slow. Next up (well, first I combined a few functions together, because we were sorting and then immediately re-sorting with a different comparison function, so this part isn’t as narratively compelling), I pulled the cache lookups out of the comparison function. This was enough to send order_movements_intuitively plummeting from 20% of the time spent in the negamax function to like 4%, which was great. And it’s the second time in a row that fixing the use of hashes was a big performance win, which it turns out is not a theme; there was just some serious low-hanging fruit there.
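If it helps to see the shape of the fix, here’s a minimal sketch under some assumptions: the Patch and Commit stand-ins below are simplified, and the engine’s actual restructuring differs, but the point is that each experience lookup now happens once per move, before the sort, instead of on every comparison inside the comparator.

```rust
use std::collections::HashMap;

// Simplified stand-ins so the sketch compiles on its own; the real Patch and
// Commit types carry actual move data.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Patch(u64);
struct Commit { patch: Patch }

fn order_movements_intuitively(experience: &HashMap<Patch, u32>, commits: &mut Vec<Commit>) {
    // One cache lookup per move, up front, rather than one per comparison
    // (a sort performs O(n log n) comparisons).
    let mut keyed: Vec<(Option<u32>, Commit)> = commits
        .drain(..)
        .map(|c| (experience.get(&c.patch).copied(), c))
        .collect();
    // Unstable sort is usually faster, and we don't care about tie order.
    keyed.sort_unstable_by(|a, b| b.0.cmp(&a.0)); // best-feeling moves first
    commits.extend(keyed.into_iter().map(|(_, c)| c));
}
```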
Rust-Specific Insights
So far, all I’ve done is notice and correct what amount to a couple of significant design flaws. The tools helped me catch the second one, but as it turns out we’re out of this sort of problem. The rest is much harder.
The reason it’s hard is actually a good thing – Rust is really fast. It mostly doesn’t do stupid stuff, and just goes really fast and eats up as much CPU and memory as you’ve got lying around. This is, you’ll note, really fucking cool, and this is one of the reasons writing Rust is fun, especially for a filthy interpreted language pleb like myself. Unfortunately, for all that, it still does get harder.
The next big issue I saw was that some of our functions were spending up to half their time in Vec::extend and/or Vec::push, which themselves are spending a lot of time in low-level stuff like malloc, realloc, and free. This is the part where the project became really fun for me – I write Python in my day job, mostly, and writing Python does not tend to lend itself to performance hunting on the level of “too many mallocs.” So the next step is to remove allocations.
This consisted of two main things. The first is using Vec::with_capacity to pre-allocate an appropriate amount of memory for my vectors. Sometimes I had known lengths (e.g. in the sorting of moves as seen above); sometimes I had known maximums (e.g. there will never be more than eight pawns owned by a particular player); sometimes I had good-enough guesses to feel pretty good about it (e.g. there will be more than 2 bishops in few enough positions that it doesn’t matter). Sometimes I had reasonable guesses – I’ve read that most chess positions have in the neighbourhood of 40 moves.*
The second thing is just not creating vectors in the first place. One of the main move-generation functions was delegating to a bunch of sub-functions to generate moves for each piece type and then pushing all those results onto one result vector. I changed it to pass a mutable, pre-allocated vector around and extend onto that.
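Sketched very roughly (the helper names here are made up, and the real move generation does a great deal more), the combined shape of the two changes:

```rust
// Hypothetical stand-ins for the engine's types.
struct WorldState;
struct Commit;

// Guess at how many moves a typical position has, so the buffer usually
// doesn't need to grow at all.
const TYPICAL_MOVE_COUNT: usize = 40;

// Before (sketch): each per-piece helper built and returned its own Vec, and
// the caller glued them together with extend, paying for several allocations
// per call. After (sketch): the caller pre-allocates once and every helper
// pushes into the same buffer.
fn generate_commits(world: &WorldState) -> Vec<Commit> {
    let mut commits = Vec::with_capacity(TYPICAL_MOVE_COUNT);
    pawn_commits(world, &mut commits);
    knight_commits(world, &mut commits);
    // ...one call per piece type...
    commits
}

fn pawn_commits(_world: &WorldState, _commits: &mut Vec<Commit>) {
    // push pawn moves straight onto the shared buffer instead of returning a new Vec
}

fn knight_commits(_world: &WorldState, _commits: &mut Vec<Commit>) {
    // likewise for knights
}
```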
These changes dropped allocation-related time in the relevant function from ~50% to under 5%, and made some other allocations further down the stack disappear from the perf tool’s sampling entirely.
At this point, measuring from right after I fixed the issue with storing full variations in the hashmap to right after I fixed all these reallocs, I’m seeing the time for a depth-6 search of the relevant position go from about 50-60s to about 30s.
The Bad News
The bad news is this is about where I ran out of ideas and/or energy. I had the temerity to make a couple more small changes by following basically the above pattern, but no more big wins. A couple things showed up that I thought were dumb, and I stamped them out – using a different hash algorithm for some data types was a big improvement on a micro-level but didn’t have a substantial impact on realistic workloads, for example. The program also uses immutable structs for various purposes, including the list of squares on the board and the list of pieces by team; moving those to static variables and returning them instead of re-allocating new ones when they’re needed, again, showed a pretty impressive improvement on the micro benchmark but minimal improvement on actual runs.
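For the static-data change, the pattern looks roughly like this (with a made-up Square type standing in for the engine’s, and the allocating version shown for contrast):

```rust
// Made-up Square type; the engine's real one carries more information.
#[derive(Clone, Copy)]
struct Square(u8);

// Before (sketch): callers got a freshly allocated 64-element Vec every time.
fn all_squares_allocating() -> Vec<Square> {
    (0u8..64).map(Square).collect()
}

// After (sketch): build the table once, at compile time, and hand out a reference.
static ALL_SQUARES: [Square; 64] = {
    let mut squares = [Square(0); 64];
    let mut i = 0;
    while i < 64 {
        squares[i] = Square(i as u8);
        i += 1;
    }
    squares
};

fn all_squares() -> &'static [Square] {
    &ALL_SQUARES
}
```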
What’s left is that there’s just not much to strip out. The program’s threading model could be improved – we kick off one thread per move at the top level position, which frequently results in one expensive move’s calculation taking a long time after the other threads are done. I’m not sure how to make that better while still getting the α-β search benefits.
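For reference, here’s roughly the shape of that threading model, with hypothetical names and none of the engine’s actual plumbing. The join loop is the sore spot: we can’t pick a move until every thread finishes, so one slow subtree leaves the other cores idle.

```rust
use std::thread;

// Hypothetical stand-ins.
#[derive(Clone)]
struct WorldState;
struct Commit;

fn legal_commits(_world: &WorldState) -> Vec<Commit> { Vec::new() } // stub
fn search_after(_world: &WorldState, _commit: &Commit, _depth: u32) -> i32 { 0 } // stub

// One thread per top-level move; the best move is chosen only after *every*
// thread has joined.
fn pick_commit(world: &WorldState, depth: u32) -> Option<(Commit, i32)> {
    let handles: Vec<_> = legal_commits(world)
        .into_iter()
        .map(|commit| {
            let world = world.clone();
            thread::spawn(move || {
                let score = search_after(&world, &commit, depth);
                (commit, score)
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|handle| handle.join().expect("search thread panicked"))
        .max_by_key(|&(_, score)| score)
}
```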
* I don’t know if this is true, or where I heard it. From unscientifically checking some positions, I’m seeing that 40 moves covers >= 90% of them.