The hand-written assembly does not compile under Clang due to register
size mismatches. Using a loop is slower (~6 instructions on O2 as
opposed to 2 with hand-written assembly), but using the pause
instruction makes this more efficient even under TCG.