We now divide each scanline into prologue, aligned run, and epilogue.
Basically, we draw enough pixels one-by-one until we reach a grid
intersection. Then we draw full grid cell slices using fast memory
fills. Finally we go back to one-by-one for the epilogue.
This is roughly 2.5x faster in a microbenchmark and no longer dominates
the ImageViewer and PixelPaint resizing profiles.