This solution doesn’t really use fewer parts than the reference solution. The reference has an unnecessary ramp in the bottom left corner, which never gets used because the puzzle constraints prevent an overflow of register B. I’ve removed that ramp here, hence 30 parts instead of 31.
However, this one is more efficient: it runs in 75% of the time, using only 3*A balls instead of 4*A. We replace the two-bit construction (which results in a 4-cycle) with the gear-bit/bit combination from puzzle 39 (which results in a 3-cycle). That way, we don’t need to waste any balls.
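
As a quick check of the arithmetic, here is a minimal sketch of the ball counts. It assumes that run time is proportional to the number of balls dropped and that each unit of A costs one full cycle of the counter; the function names `balls_needed` and `time_ratio` are made up for illustration and are not part of the puzzle.

```python
from fractions import Fraction


def balls_needed(a: int, cycle_length: int) -> int:
    """Balls used for register value `a`, assuming one full cycle per unit of A."""
    return cycle_length * a


def time_ratio(new_cycle: int = 3, old_cycle: int = 4) -> Fraction:
    """Relative run time, assuming time is proportional to balls dropped."""
    return Fraction(new_cycle, old_cycle)


if __name__ == "__main__":
    for a in (1, 2, 5):
        # 4-cycle: two-bit construction; 3-cycle: gear-bit/bit combination
        print(a, balls_needed(a, 4), balls_needed(a, 3))
    print(float(time_ratio()))
```

Under these assumptions the ratio is 3/4 for every value of A, which is where the 75% figure comes from.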