More parts can be removed, leaving a solution with only 30 parts:
- Route every second ball that enters the first bit of register A through the second bit as well.
The bits are preset for the test case 2 x 3.
https://lodev.org/jstumble/?board=ree0flrf10reerr00eelrr1ere0llerrr1eer0lieerleel_16_16
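To illustrate the "every second ball" idea, here is a minimal Python sketch of two chained bits. It is only a toy model, not the board from the link: the names `Bit` and `drop_balls` are made up for this example, and it assumes the ball is passed on to the second bit whenever the first bit flips from 1 back to 0.

```python
class Bit:
    """Toy model of a Turing Tumble bit: every ball that hits it flips it."""

    def __init__(self, state=0):
        self.state = state  # 0 or 1

    def drop(self):
        """Flip the bit. Return True when it flips from 1 back to 0,
        which (starting from 0) happens on every second ball."""
        self.state ^= 1
        return self.state == 0


def drop_balls(n):
    """Drop n balls onto the first bit; every second one also hits the second bit.

    Returns (first bit state, second bit state, balls that reached the second bit).
    """
    first, second = Bit(), Bit()
    second_hits = 0
    for _ in range(n):
        if first.drop():  # assumption: the 1 -> 0 flip is the exit leading to bit 2
            second.drop()
            second_hits += 1
    return first.state, second.state, second_hits


if __name__ == "__main__":
    for n in range(5):
        print(n, drop_balls(n))
```

In this model the two bits together count the balls in binary (first bit + 2 x second bit), wrapping around after four balls.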
Remark 1: The set-up does not work for 3 x 3.
Remark 2: The number of balls used can probably still be reduced, since every third ball out of four runs down without reaching the registers.