Using a micro with no hardware multiply? You can speed up multiplication with lookup tables of squares!
The 8088 has a hardware multiplier, but it’s quite slow:
MUL with byte arguments: 69 cycles plus 1 for each set bit in AL plus 1 if the high byte of the result is 0
MUL with word arguments: 123 cycles plus 1 for each set bit in AX plus 1 if the high word of the result is 0
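To make those two unsigned timings concrete, here is a minimal Python sketch of the formulas as stated above. The function name `mul_cycles` and its parameters are made up for illustration, and it models only the unsigned MUL figures quoted here, nothing else about the instruction.

```python
def mul_cycles(accumulator, operand, word=False):
    """Estimate 8088 MUL cycles using the figures quoted above.

    accumulator: value in AL (byte) or AX (word)
    operand:     the explicit operand
    word:        False for a byte multiply, True for a word multiply
    """
    bits = 16 if word else 8
    mask = (1 << bits) - 1
    accumulator &= mask
    operand &= mask

    base = 123 if word else 69                   # fixed cost from the text
    set_bits = bin(accumulator).count("1")       # +1 per set bit in AL/AX
    product = accumulator * operand
    high_half_zero = (product >> bits) == 0      # +1 if high byte/word is 0

    return base + set_bits + (1 if high_half_zero else 0)


if __name__ == "__main__":
    print(mul_cycles(3, 5))                      # small byte multiply
    print(mul_cycles(0xFF, 0xFF))                # worst-case byte accumulator
    print(mul_cycles(0xFFFF, 0xFFFF, word=True))
```

By these figures a byte MUL costs somewhere between 70 and 77 cycles, and a word MUL between 124 and 139.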
Signed multiplies are even slower (taking at least 80 cycles for a byte, 134 cycles for a word), and depend on the number of set bits in the absolute value of the accumulator, the signs of the operands and whether or not the explicit operand is -0x80 (-0x8000 for word multiplies). I also measured some word IMULs apparently taking a half-integer number of cycles to run, suggesting that there’s either some very weird behavior going on with the 8088’s multiplier or that there’s a bug in my timing program (possibly both).
Can we beat the 8088’s hardware multiplier with a software routine?
To learn how to do this, read further!
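As a preview of how a table of squares can stand in for a multiply: the text so far has not said exactly which method it uses, but one well-known approach is the quarter-square identity, a*b = floor((a+b)^2/4) - floor((a-b)^2/4). The Python sketch below assumes that identity and unsigned 8-bit operands; the names `QUARTER_SQUARES` and `table_multiply` are hypothetical, and a real 8088 routine would of course be written in assembly with its own table layout.

```python
# Table of floor(n*n / 4) for every possible sum of two unsigned bytes
# (0 + 0 up to 255 + 255 = 510).
QUARTER_SQUARES = [(n * n) // 4 for n in range(511)]


def table_multiply(a, b):
    """Multiply two unsigned 8-bit values with two lookups and a subtract.

    Works because a+b and a-b always have the same parity, so the
    fractional parts dropped by the two floor divisions cancel exactly.
    """
    if not (0 <= a <= 255 and 0 <= b <= 255):
        raise ValueError("operands must be unsigned 8-bit values")
    return QUARTER_SQUARES[a + b] - QUARTER_SQUARES[abs(a - b)]


if __name__ == "__main__":
    print(table_multiply(12, 34))
    print(table_multiply(255, 255))
```

The multiply becomes one add, one subtract, two table lookups and a final subtract, which is the kind of sequence that has a chance of beating a 70+ cycle MUL.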