To expand upon Steven Stadnicki's idea, we can quickly construct a naive algorithm that does better than matrix multiplication using the Discrete Fourier Transform.
We count the number of ones in A. If fewer than half the bits are ones, we construct a linked list of their positions. To multiply, we simply shift B left by each position in the list (multiplying B by the power of two that bit represents) and add the results.
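A minimal sketch of this case in Python (`shift_add_multiply` is a hypothetical name, and A and B are taken to be nonnegative integers):

```python
def shift_add_multiply(a, b):
    """Multiply b by a using one shift-and-add per set bit of a."""
    # Collect the positions of the ones in a (the "linked list").
    positions = [i for i in range(a.bit_length()) if (a >> i) & 1]
    # Each set bit at position i contributes b * 2**i to the product.
    return sum(b << i for i in positions)

assert shift_add_multiply(0b1010, 0b111) == 0b1010 * 0b111
```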
If more than half of the bits are ones, we do the same as above, but we populate the list with the positions of the zeros instead. The idea is to subtract this sum from the sum that would be obtained by multiplying by all ones. Since the all-ones string of length $|A|$ has value $2^{|A|} - 1$, that product is $B \cdot 2^{|A|} - B$: we shift B left by the number of bits in A and subtract B from the result. Then we subtract the sum obtained from the linked list.
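The zero-heavy case can be sketched the same way (again with hypothetical names; `n` is the bit width assumed for A, which fixes the all-ones value $2^n - 1$):

```python
def balanced_multiply(a, b, n):
    """Multiply b by an n-bit a, using whichever of a's ones or zeros is rarer."""
    ones = [i for i in range(n) if (a >> i) & 1]
    if len(ones) <= n - len(ones):
        return sum(b << i for i in ones)       # few ones: add directly
    zeros = [i for i in range(n) if not (a >> i) & 1]
    all_ones_product = (b << n) - b            # b * (2**n - 1)
    return all_ones_product - sum(b << i for i in zeros)

assert balanced_multiply(0b11011101, 0b1011, 8) == 0b11011101 * 0b1011
```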
We can call that the naive linked-list algorithm. Its running time is $O(n^2)$ in the worst case, but $O\!\left(|B|\,|A|\sqrt{2/\pi}\right)$ in the average case, which is faster than the DFT approach for small $|A|$.
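For intuition on the average case (a sketch, assuming the bits of A are independent fair coin flips; this estimates the expected list length rather than pinning down the constant): let $X$ count the ones in A. Then

$$X \sim \operatorname{Bin}\!\left(|A|, \tfrac{1}{2}\right), \qquad \mathbb{E}\!\left[\min(X,\, |A|-X)\right] = \frac{|A|}{2} - \mathbb{E}\left|X - \frac{|A|}{2}\right| \approx \frac{|A|}{2} - \sqrt{\frac{|A|}{2\pi}},$$

where the last step uses the normal approximation and the half-normal mean $\sigma\sqrt{2/\pi}$ with $\sigma = \sqrt{|A|}/2$; each list entry then costs one $O(|B|)$ shift-and-add.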
To use the idea of lists optimally, we use divide-and-conquer. We split A in half and find the sizes of the associated lists using the naive algorithm. If a half's list has five or more entries, we split that half again, recursing until every piece's list has fewer than five positions. (The cutoff works because such a piece can then be handled with at most four subtractions.)
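A sketch of that recursion, reusing `balanced_multiply` from the sketch above (`dc_multiply` and `CUTOFF` are hypothetical names; the cutoff of five comes from the paragraph above):

```python
CUTOFF = 5  # pieces whose list is shorter than this are handled directly

def dc_multiply(a, b, n):
    """Multiply b by an n-bit a, splitting a until every piece has a short list."""
    ones = bin(a).count("1")
    if n <= CUTOFF or min(ones, n - ones) < CUTOFF:
        return balanced_multiply(a, b, n)   # naive linked-list algorithm
    half = n // 2
    lo, hi = a & ((1 << half) - 1), a >> half
    # a = hi * 2**half + lo, so a*b = ((hi*b) << half) + lo*b.
    return (dc_multiply(hi, b, n - half) << half) + dc_multiply(lo, b, half)
```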
Even better still, we can improve our divide-and-conquer algorithm: at each split, we consider the possible ways of branching and greedily pick the best one. This preprocessing takes approximately as long as the actual multiplication.
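One plausible reading of that preprocessing, as a cost estimator (hypothetical names; costs count shift-and-add operations, and the `+ 1` charges the extra add that recombines two halves):

```python
def greedy_plan_cost(a, n):
    """Greedily decide, top-down, whether splitting an n-bit piece pays off."""
    ones = bin(a).count("1")
    direct = min(ones, n - ones)               # cost of the naive algorithm here
    if n < 2:
        return direct
    half = n // 2
    lo, hi = a & ((1 << half) - 1), a >> half
    lo_ones, hi_ones = bin(lo).count("1"), bin(hi).count("1")
    split = min(lo_ones, half - lo_ones) + min(hi_ones, (n - half) - hi_ones) + 1
    if split < direct:                          # greedy: commit to the split now
        return greedy_plan_cost(lo, half) + greedy_plan_cost(hi, n - half) + 1
    return direct
```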
If we are allowed unlimited freedom with preprocessing, we can solve the branching problem of the divide-and-conquer algorithm exactly, choosing the optimal split at every level. This takes $O(2^{|A|})$ time in the worst case, but the result should be nearly optimal by addition-chain methods.
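A brute-force version of that exact search might look like this (hypothetical sketch; without memoization it explores every split tree, which is where the exponential worst case comes from):

```python
def optimal_plan_cost(a, n):
    """Exhaustively search every way of cutting an n-bit a into pieces."""
    ones = bin(a).count("1")
    best = min(ones, n - ones)                  # handle the whole piece directly
    for cut in range(1, n):                     # ...or try every split point
        lo, hi = a & ((1 << cut) - 1), a >> cut
        best = min(best, optimal_plan_cost(lo, cut)
                         + optimal_plan_cost(hi, n - cut) + 1)
    return best
```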
I'm working on calculating more precise bounds for the algorithms above.