Tricks for bits/register/matrix operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
by Alain Brobecker (aka baah/Arm's Tech)
(abrobecker@yahoo.com or rte de Dardagny; 01630 CHALLEX; FRANCE)
--------------------------------------------------------------------------------
Purpose: Swapping two registers
From: Sylvain Langlade (aka Nullos/DNT Crew)
Many computer science teacher think you need a temporary variable to exchange
two variables. The following will be handy to mock them, and is also usefull
when you code in assembly and have no register left. It uses the fact that
bitwise operations (and, or..) are commutative and associative.
eor m0,m0,m1 ;m0=aEORb
eor m1,m0,m1 ;m1=(aEORb)EORb=a
eor m0,m0,m1 ;m0=(aEORb)EORa=b
--------------------------------------------------------------------------------
Purpose: Swapping two registers (2)
From: Alain Brobecker (aka baah/Arm's Tech)
And with help by Robert Harley (ThanX =)
This one is easier to recall to mind, and don't use extra register either.
add m0,m0,m1 ;m0=a+b
sub m1,m0,m1 ;m1=a+b-b=a
sub m0,m0,m1 ;m0=a+b-a=b
It always work, even for values excessing the add/sub precision, because
those arithmetic operations are precise modulo 2^32 on a 32 bit processor.
--------------------------------------------------------------------------------
Purpose: Reversing a register
From: Alain Brobecker (aka baah/Arm's Tech)
The main problem is to reverse the bits of a given register, but to keep the
code short I will suppose we have an 8 bits register. Adaptation to 16, 32 or
more bits is quite straightforward, so I leave it to you.
Suppose we want to reverse an 8 bits register m0=b7b6b5b4b3b2b1b0. A simple
method is to test each bit independantly and set their opposite accordingly.
It is done just below, uses 2*8+1 instructions and no extra registers. Using it
on a register of n bits will need 2*n+1 instructions.
mov m1,#0 ;All bits of m1 set to 0
#SET N=0
#REPT 8
tst m0,#1<0
#SET N=N+1
#ENDR
We can notice that bN and bN+(8/2) are distant of 3 bits (counting left to
right or right to left with wrapping) both in original and reversed registers.
So we can use the following which is only 7+1 instructions long, at the expense
of an extra register (1 instruction for m2 initialisation). For a register of
2*n bits this method will require 1+2*(n-1)+1=2*n instructions.
mov m2,#%10001000
and m1,m2,m0,ror #1 ;m1=b0...b4...
#SET N=1
#REPT 3
and m3,m2,m0,ror #N+1 ;m3=bN...bN+4...
add m1,m1,m3,lsr #N ;m1+=bN...bN+4...>>N
#SET N=N+1
#ENDR
Fine, but I give you a method which is 6+2 instructions long, and will prove
significantly faster when applied to bigger registers. For a 2^n bits register
it will need (n-1) extra register and 3*(n-1)+(n-1) instructions.
mov m2,#%10101010
mov m3,#%11001100
and m1,m2,m0,lsl #1 ;m1=b6.b4.b2.b0.
and m0,m2,m0 ;m0=b7.b5.b3.b1.
orr m0,m1,m0,lsr #1 ;m0=b6b7b4b5b2b3b0b1
and m1,m3,m0,ror #2 ;m1=b0b1..b4b5..
and m0,m3,m0,ror #4 ;m0=b2b3..b6b7..
orr m1,m1,m0,lsr #2 ;m1=b0b1b2b3b4b5b6b7
--------------------------------------------------------------------------------
Purpose: Reversing a byte (=Reversing a register 2)
From: Vincent Lefevre
This one uses 8 instructions and only 2 extra registers to reverse a byte.
You'll be able to use it for reversing a 16 bits word, but not a whole 32
bits register (or maybe with an eor trick?). It will be faster than the one
above as long as you don't have many bytes/words to reverse.
orr m0,m0,lsl #8 ;m0=b7b6b5b4b3b2b1b0b7b6b5b4b3b2b1b0
and m1,m0,#&330 ;m1=..b1b0..b5b4....
and m2,m0,#&cc0 ;m2=b3b2..b7b6......
orr m0,m1,m2,lsr #4 ;m0=b1b0b3b2b5b4b7b6..
and m1,m0,#&154 ;m1=.b0.b2.b4.b6..
and m2,m0,#&2a8 ;m2=b1.b3.b5.b7...
orr m0,m1,m2,lsr #2 ;m0=b0b1b2b3b4b5b6b7.
mov m0,m0,lsr #1 ;m0=b0b1b2b3b4b5b6b7
--------------------------------------------------------------------------------
Purpose: Fast 90 degrees Rotation
From: Alain Brobecker (aka baah/Arm's Tech)
(Devised during a dull numerical analysis course =)
Supposing datas are bytes, by slicing the whole matrix/texture into smaller
sets, we can reduce our problem to rotation of 4*4 matrixs which will fit in
four 32 bits registers. Of course this can be adapted for other data sets, such
as 8*8 matrixs of nibbles (4 bits). I won't give related code here, only show
what the matrix operations are. So our problem is to do:
1st long - a1a2a3a4 \ a4b4c4d4
2nd - b1b2b3b4 --------\ a3b3c3d3
3rd - c1c2c3c4 --------/ a2b2c2d2
4th - d1d2d3d4 / a1b1c1d1
My first solution is quite similar to the one used for reversing a register.
An underscore in front of a longword means it was already in previous dataset
and so doen' t need to be computed, but I copy it for clarity. This method
needs 8+4+4+4=20 operations and 8+2 registers. With a reorganistion of the
operations we can reduce to just 6+2 registers. When working with nibbles it
will use 64 instructions to rotate the 8*8 matrix.
a1.a3. a1b1..
.a2.a4 a2b2..
a1a2a3a4 b1.b3. a1b1a3b3 ..c3d3 a1b1c1d1
b1b2b3b4 .b2.b4 a2b2a4b4 ..c4d4 a2b2c2d2
c1c2c3c4 c1.c3. c1d1c3d3 _a1b1a3b3 a3b3c3d3
d1d2d3d4 .c2.c4 c2d2c4d4 _a2b2a4b4 a4b4c4d4
d1.d3. _c1d1c3d3
.d2.d4 _c2d2c4d4
The second method I found is more tricky (my mind shall be messy =). It
requires 6+2+2+4+4=18 operations, the same amount of registers, and only 56
operations for 8*8 matrix (the EOR trick is then used twice). Using the EORs
spares us the isolation sequence of two longwords, but does the same since it
is bitwise oriented, associative and commutative.
a1..a3..
a1a2a3a4 b1..b3.. a1b1a3b3 \
b1b2b3b4 (1) = a2a3a4a1 EOR b1b2b3b4 _(1) ---\
c1c2c3c4 c1..c3.. c1d1c3d3 ---/
d1d2d3d4 d1..d3.. _(2) /
(2) = c2c3c4c1 EOR d1d2d3d4
_a1b1a3b3
a1b1..
\ _a1b1a3b3 _c1d1c3d3 a1b1c1d1
---\ a2b2a4b4 = b1a3b3a1 EOR (1) ..c3d3 a3b3c3d3
---/ _c1d1c3d3 _a2b2a4b4 a2b2c2d2
/ c2d2c4d4 = d1c3d3c1 EOR (2) a2b2.. a4b4c4d4
_c2d2c4d4
..c4d4
--------------------------------------------------------------------------------
Purpose: Find position of the first set bit in a register
From: Alain Brobecker (aka baah/Arm's Tech)
To find the first set bit of a register m0, namely INT(log(m0)/log(2)), I use
a dichotomic approach. Assuming the register m0 is 2^5 bits long, the position
can be found with the following sequence. The unsigned Higher or Same (HS)
condition code is equivalent to Carry Set (CS). Uses 3*n instructions on a 2^n
bits register.
mov m1,#0 ;m1 will contain position of first set bit
]:FORc%=4TO1STEP-1:n%=2^c%:[opt opt%
cmp m0,#1<>n%
]:NEXTc%:[opt opt%
cmp m0,#2
addHS m1,m1,#1
--------------------------------------------------------------------------------
Purpose: Reversing endians
From: Olly Betts
We have m0=abcd where a-d are bytes, and we want to have m0=dcba. This first
method will be handy if you have to perform the operation many times in a loop,
it will take 1+3n cycles to reverse n longwords (without loop etc...) but uses
two extra registers.
mvn m2,#&ff00 ;m2=&ffff00ff
;The following goes in the loop
eor m1,m0,m0,ror #16 ;m1=aEORc bEORd aEORc bEORd
and m1,m2,m1,lsr #8 ;m1=0 aEORc 0 aEORc
eor m0,m1,m0,ror #8 ;m0=dabc EOR m1=dcba
--------------------------------------------------------------------------------
Purpose: Reversing endians (2)
From: Tom Hughes
This one is a slightly different version from the above one, which uses only
one extra register, and requires 4n cycles, which is as fast if we reverse only
one longword (ie n=1).
eor m1,m0,m0,ror #16 ;m1=aEORc bEORd aEORc bEORd
bic m1,m1,#&ff0000 ;m1=aEORc 0 aEORc bEORd
mov m0,m0,ror #8 ;m0=dabc
eor m0,m0,m1,lsr #8 ;m0=dcba
--------------------------------------------------------------------------------
Purpose: An ultra-fast CRC-like routine
From: Kostas Proitsakis (aka GUS/Arm's Tech)
The following is what I use when speed is critical and CRCs are needed.
=> block = pointer to byte block
size = size of block
temp = temp reg
<= crc = pseudo CRC value
block = preserved
size = 0
temp = corrupted
mov crc,#0
.loop subs size,size,#1
ldrgeb temp,[2,3]
addge crc,crc,size,ror temp
bgt loop
--------------------------------------------------------------------------------
Purpose: Create the mask of a sprite on the fly
From: Kostas Proitsakis (aka GUS/Arm's Tech)
The following creates mask for colour 0 on the fly.
=> d = a reg containing sprite data
bm = &80808080 (for 8bit mode), &88888888 (for 4bit mode)
<= d = preserved
bm = preserved
m = mask
orr m,d,d,lsr#1
orr m,m,m,lsr#2
orr m,m,m,lsr#4 ;only for 8bit mode
and m,m,bm
rsb m,m,m,lsl#8 ;4 in 4bit mode
--------------------------------------------------------------------------------
Purpose: Load an unaligned longword
From: David McQuillan
Corrected (wrong lsl/lsr) and modified by Alain Brobecker
The following loads into m1 the unaligned longword found at position m0.
David was using the couple "ldrEQ m1,[m0]" and "ldmNEia m0,{m1,m2}" but it
uses one instruction more, and probably no time less.
andS m3,m0,#3 ;get low bits and set Z flag if 0
ldmia m0,{m1,m2} ;load unaligned words
movNE m3,m3,lsl #3 ;shift=low bits*8
movNE m1,m1,lsr m3 ;if unaligned, shift to right position
rsbNE m3,m3,#32
orrNE m1,m1,m2,lsl m3 ;shift it, and or with first word
--------------------------------------------------------------------------------
Purpose: Searching the middle value out of three
From: Alain Brobecker (aka baah/Arm's Tech)
(because ArmOric needed that one =)
Suppose we have three values in m0, m1, m2 and we want the middle value, ie
the value which would be in second position if the values were sorted. To do
this i search for max(min(a;b);min(max(a;b);c)).
subS m3,m0,m1 ;m3=a-b
addGT m1,m3,m1 ;m1=max(a;b)
subGT m0,m1,m3 ;m0=min(a;b)
cmp m1,m2 ;flags=max(a;b)-c
movGT m1,m2 ;m1=min(max(a;b);c)
cmp m0,m1 ;flags=min(a;b)-min(max(a;b);c)
movGT m1,m0 ;m1=max(min(a;b);min(max(a;b);c))
--------------------------------------------------------------------------------
Purpose: Mixing min|max values
From: Alain Brobecker (aka baah/Arm's Tech)
(because ArmOric needed that one too =)
We have m0=max0|min0 (16 bits each) and m1=max1|min1, and we want the min|max
for the whole set of values, in the same 16 bits format.
cmp m0,m1
movGE m2,m1,lsl #16
movLT m2,m0,lsl #16
movLT m0,m1 ;here m0=max|min?, and m2=other min
cmp m2,m0,lsl #16
addLT m2,m2,m0,lsr #16
movLT m0,m2,ror #16
--------------------------------------------------------------------------------
Purpose: Computing approx value of 1/(2^n-1)
From: James Blinn
We know that dividing by 2^n is the same as making an arithmetic shift right
by n bits. Also, dividing by 2^n-1 can be easily computed knowing that
1/(2^n-1) ~ (2^n+1)/(2^2n), because (2^n-1)*(2^n+1)=2^2n-1, ie we have:
1
----- = 0,0...010...010.. (n-1 zeroes between the ones)
n
2 - 1
--------------------------------------------------------------------------------
Purpose: Aligning adress on the nearest 2^n value
From: Alain Brobecker (aka baah/Arm's Tech)
Suppose we want to align an adress on the nearest 2^n value. For example with
2^2=4 (yep, a longword ;) we have %000+3=%011; %001+3=%100; %010+3=%101 and
%011+3=%110. Then we can use the following sequence:
add m0,m0,#2^n-1
bic m0,m0,#2^n-1