Thursday, July 23, 2009
SSE2 Data Transfer/Packed Arithmetic Instruction - Example
SIMD: Single Instruction Multiple Data
This example shows the operation of 3 SSE2 instructions:
a) MOVLPD - SSE2 Data Transfer Instruction
b) MOVHPD - SSE2 Data Transfer Instruction
c) ADDPD - SSE2 Packed Arithmetic Instruction
The registers used in the example are the XMM registers. They were introduced with SSE and are separate from the 64-bit MMX registers, even though the name suggests an extension of MMX. The x86 architecture provides 16 XMM registers in 64-bit mode and 8 in 32-bit mode.
The XMM registers are 128-bit registers. They can be thought of as having 2 parts: a lower and an upper part of 64 bits each.
MOVLPD - Moves a double-precision value to the lower part of the XMM register (bits 63:0). The upper part is left unchanged.
MOVHPD - Moves a double-precision value to the upper part of the XMM register (bits 127:64). The lower part is left unchanged.
ADDPD - Adds the two packed double-precision values in the source register to those in the destination register and saves the result in the destination register.
The instruction addpd xmm1, xmm0 works as follows:
xmm1[63:0] <- xmm0[63:0] + xmm1[63:0]
xmm1[127:64] <- xmm0[127:64] + xmm1[127:64]
Here is a simple example that utilizes all these instructions:
1. The goal of this example is to add mm0_data_low (1.5) to mm1_data_low (2.5) and mm0_data_high (2.5) to mm1_data_high (2.0).
2. With SIMD instructions, the two pairs of floating point numbers are added by a single instruction. Hence the name SIMD - Single Instruction Multiple Data.
//////////////////////////////////
section .data
mm0_data_low dq 1.5
mm0_data_high dq 2.5
mm1_data_high dq 2.0
mm1_data_low dq 2.5
section .text
global _start
_start:
nop
; xmm0[63:0] <- 1.5
movlpd xmm0, [mm0_data_low]
; xmm0[127:64] <- 2.5
movhpd xmm0, [mm0_data_high]
; xmm1[63:0] <- 2.5
movlpd xmm1, [mm1_data_low]
; xmm1[127:64] <- 2.0
movhpd xmm1, [mm1_data_high]
; xmm1[127:64] <- xmm0[127:64] + xmm1[127:64]
; xmm1[63:0] <- xmm0[63:0] + xmm1[63:0]
addpd xmm1,xmm0
mov eax, 1
mov ebx, 0
int 0x80
//////////////////////////////////////////////
Let's run this program through gdb and see what the values are:
We expect the following values in XMM1:
xmm1[127:64] = 4.5
xmm1[63:0] = 4.0
After loading the low-part of xmm0:
(gdb) p $xmm0
$2 = v2_double = {1.5, 0}
xmm0 low-part is 1.5
Now load the upper-part of xmm0:
(gdb) next
14 movhpd xmm0, [mm0_data_high]
(gdb) p $xmm0
$3 = v2_double = {1.5, 2.5}
xmm0 upper-part is 2.5 and xmm0 low-part is 1.5
Now load the low-part of xmm1:
(gdb) next
15 movlpd xmm1, [mm1_data_low]
(gdb) p $xmm1
$4 = v2_double = {2.5, 0}
xmm1 low-part is 2.5
(gdb) next
16 movhpd xmm1, [mm1_data_high]
(gdb) p $xmm1
$5 = v2_double = {2.5, 2}
xmm1 upper-part is 2.0 and low-part is 2.5
Finally, the addpd:
(gdb) next
17 addpd xmm1,xmm0
(gdb) p $xmm1
$6 = v2_double = {4, 4.5}
This agrees with our expected result of xmm1[127:64] = 4.5 and xmm1[63:0] = 4.0.
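As a cross-check, the lane-wise behaviour can be modelled in a few lines of Python. This is only a sketch: the helper names movlpd, movhpd and addpd are hypothetical stand-ins that mimic the instructions on a two-element list, where index 0 is bits 63:0 and index 1 is bits 127:64 (the same order gdb prints in v2_double).

```python
# Minimal model of a 128-bit XMM register as two double-precision lanes.
# Index 0 = bits 63:0 (low part), index 1 = bits 127:64 (upper part).
# The function names are illustrative only; they are not a real library API.

def movlpd(reg, value):
    """Load a double into the low lane; the upper lane is left unchanged."""
    return [value, reg[1]]

def movhpd(reg, value):
    """Load a double into the upper lane; the low lane is left unchanged."""
    return [reg[0], value]

def addpd(dst, src):
    """Add both lanes pairwise and return the new destination value."""
    return [dst[0] + src[0], dst[1] + src[1]]

xmm0 = [0.0, 0.0]
xmm1 = [0.0, 0.0]

xmm0 = movlpd(xmm0, 1.5)   # mm0_data_low
xmm0 = movhpd(xmm0, 2.5)   # mm0_data_high
xmm1 = movlpd(xmm1, 2.5)   # mm1_data_low
xmm1 = movhpd(xmm1, 2.0)   # mm1_data_high

xmm1 = addpd(xmm1, xmm0)   # one call updates both lanes
print(xmm1)
```

The single addpd call produces both sums at once, which is the point of the SIMD instruction.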
Tuesday, June 16, 2009
String Instructions - scasb,scasw,scasd,scasq
Labels: scasb, scasd, scasw, x86 string instructions
Monday, May 4, 2009
Using FLDPI and FMUL instructions
Note: Instead of FMUL with an operand of 0.5, you can also use FDIV with an operand of 2.0
------------------------------------------------------------
section .data
data dd 0.5
section .text
global _start
_start:
nop
finit
fldpi ; load PI in ST0
fmul dword [data]
fsin ; ST0 <- sin(ST0)
mov eax, 1 ; exit system call
mov ebx, 0
int 0x80
------------------------------------------------------------
GDB:
10 fldpi ; load PI in ST0
1: $st0 = 0
11 fmul dword [data]
1: $st0 = 3.1415926535897932385128089594061862 ---> PI
12 fsin
1: $st0 = 1.5707963267948966192564044797030931 ---> PI/2
14 mov eax,1
1: $st0 = 1 ---> Sin (PI/2)
------------------------------------------------------------
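The same computation can be checked outside the FPU. The sketch below uses Python's standard math module (a choice made for this note, not part of the original program) to confirm that multiplying by 0.5 and dividing by 2.0 give the same angle, and that its sine is 1. Python works in double precision rather than the 80-bit format of the x87 stack, so the printed digits of PI are fewer than in the gdb trace.

```python
import math

# FMUL by 0.5 and FDIV by 2.0 both halve PI; scaling by a power of two is exact.
half_pi_mul = math.pi * 0.5
half_pi_div = math.pi / 2.0

result = math.sin(half_pi_mul)   # counterpart of fsin on ST0
print(half_pi_mul == half_pi_div, result)
```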
Using the FSIN instruction to compute SINE of an angle
The fsin instruction has an implied operand (top of the floating point stack).
The top of the floating point stack is denoted by ST0 register.
The fld instruction loads the floating point value from the location 'theta' into ST0:
fld dword [theta]
Now the operand is in st0. The fsin instruction computes the sine of the angle in st0 and overwrites st0 with the result. Note: The operand to fsin is in radians (not degrees). In the example below theta contains 1.57 (~PI/2 radians). So at the end of the program we expect a value close to Sin(PI/2) = 1 in ST0.
-----------------------------------------------------
section .data
theta dd 1.57
section .text
global _start
_start:
nop
finit
fld dword [theta]
fsin
mov eax,1
mov ebx, 0
int 0x80
-------------------------------------------------
It is easy to see the program-flow with gdb:
9 finit
1: $st0 = 0 -> ST0 = 0
10 fld dword [theta] <--- next instruction to be executed.
1: $st0 = 0
11 fsin <--- next instruction to be executed
1: $st0 = 1.57000005245208740234375 (Result of FLD is in ST0)
13 mov eax,1
1: $st0 = 0.99999968297360224280629845128309796 <--- (Result of FSIN is in ST0)
--------------------------------------------------------------
The value in ST0 is 0.9999996..., which is close to the expected value of 1. The small error arises because our input, 1.57, is only an approximation of PI/2.
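The two values in the trace can be reproduced approximately in Python. This is a sketch under two assumptions: the struct module is used to round 1.57 to single precision (which is what 'theta dd 1.57' stores), and math.sin works in double precision instead of the 80-bit x87 format, so only the leading digits are expected to agree with gdb.

```python
import math
import struct

# 'theta dd 1.57' stores a 32-bit float, so round 1.57 to single precision first.
theta = struct.unpack("<f", struct.pack("<f", 1.57))[0]

sine = math.sin(theta)          # double precision, not the 80-bit x87 format
error = math.pi / 2 - theta     # how far theta is from the true PI/2

print(f"theta = {theta:.23f}")
print(f"sin   = {sine}")
print(f"PI/2 - theta = {error}")
```

The gap between theta and PI/2 is about 0.0008 radians, which accounts for the result falling slightly short of 1.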
Saturday, May 2, 2009
Representation of Floating point numbers in a microprocessor
a)Single Precision (32 bits)
b)Double precision (64 bits)
c)Double Extended precision (80 bits)
Any floating point number is of the form: mantissa * 2 ^ exponent (together with a sign).
Single precision uses 32 bits:
1 bit (bit 31) is for the sign.
8 bits (bits 30:23) are for the exponent. Note that the exponent is biased.
23 bits (bits 22:0) are for the mantissa.
Double precision uses 64 bits:
1 bit (bit 63) is for the sign.
11 bits (bits 62:52) are for the exponent. Note that the exponent is biased.
52 bits (bits 51:0) are for the mantissa.
Double-Extended precision uses 80 bits:
1 bit (bit 79) is for the sign.
15 bits (bits 78:64) are for the exponent. Note that the exponent is biased.
64 bits (bits 63:0) are for the mantissa.
How is a number like 10.25 represented in single precision format?
10.25 (base 10) = 1010.01 (base 2)
The first step is to normalize the binary number. Normalization means writing the number in the form 1.xxxxxx * 2 ^ exponent.
1010.01 = 1.01001 * 2 ^ 3
The above representation is of the form mantissa * 2 ^ exponent.
Mantissa = 1.01001
Exponent = 3
This is the information that is saved in the floating point registers inside the microprocessor. The thing to note is that since the numbers are normalized (i.e., 1.xxxxx), the '1' in the integer portion of the mantissa is not saved explicitly but is implied. In the case of the exponent, instead of saving the 'real' exponent, a biased exponent is saved.
Biased exponent = Real exponent + 127 (for single precision)
For the above example, biased exponent = 3 + 127 = 130.
So we now have,
Mantissa = 01001 (omitting the 1 and the binary point)
Exponent = 1000 0010 (130 in binary)
Sign bit = 0 (positive number)
Constructing the 32-bit register we have:
bit 31 sign -> 0
bits 30:23 exponent -> 1000 0010
bits 22:0 mantissa -> 01001000000000000000000
So the register would read the following:
0100 0001 0010 0100 0000 0000 0000 0000
In hex:
0x41240000
The same concept described above can be extended to double precision and double extended precision formats. Note that in double-extended precision format, the integer part (that is implied in single/double precision) is included and is specified explicitly in bit 63.
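The worked example, and its double-precision counterpart, can be reproduced with Python's struct module. This is a minimal sketch: the FORMATS table and the decode helper are names invented for this illustration, and the double-precision bias of 1023 is the standard IEEE-754 value rather than something stated in the text above. Double-extended precision is left out because struct has no 80-bit format.

```python
import struct

# precision -> (struct float code, struct integer code,
#               exponent bits, mantissa bits, exponent bias)
FORMATS = {
    "single": (">f", ">I", 8, 23, 127),
    "double": (">d", ">Q", 11, 52, 1023),
}

def decode(value, precision):
    """Split a number into IEEE-754 sign, biased exponent and stored mantissa."""
    float_code, int_code, exp_bits, man_bits, bias = FORMATS[precision]
    # Reinterpret the encoded float as an unsigned integer of the same width.
    (bits,) = struct.unpack(int_code, struct.pack(float_code, value))
    sign = bits >> (exp_bits + man_bits)
    biased_exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << man_bits) - 1)   # implied leading 1 is not stored
    return bits, sign, biased_exponent, biased_exponent - bias, mantissa

for precision in FORMATS:
    bits, sign, biased, real, mantissa = decode(10.25, precision)
    man_bits = FORMATS[precision][3]
    print(f"{precision}: hex=0x{bits:x} sign={sign} "
          f"exponent={biased} (real {real}) mantissa={mantissa:0{man_bits}b}")
```

For 10.25 the single-precision line shows 0x41240000 with a biased exponent of 130, matching the hand calculation. The double-precision line shows the same real exponent of 3, stored with the larger bias.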
A floating point calculator is a convenient way to see how a number is represented in the different formats. I recommend the IEEE floating point calculator here:
http://babbage.cs.qc.edu/IEEE-754/Decimal.html
Labels: Double Extended Precision, Double Precision, Floating point, Single Precision
Friday, April 17, 2009
Another Hello World program in assembly using puts
extern puts ; puts is defined externally (in the C library)
section .data
msg db "Hello World",0
section .text
global main
BITS 32
main:
push msg ; 32-bit code passes arguments on the stack
call puts
add esp, 4 ; remove the argument from the stack
mov eax, 0 ; return 0 from main
ret
There are minor changes from a standalone assembly program:
1) Notice the extern declaration at the top. This declaration is there so that the assembler does not complain about the 'call puts' instruction. It just says that the function is defined elsewhere and that the symbol 'puts' will be resolved at link time.
2) Notice that the '_start' symbol used in stand-alone assembly has been replaced by 'main'. This is necessary because this is not a stand-alone program and will be linked differently: the C startup code already defines _start and calls main. If you use the _start symbol in your assembly program, you will get an error at link time about multiple definitions of _start.
3) push msg - pushes the address of the msg buffer onto the stack. In 32-bit code, function arguments are passed on the stack. Passing the first argument in edi/rdi is the 64-bit convention and does not work in a 32-bit program.
4) call puts - calls the puts function, which displays the message on the screen.
5) add esp, 4 - removes the 4-byte argument from the stack after puts returns.
To assemble:
nasm -felf print.asm
To link:
gcc -m32 print.o -> this will resolve the symbol puts, so when the assembly code calls puts it will know which function to transfer control to. The -m32 switch is needed on a 64-bit system because print.o is a 32-bit object file.
Run:
./a.out
------------------------------------------------------------------------------------------------------------------------------------
It is also useful to write a simple C program (e.g. a hello world program), then use gcc with the -S switch to generate a .s file that contains the assembly. Keep in mind that the assembly generated this way uses the GNU assembler (AT&T) syntax, not the nasm syntax.
If your C program is hello.c, you would do this to generate the assembly:
gcc hello.c -S
This will generate hello.s with the assembly code.
-----------------------------------------------------------------------------------------------
Thursday, April 16, 2009
CPUID - a simple example
%macro exitprog 0
mov ebx, 0
mov eax, 1
int 0x80
%endmacro
%macro showstring 3
mov eax, 4
mov ebx, %1
mov ecx, %2
mov edx, %3
int 0x80
%endmacro
section .data
here:
times 1 dd 0
eos:
times 1 dd 0
ebx_data dd 0
edx_data dd 0
ecx_data dd 0
segment .text
global _start
_start:
mov eax, 0
cpuid
mov dword [ebx_data], ebx
mov dword [edx_data], edx
mov dword [ecx_data], ecx
showstring 1,ebx_data,12
exitprog
Only the first 2 instructions are important. When cpuid is executed with a value of 0 in eax, it returns the vendor_id information in ebx,edx and ecx. The other instructions in the program merely display the string.
Notice the use of macros. The 'showstring' macro wraps the file-write system call used in the 'hello world' program. The 'exitprog' macro wraps the sys-exit system call.
Upon assembling and running the program the following output is obtained:
GenuineIntel
Please note that depending on the CPU, the output might vary.
On an AMD machine, the following output will be obtained:
AuthenticAMD
(In both cases the result is a 12-byte string contained in ebx, edx and ecx, in that order.)
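To see why storing the three registers back to back yields readable text, the register contents can be packed in Python. This sketch does not execute CPUID; it uses the register values that leaf 0 is known to return on an Intel CPU and relies on the little-endian byte order of x86.

```python
import struct

# Register values that CPUID leaf 0 returns on an Intel CPU.
ebx = 0x756E6547   # bytes in memory: "Genu"
edx = 0x49656E69   # bytes in memory: "ineI"
ecx = 0x6C65746E   # bytes in memory: "ntel"

# x86 is little-endian, so storing EBX, EDX, ECX back to back in memory
# (as the program does with ebx_data, edx_data, ecx_data) yields the string.
vendor = struct.pack("<3I", ebx, edx, ecx).decode("ascii")
print(vendor)
```

Packing the values in the order ebx, ecx, edx instead would scramble the string, which is why the order of the three data labels in the program matters.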
Hello World program in assembly
section .data
msg db "Hello, world!" ; no trailing 0: write takes an explicit length
len equ $ - msg
section .text
global _start
_start:
mov edx,len
mov ecx,msg
mov ebx,1
mov eax,4 ; eax = 4 -> write to file
int 0x80
mov ebx,0
mov eax,1 ; eax = 1 -> exit
int 0x80
Key points:
The only interesting thing in the above program is the instruction int 0x80. int 0x80 is the software interrupt through which a 32-bit program requests a Linux system call. It provides many different services; which one is performed depends on the value in eax.
The first call to int 0x80 is with eax = 4, which signifies 'write to a file'. The file descriptor is provided in the ebx register. In this case ebx = 1, which refers to stdout. Since we are writing to stdout, there is no need to open the file. ecx holds the pointer to the data and edx holds the length. The length is computed by subtracting the address of the first byte from the address of the byte just after the last byte.
For example, here is what the string layout will look like in memory (assume start address 0x100) :
0x100 -> 'H' 'e' 'l' 'l' 'o' ',' ' ' 'w' 'o' 'r' 'l' 'd' '!' <-0x10C
Each character is assigned a byte. So the starting address is 0x100 and the address of the byte after the last byte is 0x10d.
This is accomplished by the following line of code in the program:
; $ = 0x10d, msg = 0x100, so len = 0xd
len equ $ - msg
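The address arithmetic can be replayed in Python. This is only a sketch: the start address 0x100 is the assumed value from the layout above, and the variable names are chosen for this illustration.

```python
msg = b"Hello, world!"       # the 13 bytes shown in the memory layout
start = 0x100                # assumed address of the first byte (msg)
end = start + len(msg)       # address of the byte after the last byte ($)

length = end - start         # what 'len equ $ - msg' evaluates to
print(hex(end), hex(length))

# A terminating 0 byte in the db line would also be counted,
# because $ would then sit one byte further along.
length_with_nul = (start + len(msg + b"\x00")) - start
print(hex(length_with_nul))
```

Because write is given an explicit length in edx, the string does not need a terminating 0 byte. If one is present in the db line before the len definition, it is counted and written out as well.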
The second call to int 0x80 is with eax = 1, which signifies the exit system call. ebx contains the exit status. Think of it as the exit(0) call that you see in C programs.
Build the program:
I use the nasm assembler. It is available as a package in most Linux distributions; just search for nasm and install it.
Assemble:
1) nasm -felf hello.asm
will create hello.o in the same directory.
Link:
2) ld hello.o
will create a.out. On a 64-bit system, use ld -m elf_i386 hello.o instead, because hello.o is a 32-bit object file.
Run:
./a.out
The output will be the 'Hello, world!' string.