Thursday, July 23, 2009

 

SSE2 Data Transfer/Packed Arithmetic Instruction - Example

SSE2: Streaming SIMD Extensions 2
SIMD: Single Instruction Multiple Data

This example shows the operation of 3 SSE2 instructions:

a) MOVLPD - SSE2 Data Transfer Instruction
b) MOVHPD - SSE2 Data Transfer Instruction
c) ADDPD - SSE2 Packed Arithmetic Instruction

The registers used in the example are the XMM registers, the 128-bit registers introduced with SSE (they are separate from the 64-bit MMX registers). The x86 architecture provides 16 XMM registers in 64-bit mode and 8 in 32-bit mode.

Each XMM register is 128 bits wide and can be thought of as having 2 parts: a lower and an upper part of 64 bits each.

MOVLPD - Moves Data to the lower part of the XMM register. (bits 63:0)
MOVHPD - Moves Data to the upper part of the XMM register. (bits 127:64)
ADDPD - Adds the packed values in the two registers and saves the result in the destination register.

The instruction addpd xmm1, xmm0 works as explained under:

xmm1[63:0] <- xmm0[63:0] + xmm1[63:0]
xmm1[127:64] <- xmm0[127:64] + xmm1[127:64]
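
As a rough illustration of the lane-wise behaviour described above, here is a small Python sketch. It models an XMM register as a (low, high) pair of doubles; the function names simply mirror the mnemonics and are only illustrative, not a real API, and the model ignores everything about how the hardware actually implements these instructions.

```python
# Model an XMM register as a (low, high) pair of doubles:
# low = bits 63:0, high = bits 127:64.

def movlpd(reg, value):
    """Replace the low lane and keep the high lane unchanged."""
    _, high = reg
    return (value, high)

def movhpd(reg, value):
    """Replace the high lane and keep the low lane unchanged."""
    low, _ = reg
    return (low, value)

def addpd(dst, src):
    """Add the two registers lane by lane; the result goes to dst."""
    return (dst[0] + src[0], dst[1] + src[1])

if __name__ == "__main__":
    xmm0 = movhpd(movlpd((0.0, 0.0), 1.5), 2.5)
    xmm1 = movhpd(movlpd((0.0, 0.0), 2.5), 2.0)
    print(addpd(xmm1, xmm0))  # (4.0, 4.5)
```

Both additions happen in the single addpd call, which is the point of the packed form.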

Here is a simple example that utilizes all these instructions:

1. The goal of this example is to add mm0_data_low (1.5) to mm1_data_low (2.5) and mm0_data_high (2.5) to mm1_data_high (2.0).

2. With the SIMD instructions, both pairs of floating point numbers are added by a single instruction. Hence the name SIMD - Single Instruction Multiple Data.


//////////////////////////////////
section .data
mm0_data_low dq 1.5
mm0_data_high dq 2.5
mm1_data_high dq 2.0
mm1_data_low dq 2.5

section .text

global _start

_start:
nop

; xmm0[63:0] <- 1.5
movlpd xmm0, [mm0_data_low]

; xmm0[127:64] <- 2.5
movhpd xmm0, [mm0_data_high]

; xmm1[63:0] <- 2.5
movlpd xmm1, [mm1_data_low]

; xmm1[127:64] <- 2.0
movhpd xmm1, [mm1_data_high]

; xmm1[127:64] <- xmm0[127:64] + xmm1[127:64]

; xmm1[63:0] <- xmm0[63:0] + xmm1[63:0]

addpd xmm1,xmm0

mov eax, 1
mov ebx, 0
int 0x80
//////////////////////////////////////////////


Let's run this program through gdb and look at the values.
We expect the following values in XMM1:
xmm1[127:64] = 4.5
xmm1[63:0] = 4.0

After loading the low-part of xmm0:

(gdb) p $xmm0
$2 = v2_double = {1.5, 0}
xmm0 low-part is 1.5

Now load the upper-part of xmm0:

(gdb) next
14 movhpd xmm0, [mm0_data_high]
(gdb) p $xmm0
$3 = v2_double = {1.5, 2.5}
xmm0 upper-part is 2.5 and xmm0 low-part is 1.5

Now load the low-part of xmm1:

(gdb) next
15 movlpd xmm1, [mm1_data_low]
(gdb) p $xmm1
$4 = v2_double = {2.5, 0}
xmm1 low-part is 2.5

(gdb) next
16 movhpd xmm1, [mm1_data_high]
(gdb) p $xmm1
$5 = v2_double = {2.5, 2}
xmm1 upper-part is 2.0 and low-part is 2.5

Finally, the addpd:
(gdb) next
17 addpd xmm1,xmm0
(gdb) p $xmm1
$6 = v2_double = {4, 4.5}

This agrees with our expected result of xmm1[127:64] = 4.5 and xmm1[63:0] = 4.0.


Tuesday, June 16, 2009

 

String Instructions - scasb,scasw,scasd,scasq

The x86 architecture offers different types of instructions to perform various string operations. The scan string instruction is one of them. There are different flavors of the scan string instruction: scasb (byte form), scasw (word), scasd (double word) and scasq (quad word).

scasb: Compares the byte in AL with the byte at memory location ES:EDI and sets the flags accordingly.
scasw: Compares the word in AX with the word at memory location ES:EDI and sets the flags accordingly.
scasd: Compares the dword in EAX with the dword at memory location ES:EDI and sets the flags accordingly.
scasq: Compares the qword in RAX with the qword at memory location ES:(E/R)DI and sets the flags accordingly.

When the scas* instructions are used with a repeat prefix they become very powerful. For example, the scasb instruction can be used with the repne (repeat while not equal) prefix to compute the length of a string.
Here is an algorithm showing how the scasb instruction works when used with the repne prefix:

1. If ECX == 0, go to 6.
2. cmp AL with the byte at ES:EDI (sets the flags)
3. if(DF==0) EDI = EDI+1 else EDI=EDI-1
4. ECX = ECX-1
5. If the bytes were not equal (ZF==0), jmp to 1.
6. DONE

The DF above is the direction flag, which controls the direction in which the string operation proceeds. If DF is 0, EDI is incremented after every iteration. If DF is 1, EDI is decremented after every iteration. The amount by which EDI changes depends on which version of scas is used (1, 2, 4 or 8 bytes for byte/word/dword/qword). For the string length, scasb keeps it simple.

To control the direction flag, use the std/cld (set/clear direction flag) instructions. Assume AL holds 0 (the NUL character that terminates the string). Because EDI is advanced even on the iteration that finds the match, it ends up pointing one byte past the terminator. Subtracting the initial value of EDI from the final value, and then subtracting one from the result, gives the string length.

string length = final edi - initial edi - 1;
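
The loop and the length formula can be sketched in Python as below. This is only a model under simplifying assumptions: memory is a Python bytes object, edi is an index into it, and the function names are made up for illustration.

```python
def repne_scasb(memory, edi, al=0, ecx=255, df=0):
    """Model of 'repne scasb'; returns the final (edi, ecx).

    Assumes the byte being searched for occurs within ecx bytes,
    otherwise the index may run past the end of 'memory'.
    """
    step = -1 if df else 1
    while ecx != 0:
        equal = memory[edi] == al  # compare AL with the byte at ES:EDI
        edi += step                # EDI moves even when the bytes match
        ecx -= 1
        if equal:                  # ZF is set, so repne stops
            break
    return edi, ecx

def strlen(memory, start):
    """string length = final edi - initial edi - 1"""
    final_edi, _ = repne_scasb(memory, start)
    return final_edi - start - 1

if __name__ == "__main__":
    print(strlen(b"Siddharth\x00", 0))  # 9
```

The extra "- 1" in strlen accounts for the step that EDI takes past the terminating byte.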

Here is an example program:


-----------------------------------------------
section .data
mystring db "Siddharth", 0
mystrlen dd 0


section .text
global _start
_start:
nop
mov ax, ds
mov es, ax ; Initialize ES
mov edi, mystring ; Initialize EDI and EBP to point to the
mov ebp, mystring ; string in memory.
cld ; Clear eflags.df
mov ecx, 255 ; set ecx to a high value
mov al, 0 ; Initialize al with null character.
repne scasb ; scan bytes in the string
dec edi
sub edi, ebp ; This should put the string length in edi.
mov dword [mystrlen], edi; store string length in memory

; Use the stringlength as the exit-code
mov ebx, [mystrlen]
mov eax,1 ; 'exit' system call
int 80h ; call the kernel
---------------------------------------------


After assembling and running the program the exit code will contain the string length.
Typically executing 'echo $?' gives the exit code of the last command the shell executed. In this case, you will have a value of 9 which is the string length.





Monday, May 4, 2009

 

Using FLDPI and FMUL instructions

The FLDPI and FMUL instructions can be used in conjunction with FSIN to generate an accurate result for SIN (PI/2). The FLDPI will load PI in ST0. We can then use FMUL to multiply ST0 with 0.5 to generate PI/2 in ST0. FSIN then computes Sin (PI/2).

Note: Instead of FMUL with an operand of 0.5, you can also use FDIV with an operand of 2.0
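
As a quick cross-check of the note above, here is a Python sketch. Keep in mind the assumption: Python floats are 64-bit doubles, whereas the x87 stack works in 80-bit extended precision, so the digits differ from what gdb shows even though the idea is the same.

```python
import math

# Multiplying by 0.5 and dividing by 2.0 only change the exponent,
# so both give exactly the same value for PI/2.
half_pi_mul = math.pi * 0.5
half_pi_div = math.pi / 2.0

print(half_pi_mul == half_pi_div)  # True
print(math.sin(half_pi_mul))       # 1.0
```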


------------------------------------------------------------
section .data
data dd 0.5

section .text
global _start
_start:
nop
finit
fldpi ; load PI in ST0
fmul dword [data]
fsin

mov eax,1 ; 'exit' system call
mov ebx,0 ; exit status 0
int 0x80
------------------------------------------------------------
GDB:

10 fldpi ; load PI in ST0
1: $st0 = 0

11 fmul dword [data]
1: $st0 = 3.1415926535897932385128089594061862 ---> PI

12 fsin
1: $st0 = 1.5707963267948966192564044797030931 ---> PI/2

14 mov eax,1
1: $st0 = 1 ---> Sin (PI/2)
------------------------------------------------------------

 

Using the FSIN instruction to compute SINE of an angle

Use the fsin instruction to compute the sine of an angle.
The fsin instruction has an implied operand (top of the floating point stack).
The top of the floating point stack is denoted by ST0 register.

The fld instruction loads the floating point value from the location 'theta' into ST0:
fld dword [theta]

Now the operand is in ST0. The fsin instruction computes the sine of the angle in ST0 and overwrites ST0 with the result. Note: the operand to fsin is in radians (not degrees). In the example below theta contains 1.57 (~PI/2 radians), so at the end of the program we expect a value close to Sin(PI/2) in ST0.


-----------------------------------------------------
section .data
theta dd 1.57


section .text
global _start
_start:
nop
finit
fld dword [theta]
fsin

mov eax,1
mov ebx, 0
int 0x80
-------------------------------------------------

It is easy to see the program-flow with gdb:

9 finit
1: $st0 = 0 -> ST0 = 0

10 fld dword [theta] <--- next instruction to be executed.
1: $st0 = 0

11 fsin <--- next instruction to be executed
1: $st0 = 1.57000005245208740234375 (Result of FLD is in ST0)

13 mov eax,1
1: $st0 = 0.99999968297360224280629845128309796 <--- (Result of FSIN is in ST0)

--------------------------------------------------------------
The value in ST0 is 0.9999 which is close to the expected value of 1. Note that the inaccuracy comes from the fact that our input is 1.57 which is only an approximation of PI/2.
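
To see where the small error comes from, here is a Python sketch. It assumes that 'dd 1.57' stores the value rounded to single precision, which matches the value gdb shows after the fld; the helper name to_single is made up for illustration, and Python computes in doubles rather than the x87 extended format.

```python
import math
import struct

def to_single(x):
    """Round a Python float (a double) to single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

theta = to_single(1.57)        # the value that fld actually loads
result = math.sin(theta)
error = 1.0 - result           # distance from Sin(PI/2) = 1

print(theta)
print(result)
print(math.pi / 2 - theta)     # how far theta is from PI/2
```

The input is off from PI/2 by roughly 0.0008, and since the sine curve is flat near its peak the result is off by only about 3.2e-7.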

Saturday, May 2, 2009

 

Representation of Floating point numbers in a microprocessor

A floating point number can be represented in any of the following formats:
a) Single precision (32 bits)
b) Double precision (64 bits)
c) Double-Extended precision (80 bits)
In each format, a floating point number has the form: (-1) ^ sign * mantissa * 2 ^ exponent

Single precision uses 32 bits:
1 bit (bit 31) is for the sign.
8 bits (bits 30:23) are for the exponent. Note that the exponent is biased.
23 bits (bits 22:0) are for the mantissa.

Double precision uses 64 bits:
1 bit (bit 63) is for the sign.
11 bits (bits 62:52) are for the exponent. Note that the exponent is biased.
52 bits (bits 51:0) are for the mantissa.

Double-Extended precision uses 80 bits:
1 bit (bit 79) is for the sign.
15 bits (bits 78:64) are for the exponent. Note that the exponent is biased.
64 bits (bits 63:0) are for the mantissa, including an explicit integer bit (bit 63).



How is a number like 10.25 represented in single precision format?

10.25 (base 10) = 1010.01 (base 2)
The first step is to normalize the binary number. Normalization means rewriting the number in the form 1.xxxxxx * 2 ^ exponent.
1010.01 = 1.01001 * 2 ^ 3
The above representation is of the form mantissa * 2 ^ exponent.
Mantissa = 1.01001
Exponent = 3
This is the information that is stored in the single precision format. The thing to note is that since the numbers are always normalized (i.e., 1.xxxxx), the '1' in the integer portion of the mantissa is not saved explicitly but is implied. In the case of the exponent, a biased exponent is saved instead of the 'real' exponent.
Biased exponent = Real exponent + 127 (for single precision)
For the above example, biased exponent = 3 + 127 = 130.
So we now have,
Mantissa = 01001 (omitting the 1 and the binary point)
Exponent=1000 0010 (130 in binary)
Sign bit = 0 (positive number)

Constructing the 32-bit register we have:
bit 31 sign -> 0
bits 30:23 exponent -> 1000 0010
bits 22:0 mantissa -> 01001000000000000000000

So the register would read the following:
0100 0001 0010 0100 0000 0000 0000 0000
In hex:
0x41240000
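
The hand calculation above can be checked with a short Python sketch that uses the standard struct module to look at the bit pattern. The helper names single_bits and fields are made up for illustration.

```python
import struct

def single_bits(x):
    """Return the 32-bit IEEE 754 single precision pattern of x as an int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def fields(bits):
    """Split a 32-bit pattern into (sign, biased exponent, mantissa bits)."""
    sign = bits >> 31
    biased_exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, biased_exponent, mantissa

if __name__ == "__main__":
    bits = single_bits(10.25)
    sign, biased_exponent, mantissa = fields(bits)
    print(hex(bits))                     # 0x41240000
    print(sign, biased_exponent - 127)   # 0 3
    print(format(mantissa, "023b"))      # 01001000000000000000000
```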

The same concept described above can be extended to double precision and double extended precision formats. Note that in double-extended precision format, the integer part (that is implied in single/double precision) is included and is specified explicitly in bit 63.

A floating point calculator is a convenient way to see how a number is represented in the different formats. I recommend the IEEE floating point calculator here:

http://babbage.cs.qc.edu/IEEE-754/Decimal.html



Friday, April 17, 2009

 

Another Hello World program in assembly using puts

Program that displays 'hello world' using 'puts'.



extern puts ; puts is defined externally (in the C library)

section .data
msg db "Hello World",0

section .text
global main
BITS 32
main:
push msg ; 32-bit cdecl: arguments are passed on the stack
call puts
add esp, 4 ; the caller removes the argument from the stack
mov eax, 0 ; return 0 from main
ret


There are minor changes from a standalone assembly program:
1) Notice the extern declaration at the top. This declaration is there so that the assembler does not complain about the 'call puts' instruction. The extern declaration just says that the function is defined elsewhere and that the symbol 'puts' will be resolved at link time.


2) Notice that the '_start' symbol used in stand-alone assembly has been replaced by 'main'. This is necessary because this is not a stand-alone program and will be linked differently: the C startup code that gcc links in already defines _start. If you use the _start symbol in your assembly program, you will get an error at link time about multiple definitions of _start.


3) push msg - pushes the address of the msg buffer onto the stack. In the 32-bit C calling convention (cdecl), function arguments are passed on the stack, so this is how puts receives its argument.

4) call puts - calls the C library's puts function, which displays the message on the screen. The add esp, 4 that follows removes the argument from the stack.



To assemble:
nasm -felf print.asm

To link:
gcc print.o -> this will resolve the symbol puts, so that the call in the assembly code transfers control to the C library's puts. (On a 64-bit host, the 32-bit object typically needs gcc -m32 print.o.)

Run:
./a.out

------------------------------------------------------------------------------------------------------------------------------------
It is also useful to write a simple C program (for example, a hello world program) and then use gcc with the -S switch to generate a .s file that contains the assembly. Keep in mind that the assembly generated this way uses the GNU assembler (AT&T) syntax, not the NASM syntax used above.

If your C program is hello.c, you would do this to generate the assembly:
gcc hello.c -S

This will generate hello.s with the assembly code.

-----------------------------------------------------------------------------------------------

Thursday, April 16, 2009

 

CPUID - a simple example

CPUID is an instruction used to query processor-specific information. The program below executes cpuid with leaf 0 (eax = 0), which returns the vendor_id. On Linux systems, /proc/cpuinfo also lists the vendor_id.

%macro exitprog 0
mov ebx, 0
mov eax, 1
int 0x80
%endmacro

%macro showstring 3
mov eax, 4
mov ebx, %1
mov ecx, %2
mov edx, %3
int 0x80
%endmacro

section .data
here:
times 1 dd 0
eos:
times 1 dd 0
ebx_data dd 0
edx_data dd 0
ecx_data dd 0


segment .text
global _start
_start:
mov eax, 0
cpuid
mov dword [ebx_data], ebx
mov dword [edx_data], edx
mov dword [ecx_data], ecx
showstring 1,ebx_data,12
exitprog




Only the first 2 instructions are important. When cpuid is executed with a value of 0 in eax, it returns the vendor_id information in ebx,edx and ecx. The other instructions in the program merely display the string.

Notice the use of macros. The 'showstring' macro is the file-write system call used in the 'hello world' program. The 'exitprog' is the sys-exit system call.


Upon assembling and running the program the following output is obtained:
GenuineIntel

Please note that depending on the CPU, the output might vary.

On an AMD machine, the following output will be obtained:
AuthenticAMD

(In both cases the result is a 12-byte string returned in ebx, edx and ecx, in that order.)
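
To illustrate how the three registers form the string, here is a Python sketch. The register values are the well-known leaf 0 results for the two vendors mentioned above; the function name vendor_id is made up for illustration.

```python
import struct

def vendor_id(ebx, edx, ecx):
    """Lay the three 32-bit registers out in memory (little-endian),
    in the same order the program stores them: ebx, edx, ecx."""
    return struct.pack("<III", ebx, edx, ecx).decode("ascii")

if __name__ == "__main__":
    print(vendor_id(0x756E6547, 0x49656E69, 0x6C65746E))  # GenuineIntel
    print(vendor_id(0x68747541, 0x69746E65, 0x444D4163))  # AuthenticAMD
```

Each register holds four characters with the first character in the lowest byte, which is why storing them back to back in memory yields a readable string.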

 

Hello World program in assembly

Why break the tradition?



section .data
msg db "Hello, world!" ; no terminating 0 needed: the length is passed explicitly
len equ $ - msg
section .text

global _start

_start:

mov edx,len
mov ecx,msg
mov ebx,1
mov eax,4 ; eax = 4 -> write to file
int 0x80

mov ebx,0

mov eax,1 ; eax = 1 -> exit
int 0x80


Key points:

The only interesting thing in the above program is the instruction int 0x80. int 0x80 is the software interrupt used to make a system call on 32-bit Linux. It provides several different services, and the service requested depends on the value in eax.

The first call to int 0x80 is with eax = 4. Eax=4 signifies 'write to a file'. The file descriptor is provided in the ebx register. In this case ebx = 1, which refers to stdout. Note that since we are writing to stdout, there is no need to open the file. ecx holds the pointer to the data and edx holds the length. The length is computed by subtracting the address of the first byte from the address of the byte just after the last byte.

For example, here is what the string layout will look like in memory (assume start address 0x100) :

0x100 -> 'H' 'e' 'l' 'l' 'o' ',' ' ' 'w' 'o' 'r' 'l' 'd' '!' <-0x10C

Each character is assigned a byte. So the starting address is 0x100 and the address of the byte after the last byte is 0x10d.

This is accomplished by the following line of code in the program:
; $ = 0x10d, msg = 0x100, so len = 0xd
len equ $ - msg
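
The same arithmetic, written out as a Python sketch. The start address 0x100 is the assumed address from the layout above, and the variable names are only illustrative.

```python
msg = b"Hello, world!"          # the 13 characters shown in the layout

msg_address = 0x100             # assumed start address from the example
here = msg_address + len(msg)   # what '$' evaluates to right after msg
length = here - msg_address     # len equ $ - msg

print(hex(here), hex(length))   # 0x10d 0xd
```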


The second call to int 0x80 is with eax = 1. Eax=1 signifies the exit system call. Ebx contains the exit status. Think of it as the exit(0) that you see in C programs.


Build the program:
I use the nasm assembler. It is typically available through the Linux distribution's package manager; just search for nasm and install it.

Assemble:
1) nasm -felf hello.asm

will create hello.o in the same directory.

Link:
2) ld hello.o

will create a.out. (On a 64-bit host, the 32-bit object typically needs ld -m elf_i386 hello.o.)

Run:
./a.out

The output will be the 'Hello, world!' string.
