          ASMLIB.TXT                                     2004-07-13 Agner Fog

User instructions for assembly function library
================================================

This is a library of functions written in assembly language. They are
implemented in assembly either for improved speed or because they cannot
be implemented in a high-level language.

The functions in this library can be called from programs written in C++,
assembly and other languages for 32-bit platforms with Intel-compatible
microprocessors under Windows, Linux, BSD and other operating systems.

The three versions asmlibM.lib, asmlibO.lib, and asmlibE.a contain the same
functions and are made from the same source:

*  asmlibM.lib uses the MS-COFF object file format

*  asmlibO.lib uses the OMF object file format

*  asmlibE.a   uses the ELF object file format

For the Microsoft C++ compiler, use asmlibM.lib.
For the Borland C++ compiler, use asmlibO.lib.
For the Gnu C++ compiler under Linux, BSD, UNIX, etc., use asmlibE.a.
For the Gnu C++ compiler under Windows, use asmlibM.lib renamed to asmlibM.a.
For Borland Delphi Pascal, use the *.OBJ files (OMF format).
For other compilers, try each version to find out which one works.

To use these functions, add the appropriate library file to your project and
#include "asmlib.h" at the top of your C++ file.

Downloaded from www.agner.org/optimize/

© 2003, 2006. All software in this package is copyrighted under the GNU
General Public License (www.gnu.org/copyleft/gpl.html).



Function descriptions:
======================

extern "C" int Round (double x);
--------------------------------
Converts a floating point number to the nearest integer. When two integers
are equally near, the even integer is chosen. This function does not check
for overflow. This function is much faster than the default way of converting
floating point numbers to integers in C++, which involves truncation.
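
The difference between rounding to the nearest even integer and the
truncating C++ cast can be illustrated in portable C++. This is only a
sketch, not the assembly implementation in this library: RoundReference and
CastReference are hypothetical names, and the sketch assumes that the
floating point rounding mode is the default (round to nearest).

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Hypothetical stand-in for Round(), for illustration only.
// Assumes the default rounding mode (round to nearest, ties to even).
// Like Round(), it does not check for overflow.
int RoundReference(double x) {
    assert(std::fegetround() == FE_TONEAREST);  // the stated assumption
    return static_cast<int>(std::lrint(x));
}

// The standard C++ conversion, which truncates towards zero.
int CastReference(double x) {
    return static_cast<int>(x);
}
```

Under these assumptions, RoundReference(2.5) gives 2 and RoundReference(3.5)
gives 4, whereas CastReference gives 2 and 3 for the same inputs.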


extern "C" int Truncate (double x);
-----------------------------------
Converts a floating point number to an integer, with truncation towards zero.
This function does not check for overflow. In case of overflow, you may get 
an exception or an invalid result. This function is faster than the standard
C++ type-casting:  int i = (int)x;
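
Because overflow is not checked, a caller that cannot guarantee the range of
its input may want to test the range first. The following is a minimal sketch
of such a test. TruncateChecked is a hypothetical helper that is not part of
this library, and it uses the ordinary C++ cast where a real program would
call Truncate.

```cpp
#include <climits>

// Hypothetical helper, not part of the library.
// Returns true and stores the truncated value in 'result' if x, truncated
// towards zero, fits in an int. Returns false for out-of-range values and
// for NaN (every comparison with NaN is false), leaving 'result' untouched.
bool TruncateChecked(double x, int &result) {
    const double lowerLimit = static_cast<double>(INT_MIN) - 1.0;
    const double upperLimit = static_cast<double>(INT_MAX) + 1.0;
    if (!(x > lowerLimit && x < upperLimit)) {
        return false;
    }
    result = static_cast<int>(x);  // a real program would call Truncate(x) here
    return true;
}
```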


extern "C" int MinI (int a, int b);
-----------------------------------
Returns the smaller of two signed integers. Also works with unsigned
integers if both numbers are smaller than 2^31. This is faster than a C++
branch if the branch is unpredictable.


extern "C" int MaxI (int a, int b);
-----------------------------------
Returns the bigger of two signed integers. Also works with unsigned
integers if both numbers are smaller than 2^31. This is faster than a C++
branch if the branch is unpredictable.
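
To illustrate what selecting a minimum or maximum without a branch means, here
is a sketch in C++ that uses a bit mask. It is not taken from the library
source, which may well use a different technique such as a conditional move
instruction. MinBranchFree and MaxBranchFree are hypothetical names, and the
sketch assumes two's complement integers, where -1 has all bits set.

```cpp
// Hypothetical branch-free equivalents of MinI and MaxI, for illustration.
// The mask is all ones (-1) when a < b and all zeros otherwise, so the
// bitwise expression selects one of the operands without a conditional jump.
int MinBranchFree(int a, int b) {
    int mask = -static_cast<int>(a < b);
    return b ^ ((a ^ b) & mask);  // a if a < b, otherwise b
}

int MaxBranchFree(int a, int b) {
    int mask = -static_cast<int>(a < b);
    return a ^ ((a ^ b) & mask);  // b if a < b, otherwise a
}
```

The comparison is signed, which is why unsigned arguments only give correct
results below 2^31: a larger unsigned value has its top bit set and is read
as a negative number.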


extern "C" double MinD (double a, double b);
--------------------------------------------
Returns the smaller of two double precision floating point numbers. This
is faster than a C++ implementation.


extern "C" double MaxD (double a, double b);
--------------------------------------------
Returns the bigger of two double precision floating point numbers. This
is faster than a C++ implementation.


extern "C" int InstructionSet (void);
-------------------------------------
This function detects which instructions are supported by the microprocessor
and the operating system. See www.agner.org/optimize/optimizing_assembly.pdf
for a discussion of the method used for checking XMM operating system support.

Return value:
 0          = use 80386 instruction set only
 1 or above = MMX instructions can be used
 2 or above = conditional move and FCOMI can be used
 3 or above = SSE (XMM) supported by processor and enabled by operating system
 4 or above = SSE2 supported by processor and enabled by operating system
 5 or above = SSE3 supported by processor and enabled by operating system
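
Because each level is defined as "N or above", the return value is best
tested with >= comparisons, starting from the highest level the program knows
about. The sketch below shows this pattern. InstructionSetName is a
hypothetical helper, not part of the library, and the strings it returns are
chosen for illustration only.

```cpp
#include <string>

// Hypothetical helper, not part of the library. Maps a value returned by
// InstructionSet() to the highest level listed in the table above.
// Testing with >= means that a value above 5 still selects the SSE3 path.
std::string InstructionSetName(int level) {
    if (level >= 5) return "SSE3";
    if (level >= 4) return "SSE2";
    if (level >= 3) return "SSE";
    if (level >= 2) return "CMOV/FCOMI";
    if (level >= 1) return "MMX";
    return "80386";
}
```

The same chain of comparisons can be used to choose between different
versions of a function, for example InstructionSetName(InstructionSet()).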


extern "C" int DetectProcessor (void);
--------------------------------------
This function detects the microprocessor type and determines which features
are supported. It gives more detailed information than InstructionSet().

The return value is a combination of bits indicating different features.
The return value is 0 if the microprocessor has no CPUID instruction.

 bits     value       meaning
----------------------------------------------------------------------------
 0-3      0x0F        model number
 4-7      0xF0        family:  0x40 for 80486, Am486, Am5x86
                               0x50 for P1, PMMX, K6
                               0x60 for PPro, P2, P3, Athlon, Duron
                               0xF0 for P4, Athlon64, Opteron
   8      0x100       vendor is Intel
   9      0x200       vendor is AMD
  11      0x800       XMM registers (SSE) enabled by operating system
  12      0x1000      floating point instructions supported
  13      0x2000      time stamp counter supported
  14      0x4000      CMPXCHG8B instruction supported
  15      0x8000      conditional move and FCOMI supported (PPro, P2, P3,
                      P4, Athlon, Duron, Opteron)
  23      0x800000    MMX instructions supported (PMMX, P2, P3, P4, K6,
                      Athlon, Duron, Opteron)
  25      0x2000000   SSE instructions supported (P3, P4, Athlon64, Opteron)
  26      0x4000000   SSE2 instructions supported (P4, Athlon64, Opteron)
  27      0x8000000   SSE3 instructions supported (forthcoming "Prescott")
  28      0x10000000  hyperthreading supported (P4)
  29      0x20000000  MMX extension instructions (AMD only)
  30      0x40000000  3DNow extension instructions (AMD only)
  31      0x80000000  3DNow instructions (AMD only)
----------------------------------------------------------------------------
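
The bits in the table can be picked apart with masks. The following sketch
decodes a few of them. ProcessorFeatures and DecodeProcessor are hypothetical
names that do not appear in asmlib.h, and only some of the bits are covered.
The value is handled as unsigned because bit 31 is the sign bit of an int.

```cpp
// Hypothetical structure holding a few fields decoded from the value
// returned by DetectProcessor(). The field names are illustrative only.
struct ProcessorFeatures {
    unsigned model;       // bits 0-3
    unsigned family;      // bits 4-7, left in place: 0x40, 0x50, 0x60 or 0xF0
    bool intel;           // bit 8
    bool amd;             // bit 9
    bool sseEnabledByOS;  // bit 11
    bool mmx;             // bit 23
    bool sse;             // bit 25
    bool sse2;            // bit 26
    bool sse3;            // bit 27
    bool amd3DNow;        // bit 31
};

// Hypothetical decoder. Intended use, assuming asmlib.h is included:
//     DecodeProcessor(static_cast<unsigned>(DetectProcessor()))
ProcessorFeatures DecodeProcessor(unsigned bits) {
    ProcessorFeatures f;
    f.model          = bits & 0x0Fu;
    f.family         = bits & 0xF0u;
    f.intel          = (bits & 0x100u) != 0;
    f.amd            = (bits & 0x200u) != 0;
    f.sseEnabledByOS = (bits & 0x800u) != 0;
    f.mmx            = (bits & 0x800000u) != 0;
    f.sse            = (bits & 0x2000000u) != 0;
    f.sse2           = (bits & 0x4000000u) != 0;
    f.sse3           = (bits & 0x8000000u) != 0;
    f.amd3DNow       = (bits & 0x80000000u) != 0;
    return f;
}
```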


extern "C" void ProcessorName (char * text);
--------------------------------------------
Makes a zero-terminated text string with a description of the microprocessor.
The string is stored in the parameter "text", which must be a character array
of size at least 68.
 

extern "C" int ReadClock (void);
--------------------------------
This function returns the value of the internal clock counter in the
microprocessor. To count how many clock cycles a piece of code takes, call
ReadClock before and after the code and calculate the difference. The count
may vary considerably from run to run because interrupts cannot always be
prevented while your code executes. See www.agner.org/assem/pentopt.pdf for
a discussion of how to measure execution time most accurately. The ReadClock
function itself takes approximately 700 clock cycles on a Pentium 4, and
approximately 225 clock cycles on Pentium II and Pentium III. It does not
work on 80386 and 80486 processors.
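
One common way to deal with the varying counts and with the cost of the clock
reading itself is to repeat the measurement, keep the smallest count, and
subtract the smallest count obtained with no code in between. The sketch
below shows this pattern. It is an assumption about how the function might be
used, not a method taken from the manual referred to above. ClockOverhead and
MeasureCycles are hypothetical names, and the clock function is passed as a
parameter so that ReadClock can be supplied by the caller.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper: the cost of reading the clock, taken as the smallest
// difference between two back-to-back readings over a number of runs.
unsigned ClockOverhead(const std::function<int()> &readClock, int runs) {
    assert(runs > 0);
    unsigned best = ~0u;
    for (int i = 0; i < runs; ++i) {
        unsigned before = static_cast<unsigned>(readClock());
        unsigned after = static_cast<unsigned>(readClock());
        unsigned elapsed = after - before;  // unsigned: well defined on wrap
        if (elapsed < best) best = elapsed;
    }
    return best;
}

// Hypothetical helper: the smallest count measured for 'code' over a number
// of runs, with the clock reading overhead subtracted.
unsigned MeasureCycles(const std::function<int()> &readClock,
                       const std::function<void()> &code, int runs) {
    assert(runs > 0);
    unsigned overhead = ClockOverhead(readClock, runs);
    unsigned best = ~0u;
    for (int i = 0; i < runs; ++i) {
        unsigned before = static_cast<unsigned>(readClock());
        code();
        unsigned after = static_cast<unsigned>(readClock());
        unsigned elapsed = after - before;
        if (elapsed < best) best = elapsed;
    }
    return best > overhead ? best - overhead : 0u;
}
```

With asmlib.h included, a call might look like
MeasureCycles(ReadClock, MyFunction, 10), where MyFunction stands for the
code to be measured.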


List of files contained in asmlib.zip
=====================================

asmlib.txt    This file
asmlibM.lib   Function library in MS-COFF format
asmlibO.lib   Same function library in OMF format
asmlibE.a     Same function library in ELF format
*.obj         Object files for Borland Delphi (OMF format)
asmlib.h      Header file to include in C++ files
round.asm     Source code for function Round
truncate.asm  Source code for function Truncate
minmaxi.asm   Source code for functions MinI and MaxI
minmaxd.asm   Source code for functions MinD and MaxD
instrset.asm  Source code for function InstructionSet
detectpr.asm  Source code for functions DetectProcessor and ProcessorName
rdtsc.asm     Source code for function ReadClock
makefile      Makefile for building the library

