I ran across this
http://cache-www.intel.com/cd/00/00/01/76/17699_code_zohar.pdf which is a math library using SSE2 to do fast math operations. I spent a lot of time upgrading the code to be more suitable for games, such as using a transpose to compute the inverse of an orthonormal matrix. But after profiling I found the code was slower than the equivalent D3DX math functions, even for matrix multiply.
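For anyone wondering what the transpose trick looks like, here is a minimal sketch. It is not the code from the Intel paper or from my modified version. It assumes the D3DX layout (row vectors, translation in the fourth row) and a matrix whose upper 3x3 is a pure rotation with no scale. Mat4 is a stand-in struct so it compiles without the DirectX SDK, and InverseOrthonormal is a made-up name.

```cpp
#include <cassert>

// Stand-in for D3DXMATRIX so the sketch builds without the DirectX SDK.
struct Mat4 { float m[4][4]; };

// Hypothetical helper: inverse of a rigid transform (rotation + translation).
// Assumes row vectors (p' = p * M), translation in row 3, and an
// orthonormal upper 3x3. It gives wrong results if the matrix has scale.
inline Mat4 InverseOrthonormal(const Mat4 &in)
{
    Mat4 out;

    // The inverse of an orthonormal 3x3 is its transpose.
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j)
            out.m[i][j] = in.m[j][i];
        out.m[i][3] = 0.0f;
    }

    // New translation is -t * R^T, i.e. minus the dot product of the old
    // translation with each row of the original rotation.
    for (int j = 0; j < 3; ++j) {
        out.m[3][j] = -(in.m[3][0] * in.m[j][0] +
                        in.m[3][1] * in.m[j][1] +
                        in.m[3][2] * in.m[j][2]);
    }
    out.m[3][3] = 1.0f;
    return out;
}
```

This avoids the determinant and cofactor work of a general 4x4 inverse, which is why it is attractive for game transforms.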
D3DXMatrixMultiply 10000000 times:
diffGP=564 milliseconds diffDX=396 milliseconds
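A timing loop for this kind of comparison can be sketched roughly as follows. This is not the harness I used for the numbers above. Mat4, Multiply, and TimeMilliseconds are stand-in names so the example builds without D3DX, and with D3DX you would call D3DXMatrixMultiply inside the timed callback instead.

```cpp
#include <cassert>
#include <chrono>

// Stand-in for D3DXMATRIX so the sketch builds without the DirectX SDK.
struct Mat4 { float m[4][4]; };

// Plain scalar 4x4 multiply, used here only as something to time.
inline Mat4 Multiply(const Mat4 &a, const Mat4 &b)
{
    Mat4 out;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a.m[i][k] * b.m[k][j];
            out.m[i][j] = sum;
        }
    }
    return out;
}

// Hypothetical helper: runs fn `iterations` times and returns the elapsed
// wall-clock time in milliseconds. The callback should write its result
// somewhere that is used afterwards, or an optimizing compiler may remove
// the work being measured.
template <typename Fn>
long long TimeMilliseconds(Fn fn, int iterations)
{
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        fn();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}
```

Calling TimeMilliseconds once per implementation with the same iteration count (10000000 in the runs above) gives two numbers that can be compared directly.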
I also tried this: http://www.cs.nmsu.edu/CSWS/techRpt/2003-003.ps
It appears that D3DX already does better than this as well:
Theirs=695 milliseconds Mine=801 milliseconds
The version without scaling was 100 milliseconds faster, but that is such a special case that it’s not worth leaving in.
So kudos to Direct3D, because their math functions are very fast!
By the way, here is something I was able to figure out while experimenting: how to get the at, up, and right vectors out of a matrix. In every library I’ve ever used, except the one at The Collective, this was very unclear. I think they used to always store the matrices transposed to make them easier to use or something.
inline D3DXVECTOR3 * GetAtVec(D3DXVECTOR3 *out, D3DXMATRIX *in);
inline D3DXVECTOR3 * GetUpVec(D3DXVECTOR3 *out, D3DXMATRIX *in);
inline D3DXVECTOR3 * GetRightVec(D3DXVECTOR3 *out, D3DXMATRIX *in);
inline D3DXVECTOR3 GetAtVec(D3DXMATRIX *in);
inline D3DXVECTOR3 GetUpVec(D3DXMATRIX *in);
inline D3DXVECTOR3 GetRightVec(D3DXMATRIX *in);
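Only the signatures are shown above, so here is a rough sketch of what such helpers could look like. It rests on an assumption: for a D3DX-style world matrix (row vectors, so a point is transformed as p * M), row 0 is the right axis, row 1 is the up axis, and row 2 is the at (forward) axis. A view matrix built by something like D3DXMatrixLookAtLH keeps its axes in the columns instead, so this would not apply there. Vec3 and Mat4 are stand-ins for D3DXVECTOR3 and D3DXMATRIX so the code compiles without the DirectX SDK.

```cpp
#include <cassert>

// Stand-ins for D3DXVECTOR3 and D3DXMATRIX (whose members are _11 .. _44).
struct Vec3 { float x, y, z; };
struct Mat4 { float m[4][4]; };

// Assumption: world matrix in row-vector convention, so each of the first
// three rows is one local axis expressed in world space.
inline Vec3 GetRightVec(const Mat4 &in) { return { in.m[0][0], in.m[0][1], in.m[0][2] }; }
inline Vec3 GetUpVec(const Mat4 &in)    { return { in.m[1][0], in.m[1][1], in.m[1][2] }; }
inline Vec3 GetAtVec(const Mat4 &in)    { return { in.m[2][0], in.m[2][1], in.m[2][2] }; }

// Pointer-style overloads in the spirit of the D3DX API: fill *out and
// return it so calls can be chained.
inline Vec3 *GetRightVec(Vec3 *out, const Mat4 *in) { *out = GetRightVec(*in); return out; }
inline Vec3 *GetUpVec(Vec3 *out, const Mat4 *in)    { *out = GetUpVec(*in);    return out; }
inline Vec3 *GetAtVec(Vec3 *out, const Mat4 *in)    { *out = GetAtVec(*in);    return out; }
```

If a library stores its matrices transposed, the same three vectors come from the first three columns instead of the rows.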