Tuesday, March 31, 2009

Speed with a Catch

A while back, I wrote a post about surface normals in OpenGL ES. Yesterday on Twitter, there was some discussion about using the inverse square root function from Quake 3 to speed up the performance of iPhone OpenGL ES applications. Here is what that method looks like (converted to using GL and iPhone data types):

static inline GLfloat InvSqrt(GLfloat x)
{
GLfloat xhalf = 0.5f * x;
int i = *(int*)&x; // store floating-point bits in integer
i = 0x5f3759d5 - (i >> 1); // initial guess for Newton's method
x = *(GLfloat*)&i; // convert new bits into float
x = x*(1.5f - xhalf*x*x); // One round of Newton's method
return x;
}

The inverse square root can be used in several ways. Noel Llopis of Snappy Touch pointed out two uses for it on Twitter yesterday: calculating normals and doing spherical UV texture mapping. I'm still trying to wrap my head around the UV Texture Mapping, but I understand normals pretty well at this point, so I though I'd see what kind of performance gains I could get using this old optimization. There's all sorts of arguments around the intertubes about whether this function still gives performance gains, but there's an easy way to find out: use it and measure with Shark.

I used my Wavefront OBJ Loader as a test, and profiled the loading of the most complex of the three objects - the airplane. The first run was using my original code, which stupidly1 used sqrt(). I then re-ran it using sqrtf(), and then again using the Quake3D InvSqrt() function above.

The results were impressive, and you definitely do get a performance increase from using this decade-old function on the iPhone. Using InvSqrt() gave a 15% decrease in time spent calculating surface normals over using sqrtf() and a 40% decrease over calculating with sqrt(). That's not an amount to be sneezed at, especially in situations where you need to calculate normals on the fly many times a second.

Now, if you remember, this was how we calculated normals using the square root function from Math.h:

static inline GLfloat Vector3DMagnitude(Vector3D vector)
{
return sqrt((vector.x * vector.x) + (vector.y * vector.y) + (vector.z * vector.z));
}

static inline void Vector3DNormalize(Vector3D *vector)
{
GLfloat vecMag = Vector3DMagnitude(*vector);
if ( vecMag == 0.0 )
{
vector->x = 1.0;
vector->y = 0.0;
vector->z = 0.0;
}

vector->x /= vecMag;
vector->y /= vecMag;
vector->z /= vecMag;
}

So... how can we tweak this to use inverse square root? Well, the inverse square root of a number is simply 1 divided by the square root of that number. In Vector3DNormalize(), we divide each of the components of the vector (x,y,and z) by the magnitude of the vector, which is calculated using square root. Since dividing a value by a number is the same as multiplying by 1 divided by that same number, so, we can just multiply each component by the inverse magnitude instead, like so:

static inline GLfloat Vector3DFastInverseMagnitude(Vector3D vector)
{
return InvSqrt((vector.x * vector.x) + (vector.y * vector.y) + (vector.z * vector.z));
}

static inline void Vector3DFastNormalize(Vector3D *vector)
{
GLfloat vecInverseMag = Vector3DFastInverseMagnitude(*vector);
if (vecInverseMag == 0.0)
{
vector->x = 1.0;
vector->y = 0.0;
vector->z = 0.0;
}

vector->x *= vecInverseMag;
vector->y *= vecInverseMag;
vector->z *= vecInverseMag;
}


Sweet, right? If we now use Vector3DFastNormalize() instead of Vector3DNormalize(), and each call will be about 15% faster on current generations of the iPhone and iPod Touch compared to using the built-in square root function.

But… there's a catch. Actually, two catches.

The Catches


The first catch is that this optimization doesn't work faster on all hardware. In fact, on some hardware, it is measurably slower than using sqrtf(). That means you're gambling that future hardware will also benefit from this same optimization. Not a huge deal and very possibly a safe bet, but you should be aware of it, and be prepared to back it out quickly should Apple release a new generation of iPhones and iPod Touches that use a different processor.

The second, and far more important catch is the possible legal ramifications of using this code. You see, Id released Quake3D's source code under the GNU Public License, which is a viral license. If you use source code from a GPL project, you have to open source your entire project under the GPL as well. Now, that's an oversimplification, and there are ways around the GPL, but as a general rule, if you use GPL'd code, you have to make your code GPL also.

But, the waters are a little murky. John Carmack has admitted that he didn't write that function, and doesn't think the other programmers at Id did either. The actual author of the code is unknown. Some of the contributors to the function have been found, but not the original author. That means the code MIGHT be in the public domain. If that's the case, its inclusion in a GPL application doesn't take it out of the public domain.

So, bottom line: is it safe to use? Probably. This function is widely known and widely used and there's been no indication that any possible rights owner has any interest in chasing down every use of this function. Are there any guarantees? Nope.

My recommendation is to use it, but make sure every place you use it, have a backup method that you can fallback on if you need to. If you want some assurance, you could try contacting Id legal and getting a waiver to use that function. I don't know if they'll respond, or if they'll grant it, but the folks at Id have always struck me as good people, so it might be worth an inquiry if you're risk averse.

1 - sqrt() is a double-precision function. Since OpenGL ES doesn't support the GLDouble datatype, which means I was doing the calculation on twice as many bits as needed, and converting back and forth from single to double precision then back again.



19 comments:

Joel Bernstein said...

Two more things scare me about this code.

1. It converts GLFloats to NSIntegers and back even though neither (I think) is guaranteed to always be exactly four bytes long, which strikes me as a likely cause of an impossible-to-diagnose truncation or zero padding bug a few years from now.

2. It bit-shifts a floating point number. I'd have to sit down with a pencil, a pad of paper, and a copy of the IEEE 754 spec just to even understand what the heck that line is doing, and vague comments about Newton's method don't help much. I don't get how you keep your sign bit from bleeding into your exponent into your mantissa.

Maybe there's logical reasons why this stuff doesn't matter, and I'm just not in the proper state of mind to think of them.

Zavie said...

if (vecInverseMag == 0.0)
{
vector->x /= 1.0;
vector->y /= 0.0;
vector->z /= 0.0;
}

I guess there's some typo here... :-)

Jeff LaMarche said...

Zavie:

Yep, you're right - those should be straight assignments, not divisions.

But an occasional division by zero is good for you. :)

Jeff

Jeff LaMarche said...

Joel:

#1 was my fault - shouldn't have changed int to NSInteger. I've gotten so used to using NSInteger, I didn't think about the consequences should we ever get a 64-bit chip on the iPhone. I fixed that.

As for #2, It doesn't actually bit-shift a floating point number, it bit-shifts an integer representation of a floating point number.

But, remember that the square of a number and the square of the negative of that number are the same , and the square root of a negative number is an imaginary number, so the loss of the sign bit simply doesn't matter in this scenario.

Is it precise? No, but the amount of imprecision doesn't seem to be enough to be noticed in a game. You're trading off accuracy for speed.

rectalogic said...

Did you compile the floating point version using -mno-thumb? XCode default is to use the Thumb instruction set which is slow for floating point intensive code.

Scott said...

This post is a bit stale, but I'd also like to point out that you are doing three divisions (which can be very very expensive) vs three multiplications (which are not).

Try calculating the inverse square root in both cases instead of dividing by the square root. I'd not necessarily expect the compiler to be smart enough to catch that.

Also, did you try the proper compiler flags such as -ffast-math and disabling thumb compilation?

windson said...

The inverse square root function was first used at SGI and iD must have copied it. The original algorithm was developed by Newton, so im pretty sure will be in public domain, although im not too sure how IP works.

Hamish said...

It's the code that GPL'ed, not the algorithm. Just take the algotithm and express it with your own code (i.e. Don't cut & paste).

5 lines isn't hard to re-write.

H

a random John said...

Hamish,

LOL!

That's 5 lines of C code that very few C programmers could replicate. How many C programmers could be given the two lines:
int i = *(int*)&x;
i = 0x5f3759d5 - (i >> 1);

and not get incredibly nervous?

Most C programmers and not comfortable with bit twiddling with floats. I say this as somebody that spent years bit twiddling in smart cards and embedded systems and you could give me a week to replicate those five lines and I doubt that I'd have something that worked just as well.

Wizermil said...

I wrote a small test to check how long the calculation of the inv square root takes with 3 diff methods:
1. Newton:
static inline float InvSqrt(float x) {
float xhalf = 0.5f * x;
int i = *(int*)&x; // store floating-point bits in integer
i = 0x5f3759d5 - (i >> 1); // initial guess for Newton's method
x = *(float*)&i; // convert new bits into float
x = x*(1.5f - xhalf*x*x); // One round of Newton's method
return x;
}
2. 1/Square Root based on Newton
static inline float Sqrt(float x) {
float xhalf = 0.5f * x;
const float number = x;
int i = *(int*)&x; // store floating-point bits in integer
i = 0x5f3759d5 - (i >> 1); // initial guess for Newton's method
x = *(float*)&i; // convert new bits into float
x = x*(1.5f - xhalf*x*x); // One round of Newton's method
return number*x;
}
3. 1/Math.h square root

Below you will find my test:
const int iteration = 10000;
float test;
NSDate *startTime = [NSDate date];
for(int i=0;i<iteration;i++) {
test = InvSqrt(i);
}
NSLog(@"InvSqrt: %f", -[startTime timeIntervalSinceNow]);

startTime = [NSDate date];
for(int j=0;j<iteration;j++) {
test = 1.0f / Sqrt(j);
}
NSLog(@"1/Sqrt: %f", -[startTime timeIntervalSinceNow]);

startTime = [NSDate date];
for(int k=0;k<iteration;k++) {
test = 1.0f / sqrt(k);
}
NSLog(@"1/Sqrt: %f", -[startTime timeIntervalSinceNow]);
sqrtf.text = [NSString stringWithFormat:@"1/sqrt: %f", -[startTime timeIntervalSinceNow]];

But when I run this small test the result seems weird.
(BTW I unchecked the thumb compilation for the test and I used a 3G OS 3.3.1)
The Math.h is stable always around 0.000028 and it seems have a faster calculation than the others solutions.
For the 2 others the cost of the div extra cycles seems to have no impact on the calculation time.

Do you have any explanations about this results ? Maybe I did something wrong about it, I didn't think at a cache memory, or it could be due to data alignment ??

Greg said...

@Wizermil, beware of compile-time optimizations by gcc.

Meaning, if the result of your tests can be predetermined at compile time, gcc will do it, and you won't be testing squat at that point.

Some solutions:

- Ask for values at runtime
- Use gcc flags to turn off all optimizations

Wizermil said...
This comment has been removed by the author.
Wizermil said...

@Greg Of course !!! I'm so stupid I forget gcc's optimsations

BTW I now make the calculation based on random value and Math.h implementation is still faster (Thumb off and GCC Optim off) I don't know yet why but a colleague of mine told me that it can be a cache mechanism implemented on the cpu to optimize calculation of operations.

I have not yet figure out what is the explanation but I'm still working on it. If I discover an answer I'll post i.
However if someone already look at it, I'll be glad to learn something new.

Thanks

Mike said...

I believe a safer way to implement this is to avoid dereferencing the type-punned pointer using a simple struct. This should (I believe) allow the compiler to worry about the float value.

I wrote a little about this here: http://blog.reloadsystems.net/2009/11/08/avoid-dereferencing-in-carmacks-implementation-of-approximate-roots/

But this is what it boils down to:

union _data32 {
long l;
float f;
} data32;

float Math::rsqrt( const float value ) {
// Implementation of the Newton Approximation of roots.
long i;
float x, y;
const float f = 1.5f;
x = value * 0.5f;
y = value;
// i = *(long *) &y;
data32.f = y;
i = data32.l;
i = 0x5f3759df - (i>> 1);
// y = *(float *) &i;
data32.l = i;
y = data32.f;
y = y * (f - (x * y * y));
// Uncomment to calculate sqrt
// y = y * (f - (x * y * y));
return y;
}

I have been using this in game physics heavy code professionally without a problem for a while now (including the iPhone).

Hope this helps.

Stefan said...

The VFP's sqrt/div-unit may be somewhat slow, but the CPU can continue working while the VFP is crunching. So you can use the built-in sqrtf() very efficiently if you interleave some other operations after the sqrt and div operations to hide the latency:

float length = sqrtf(...)
// --> do something else to keep the CPU busy
float invLength = 1/length; // pre-compute inverse - the compiler won't do this
// --> do something else to keep the CPU busy
normal.x = vector.x * invLength;
normal.y = vector.y * invLength;

The resulting CPU load for the inverse sqrt is comparable to two small FP Ops (add,sub,mul). Whereas the Newton approximation takes 6 such Operations.

Richard said...

viagra online
generic viagra

Edwin said...

scrub m65 kamagra attorney lawyer body scrub field jacket lovegra marijuana attorney injury lawyer

JeansPilot said...

JeansPilot offers the chance to buy a large variety of men’s and women’s jeans clothing from the world famous Italian Brands.
Online jeans clothing store looks for original fashion clothing sales and clearances of worldwide known designers. We participate in fashion auctions to get the lowest possible price for Top quality Clothes, Shoes and Accessories.
Buy Jeans

h4ns said...

What youre saying is completely true. I know that everybody must say the same thing, but I just think that you put it in a way that everyone can understand. I also love the images you put in here. They fit so well with what youre trying to say. Im sure youll reach so many people with what youve got to say.

Arsenal vs Huddersfield Town live streaming
Arsenal vs Huddersfield Town live streaming
Wolverhampton Wanderers vs Stoke City Live Streaming
Wolverhampton Wanderers vs Stoke City Live Streaming
Notts County vs Manchester City Live Streaming
Notts County vs Manchester City Live Streaming
Bologna vs AS Roma Live Streaming
Bologna vs AS Roma Live Streaming
Juventus vs Udinese Live Streaming
Juventus vs Udinese Live Streaming
Napoli vs Sampdoria Live Streaming
Napoli vs Sampdoria Live Streaming
Fulham vs Tottenham Hotspur Live Streaming
Fulham vs Tottenham Hotspur Live Streaming
AS Monaco vs Marseille Live Streaming
AS Monaco vs Marseille Live Streaming
Alajuelense vs Perez Zeledon Live Streaming
Alajuelense vs Perez Zeledon Live Streaming
Technology News | News Today | Live Streaming TV Channels