I saw a post on the internet yesterday saying that the SSE implementation of inverse square root was 4x faster than the ‘marvelous‘ method of computing it. My internal thought process was along the lines of: “Of course it is! It’s going 4 floats at a time!” and so I decided… someone on the internet MAY be wrong!

So I spent an ill-advised hour or so trying to figure out how to get vector intrinsics to work with the version of GCC that I had installed.

I eventually ended up with code that generated OK-looking x86 assembly (although it does make me miss the PS3/360 vector intrinsics!):

void inv_sqrt_marvelous_sse(float* begin, float* end, float* output) { v4sf onePointFiveFloat = {1.5f, 1.5f, 1.5f, 1.5f}; v4sf zeroPointFiveFloat = {0.5f, 0.5f, 0.5f, 0.5f}; v4si intConst = {0x5f3759df, 0x5f3759df, 0x5f3759df, 0x5f3759df}; v4si intShift = {0x1u, 0x1u, 0x1u, 0x1u}; union { v4sf f; v4si i; } x; v4sf *v4out = (v4sf*)output; for(; begin < end; begin+=4, v4out++){ v4sf orig = *((v4sf*)begin); v4sf half = zeroPointFiveFloat * orig; x.f = half; x.i = intConst - (x.i >> intShift); *v4out = x.f * (onePointFiveFloat - half * x.f * x.f); } }

I didn’t actually get the >> operator to work: my version of gcc on Mac is a bit out of date and so doesn’t support all of the magic vector operations (despite the fact someone may be wrong, I decided that potentially breaking XCode would be taking things too far). So I subbed in a cheap nop and… those SSE implementers made that approximate inverse square root pretty fast! The marvelous version naively vectorized as above was about 2x slower with what I guess would be around an order of magnitude more error. It’d be a lot harder to tell and I’d have to use a much better benchmark than the one I was using to say for sure, but it was enough for me to concede. Regardless… go SSE!

So in this case… I was wrong on the internet! :) At least I learned a bit about gcc intrinsics along the way.

(Update: this guy went into it in much more thorough detail if you’re interested!)