Wednesday, October 31, 2007

RapidMind - Stream programming

Why RapidMind?

RapidMind helps us introduce data parallelism into our programs, and data parallelism can speed a program up considerably. RapidMind provides data parallelism through stream computing.
Stream computing and stream processors are nothing new to most of us by now. The idea is to execute small kernels (functions) over many data elements. Intel SSE and GPUs are examples of stream computing hardware. As of today, RapidMind supports both the Cell BE (IBM's Cell architecture) and GPUs (e.g. NVIDIA, ATI), and x86 support is planned for the near future.
It is quite easy to convert a serial program into a stream program using RapidMind. RapidMind is implemented purely in C++, and everything is wrapped in the rapidmind namespace, so development stays simple.

Now I will walk through an example of converting a normal serial program into a RapidMind program.

Both programs below operate on elements of four floating-point values (16 bytes per element). Let us see how the RapidMind implementation differs from the normal one.

Normal program

float SquareofIndividualSquare( float a, float b )
{
    return ( a*a + b*b ) * ( a*a + b*b );
}

int main()
{
    float* fFirstElement = new float[2048*2048*4];
    float* fSecondElement = new float[2048*2048*4];

    // Fill the input arrays
    int nIndex = 0;
    for ( int i = 0; i < 2048; i++ )
    {
        for ( int j = 0; j < 2048; j++ )
        {
            for ( int floatnum = 0; floatnum < 4; floatnum++, nIndex++ )
            {
                fFirstElement[nIndex] = float(floatnum);
                fSecondElement[nIndex] = float(floatnum);
            }
        }
    }

    // Apply the kernel to every element, one at a time
    nIndex = 0;
    for ( int i = 0; i < 2048; i++ )
    {
        for ( int j = 0; j < 2048; j++ )
        {
            for ( int floatnum = 0; floatnum < 4; floatnum++, nIndex++ )
            {
                fFirstElement[nIndex] = SquareofIndividualSquare( fFirstElement[nIndex], fSecondElement[nIndex] );
            }
        }
    }

    delete[] fFirstElement;
    delete[] fSecondElement;
    return 0;
}

RapidMind Program

#include <rapidmind/platform.hpp>
using namespace rapidmind;

int main()
{
    // Initialize the RapidMind platform
    rapidmind::init();
    // Since the GPU is used, set the backend to GLSL (OpenGL shaders)
    use_backend("glsl");

    // Array is a template class.
    // Value4f means 4 floats per element.
    Array<2,Value4f> a(2048,2048);
    Array<2,Value4f> b(2048,2048);

    // This is how we get access to the actual array storage.
    // These pointers let us manipulate the internal data on the CPU.
    float* fFirstElement = a.write_data();
    float* fSecondElement = b.write_data();

    // Fill the input arrays
    int nIndex = 0;
    for ( int i = 0; i < 2048; i++ )
    {
        for ( int j = 0; j < 2048; j++ )
        {
            for ( int floatnum = 0; floatnum < 4; floatnum++, nIndex++ )
            {
                fFirstElement[nIndex] = float(floatnum);
                fSecondElement[nIndex] = float(floatnum);
            }
        }
    }

    // This array will receive the output data.
    Array<2,Value4f> output;

    // The stream program (kernel) that will be executed on the data.
    // This will run on the GPU.
    Program prg = RM_BEGIN {
        In<Value4f> a;   // First input
        In<Value4f> b;   // Second input
        Out<Value4f> c;  // Output

        c = (a*a + b*b) * (a*a + b*b); // Data manipulation
    } RM_END;

    // Execute the stream program.
    // The result will be available in the output array.
    output = prg(a, b);
    return 0;
}



Description:


We can see that in the RapidMind program the inner for loop is replaced entirely: the operation on all four floats of an element is expressed in one line. This is the advantage of stream computing. You can process many data elements at a time.


Important:

When we measure the performance of the program above, we will find that the CPU version is actually faster. This won't be the case once we do a lot of processing per element. The CPU gives better performance on programs that do very little work on the data, because of CPU caching and memory speed. But if we have a large chunk of data and need to do a lot of processing on it, RapidMind will be the better option. We are also expecting an x86 version of RapidMind; once it is available, it may remove this problem as well.
