Wednesday, October 31, 2007

What is a protected abstract virtual base pure virtual private destructor?

A protected abstract virtual base pure virtual private destructor.

This is a fun question that is rarely answered. The name is very long, but the code needed to make a protected abstract virtual base pure virtual private destructor is quite simple.
The code below creates one.

Program:

class BaseClass // An abstract class
{
public:
    virtual void MakeAbstract() = 0;
};

// A class derived as a protected abstract virtual base
class AbstractBase : virtual protected BaseClass
{
private:
    void MakeAbstract() {}
    // A pure virtual private destructor
    virtual ~AbstractBase() = 0;
    friend class Derived;
};

// A pure virtual destructor still needs a definition,
// because the destructor of Derived calls it.
AbstractBase::~AbstractBase()
{
}

class Derived : protected AbstractBase
{
};

int main()
{
    // You can definitely make an object of class Derived
    Derived obj;
    return 0;
}


Explanation:

In the above program, AbstractBase::~AbstractBase() can be called a protected abstract virtual base pure virtual private destructor. Let us see why.

1. The class AbstractBase is derived as "virtual protected" from an "abstract base" class. So we can call the class AbstractBase a "protected abstract virtual base".

2. Now let us look at the destructor of class AbstractBase. It is declared "pure virtual" and "private". So we can call it a "pure virtual private destructor".

3. Combining both, we can call the destructor of AbstractBase a "protected abstract virtual base pure virtual private destructor".
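
Two properties of a pure virtual destructor make the program above work, and they can be checked in isolation. The sketch below uses hypothetical names (Shape, Circle, can_create_circle) that are not part of the original program, and it assumes a C++11 compiler for static_assert and the type traits header.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical class, used only for this illustration.
struct Shape
{
    virtual ~Shape() = 0;   // pure virtual: Shape cannot be instantiated
};

// Unlike other pure virtual functions, the destructor must still be defined,
// because every derived destructor calls it.
Shape::~Shape() {}

// The implicitly declared destructor of Circle overrides the pure one,
// so Circle is a concrete class.
struct Circle : Shape
{
};

static_assert(std::is_abstract<Shape>::value, "Shape is abstract");
static_assert(!std::is_abstract<Circle>::value, "Circle is concrete");
static_assert(std::has_virtual_destructor<Circle>::value, "destructor stays virtual");

// Returns true after creating and destroying a Circle on the stack.
bool can_create_circle()
{
    Circle c;
    (void)c;
    return true;
}
```

Removing the definition of Shape::~Shape() would typically turn this into a link error, which is why the original program also defines AbstractBase::~AbstractBase().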


Use:

This question can be used to gauge C++ knowledge. In practice it has little use from an implementation point of view. The question was originally posed to show that C++ is too complex and weird. But when we see the code behind such a long name, it works the opposite way: it is quite easy to express a long sentence in very few lines of code.

RapidMind - Stream programming

Why RapidMind?

RapidMind helps us introduce data parallelism into a program. Data parallelism can improve program speed considerably. RapidMind provides data parallelism through stream computing.
Stream computing and stream processors are nothing new to most of us by now. They let us execute kernels (functions) on multiple data elements. Intel SSE and GPUs are examples of stream computing. As of this date, RapidMind supports both the Cell BE (the IBM Cell architecture) and GPUs (such as NVIDIA and ATI), and x86 support is expected in the near future.
It is quite easy to convert a serial program to a stream program using RapidMind. RapidMind is implemented purely in C++, and everything is wrapped in the namespace rapidmind, which keeps development easy.

Now let me give an example of converting a normal serial program to a RapidMind program.

The programs below operate on elements of 4 floating-point values each, that is, 16 bytes per element. Let us see how the implementation differs between a RapidMind program and a normal program.

Normal program

float SquareofIndividualSquare( float a, float b )
{
    return ( a*a + b*b ) * ( a*a + b*b );
}


int main()
{
    float* fFirstElement = new float[2048*2048*4];
    float* fSecondElement = new float[2048*2048*4];

    // Fill the input arrays
    int nIndex = 0;
    for( int i = 0; i < 2048; i++ )
    {
        for ( int j = 0; j < 2048; j++ )
        {
            for ( int floatnum = 0; floatnum < 4; floatnum++, nIndex++ )
            {
                fFirstElement[nIndex] = float(floatnum);
                fSecondElement[nIndex] = float(floatnum);
            }
        }
    }

    // Process every float one at a time
    nIndex = 0;
    for( int i = 0; i < 2048; i++ )
    {
        for ( int j = 0; j < 2048; j++ )
        {
            for ( int floatnum = 0; floatnum < 4; floatnum++, nIndex++ )
            {
                fFirstElement[nIndex] = SquareofIndividualSquare( fFirstElement[nIndex], fSecondElement[nIndex] );
            }
        }
    }

    // Release the buffers
    delete[] fFirstElement;
    delete[] fSecondElement;
    return 0;
}

RapidMind Program

#include
using namespace rapidmind;

int main()
{
    // Initialize the RapidMind platform
    rapidmind::init();
    // Since the GPU is used, set the backend to GLSL (OpenGL shaders)
    use_backend("glsl");

    // Array is a template class.
    // Value4f means 4 floats per element
    Array<2,Value4f> a(2048,2048);
    Array<2,Value4f> b(2048,2048);

    // This is how we get access to the actual array storage.
    // We can use these pointers to
    // manipulate the internal data on the CPU.
    float* fFirstElement = a.write_data();
    float* fSecondElement = b.write_data();

    // Fill the input arrays
    int nIndex = 0;
    for( int i = 0; i < 2048; i++ )
    {
        for ( int j = 0; j < 2048; j++ )
        {
            for ( int floatnum = 0; floatnum < 4; floatnum++, nIndex++ )
            {
                fFirstElement[nIndex] = float(floatnum);
                fSecondElement[nIndex] = float(floatnum);
            }
        }
    }

    // This array receives the output data. A normal array.
    Array<2,Value4f> output;

    // The stream program that will be executed on the data.
    // This will be executed on the GPU.
    Program prg = RM_BEGIN {
        In<Value4f> a; // First input
        In<Value4f> b; // Second input
        Out<Value4f> c; // Output

        c = (a*a + b*b)*(a*a+b*b); // Data manipulation
    } RM_END;

    // Execute the stream program.
    // The result will be available in the output array.
    output = prg(a, b);
}



Description:


We can see that in the RapidMind program the innermost for loop is gone. The arithmetic on all 4 floats of an element is written in one line. This is the advantage of stream computing: you can process more than one data item at a time.
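
To make the idea of one line acting on four floats concrete, here is a minimal CPU-only sketch in plain C++. Float4 and square_of_sum_of_squares are hypothetical names invented for this illustration; this is not the RapidMind Value4f type and it runs sequentially, it only mimics the element-wise notation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a 4-float value type; not the RapidMind Value4f.
struct Float4
{
    std::array<float, 4> v;
};

// Element-wise addition of two 4-float values.
inline Float4 operator+(const Float4& x, const Float4& y)
{
    Float4 r{};
    for (std::size_t i = 0; i < 4; ++i) r.v[i] = x.v[i] + y.v[i];
    return r;
}

// Element-wise multiplication of two 4-float values.
inline Float4 operator*(const Float4& x, const Float4& y)
{
    Float4 r{};
    for (std::size_t i = 0; i < 4; ++i) r.v[i] = x.v[i] * y.v[i];
    return r;
}

// Same arithmetic as the kernel body, written once for all four lanes.
inline Float4 square_of_sum_of_squares(const Float4& a, const Float4& b)
{
    const Float4 s = a * a + b * b;
    return s * s;
}
```

With both inputs set to (0, 1, 2, 3), as in the programs above, each lane i holds (2*i*i) squared after the call.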


Important:

When we measure the performance of the above program, we find that the CPU version is faster. This will not be the case when much more processing is done per element. The CPU performs better for programs that do very little processing on the data, because of CPU caching and memory speed. But if we have a large chunk of data and need to do a lot of processing on it, RapidMind is the better option. An x86 version of RapidMind is also expected; if it becomes available, it may remove this problem as well.
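
A rough back-of-the-envelope calculation shows how little work this example does per byte of data. The sketch below is only an estimate under stated assumptions: it counts the arithmetic exactly as the expression is written (5 multiplications and 2 additions per float), it assumes two input arrays are read and one output array is written, and all names in it are hypothetical. It does not measure anything.

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(float) == 4, "the estimate assumes 4-byte floats");

// Dimensions taken from the example programs above.
constexpr std::size_t kWidth = 2048;
constexpr std::size_t kHeight = 2048;
constexpr std::size_t kFloatsPerElement = 4;

// Assumption: operations counted as written, without compiler optimization.
constexpr std::size_t kOpsPerFloat = 7;

constexpr std::size_t floats_per_array()
{
    return kWidth * kHeight * kFloatsPerElement;
}

constexpr std::size_t bytes_per_array()
{
    return floats_per_array() * sizeof(float);
}

// Assumption: two inputs are read and one output is written.
constexpr std::size_t bytes_moved()
{
    return 3 * bytes_per_array();
}

// Arithmetic operations performed per byte of data touched.
constexpr double ops_per_byte()
{
    return static_cast<double>(kOpsPerFloat * floats_per_array())
         / static_cast<double>(bytes_moved());
}
```

Each array is 64 MiB, and the ratio comes out below one operation per byte, which fits the observation that moving the data dominates when so little is computed on it.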

Saturday, October 27, 2007

Performance analysis C++ vs C#


Description

C++ or C#, which is the better language?
This question has a clear answer from a performance point of view: in the performance area, C++ comes out ahead of C#.

Let us take a small example.

Here I perform some matrix operations in C++ and in C#. Both programs execute the same algorithm.

C++ Program

#include <windows.h> // GetTickCount
#include <iostream>

int main()
{
    int nRetCode = 0;
    const int nSize = 500;
    int* nMatrix1 = new int[nSize*nSize];
    int* nMatrix2 = new int[nSize*nSize];
    int* nMultipliedMatrix = new int[nSize*nSize];

    // Fill the input matrices
    for (int i = 0; i < nSize; i++)
    {
        for (int j = 0; j < nSize; j++)
        {
            nMatrix1[i*nSize+j] = i;
            nMatrix2[i*nSize+j] = j;
        }
    }

    // Time the triple loop nLoopCount times
    int nElapsed = 0;
    int nLoopCount = 5;
    for( int nVal = 0; nVal < nLoopCount; nVal++ )
    {
        int nStart = GetTickCount();
        for (int i = 0; i < nSize; i++)
        {
            for (int j = 0; j < nSize; j++)
            {
                for (int k = 0; k < nSize; k++)
                {
                    nMultipliedMatrix[i*nSize+j] = nMatrix1[i*nSize+k] + nMatrix2[k*nSize+j];
                }
            }
        }
        nElapsed += GetTickCount() - nStart;
    }
    delete[] nMatrix1;
    delete[] nMatrix2;
    delete[] nMultipliedMatrix;

    // Print the average time in milliseconds
    std::cout << nElapsed / nLoopCount;
    return nRetCode;
}

C# Program
using System;

class Program
{
    public const int nSize = 500;
    static void Main(string[] args)
    {
        int[] nMatrix1 = new int[nSize * nSize];
        int[] nMatrix2 = new int[nSize * nSize];
        int[] nMultipliedMatrix = new int[nSize * nSize];

        // Fill the input matrices
        for (int i = 0; i < nSize; i++)
        {
            for (int j = 0; j < nSize; j++)
            {
                nMatrix1[i*nSize + j] = i;
                nMatrix2[i*nSize + j] = j;
            }
        }

        // Time the triple loop nLoopCount times
        int nLoopCount = 5;
        int nElapsed = 0;
        for (int nVal = 0; nVal < nLoopCount; nVal++)
        {
            int nStart = Environment.TickCount;
            for (int i = 0; i < nSize; i++)
            {
                for (int j = 0; j < nSize; j++)
                {
                    for (int k = 0; k < nSize; k++)
                    {
                        nMultipliedMatrix[i * nSize + j] = nMatrix1[i * nSize + k] + nMatrix2[k * nSize + j];
                    }
                }
            }
            nElapsed += Environment.TickCount - nStart;
        }

        // Print the average time in milliseconds
        Console.WriteLine(nElapsed/nLoopCount);
    }
}


The above programs perform a basic matrix operation. Comparing the same algorithm implemented in C++ and in C#, C++ clearly performs better.
When the programs were run on an Intel Pentium 4 3.2 GHz machine with 1 GB RAM, the times were as follows.

C++ code (average of 5 executions) = 785 ms.
C# code (average of 5 executions) = 1465 ms.

So C++ clearly outperforms C# here: the C# version took roughly 1.9 times as long. Even for a basic algorithm that uses none of the OOP features such as overloading or runtime polymorphism, the program suffers this much performance loss.
C# has many other advantages, such as maintainability and understandability, but all of these come at the cost of performance.
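
For readers who want to repeat the C++ measurement on a platform without GetTickCount, the timing can be sketched with std::chrono from the C++11 standard library. This is a portable variant written under my own assumptions, not the code that produced the numbers above; run_matrix_loop and its parameters are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Runs the same triple loop as the post on a size x size matrix.
// Returns the result matrix; elapsed_ms receives the time of the loop only.
std::vector<int> run_matrix_loop(int size, double& elapsed_ms)
{
    const std::size_t total = static_cast<std::size_t>(size) * size;
    std::vector<int> m1(total), m2(total), out(total);

    // Fill the inputs exactly as in the post.
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
        {
            m1[i * size + j] = i;
            m2[i * size + j] = j;
        }

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
            for (int k = 0; k < size; k++)
                out[i * size + j] = m1[i * size + k] + m2[k * size + j];
    const auto stop = std::chrono::steady_clock::now();

    elapsed_ms = std::chrono::duration<double, std::milli>(stop - start).count();
    return out;
}
```

Calling run_matrix_loop(500, ms) five times and averaging ms mirrors the procedure above. Reading a value from the returned matrix afterwards keeps an optimizing compiler from discarding the loop.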

SIMD - Intel SSE( Streaming SIMD Extension )

Intel SSE - Streaming SIMD Extension

What is SSE?

SSE is an instruction set with four generations: SSE, SSE2, SSE3 and SSE4. These instructions work on 128-bit registers called XMM registers. One register holds four 32-bit floats, so an application can run up to 4 times faster. The instruction set reference can be downloaded directly from the Intel website. SSE also offers two features that help when reading and writing blocks of data in memory.

Prefetching: Prefetching lets you bring data from memory into the cache before it is needed. You can select which cache level the data goes to.

Non-temporal storing: Non-temporal stores let you write data to memory bypassing the cache. This helps you avoid cache pollution.
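
To illustrate what working on a 128-bit XMM register means in practice, the sketch below adds two float arrays four values at a time with the SSE intrinsics _mm_loadu_ps, _mm_add_ps and _mm_storeu_ps. The function name add_floats and the macro SKETCH_HAS_SSE are hypothetical, the count is assumed to be a multiple of 4, and a scalar loop is used when the compiler does not target SSE.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

// Adds two float arrays element by element.
// Assumption: count is a multiple of 4.
void add_floats(const float* a, const float* b, float* out, std::size_t count)
{
#if SKETCH_HAS_SSE
    for (std::size_t i = 0; i < count; i += 4)
    {
        const __m128 va = _mm_loadu_ps(a + i);      // load 4 floats (unaligned)
        const __m128 vb = _mm_loadu_ps(b + i);      // load 4 floats (unaligned)
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); // 4 additions in one instruction
    }
#else
    // Scalar fallback for targets without SSE.
    for (std::size_t i = 0; i < count; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

The loop body runs count / 4 times instead of count times, which is where the theoretical factor of 4 comes from; the real gain depends on memory bandwidth and on the rest of the program.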

Example:

An example is given below that copies a block of memory. The ordinary memcpy function copies data in 4-byte blocks in the best case, but with SSE instructions we can copy 16 bytes at a time. A normal memory copy also pollutes the cache, whereas the SSE instructions can bypass it. The following code is just a stub.

Normal memory copy:

mov ecx, count
shr ecx, 2 // copying 4 bytes at a time
mov esi, source
mov edi, destination
rep movsd // moves the data from source to destination; ecx holds the number of 4-byte moves

SSE memory copy:

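mov ecx, count
shr ecx, 4 // copying 16 bytes at a time
mov esi, source
prefetchnta [esi] // prefetching the data into the cache
mov edi, destination
cmp ecx, 0
jz END

NEXT:
movdqa xmm0, [esi] // reading the data from memory, expecting the source to be 16-byte aligned
movntdq [edi], xmm0 // writing data directly to memory bypassing the cache, expecting the destination to be 16-byte aligned
add esi, 16 // advance the source pointer
add edi, 16 // advance the destination pointer
dec ecx // one 16-byte block done
jnz NEXT
sfence // make the non-temporal stores visible before continuing

END: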


I have written a small code block that copies memory with and without SSE. The code with SSE runs much faster than the one without SSE.
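
The SSE copy loop can also be written with compiler intrinsics instead of assembly. The following is a minimal sketch, not the code block mentioned above: copy_aligned_16 and SKETCH_HAS_SSE2 are hypothetical names, both pointers are assumed to be 16-byte aligned, the size is assumed to be a multiple of 16, and memcpy is used when the compiler does not target SSE2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

// Copies bytes from src to dst in 16-byte blocks.
// Assumptions: dst and src are 16-byte aligned, bytes is a multiple of 16.
void copy_aligned_16(void* dst, const void* src, std::size_t bytes)
{
#if SKETCH_HAS_SSE2
    const __m128i* s = static_cast<const __m128i*>(src);
    __m128i* d = static_cast<__m128i*>(dst);

    // Hint that the start of the source will be read soon (prefetchnta).
    _mm_prefetch(reinterpret_cast<const char*>(src), _MM_HINT_NTA);

    for (std::size_t i = 0; i < bytes / 16; ++i)
    {
        const __m128i block = _mm_load_si128(s + i); // aligned 16-byte load (movdqa)
        _mm_stream_si128(d + i, block);              // non-temporal store (movntdq)
    }
    _mm_sfence(); // order the non-temporal stores before later memory accesses
#else
    // Fallback for targets without SSE2.
    std::memcpy(dst, src, bytes);
#endif
}
```

Buffers declared with alignas(16) satisfy the alignment assumption. For small copies the ordinary memcpy is usually the better choice, since non-temporal stores pay off mainly when the destination is not read again soon.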

So if a program can be data-parallelized, SSE instructions can improve its performance considerably.