How does Embedded C++ differ from ISO C++?
Embedded C++ (EC++) is a subset of ISO C++, so a program written in EC++ can be compiled with any ISO C++ compiler. EC++ is formed by omitting the following features of ISO C++.
1. Multiple Inheritance
2. Virtual base classes
3. Run Time Type Information
4. New style casts
5. The mutable type qualifier
6. Namespaces
7. Exceptions
8. Templates
Now let us look at the reasons given for omitting these features, one by one.
1. Multiple Inheritance
The stated reason for avoiding multiple inheritance is quite simple: it is complex, and even an expert programmer finds it hard to design a class that uses it well. But there are many superb interface hierarchies which we can form using multiple inheritance. The right way to avoid complexity is to set rules instead of taking a feature out. A guideline that allows some kinds of hierarchy and disallows others is perfectly enough to avoid the confusion that multiple inheritance can cause.
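As an illustration of the kind of hierarchy such a guideline could allow, here is a minimal sketch in which a class inherits from two pure interfaces. The names Readable, Writable, LoopbackPort and send_greeting are hypothetical and chosen only for this example.

```cpp
#include <string>

// Hypothetical interfaces: pure abstract classes with no data members.
class Readable {
public:
    virtual ~Readable() {}
    virtual std::string read() = 0;
};

class Writable {
public:
    virtual ~Writable() {}
    virtual void write(const std::string& text) = 0;
};

// A guideline might allow inheriting from several interfaces like this,
// while disallowing multiple inheritance from classes that carry state.
class LoopbackPort : public Readable, public Writable {
    std::string buffer_;
public:
    // Returns everything written so far and empties the buffer.
    std::string read() {
        std::string out = buffer_;
        buffer_.clear();
        return out;
    }
    void write(const std::string& text) { buffer_ += text; }
};

// Code that needs only one capability depends on just that interface.
void send_greeting(Writable& out) { out.write("hello"); }
```

A LoopbackPort can be passed to send_greeting as a Writable and later read back through a Readable reference, without either caller knowing about the other interface.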
2. Virtual base classes
If there is no multiple inheritance, why would we need virtual base classes? Or, put the other way round, to get rid of virtual base classes we have to omit multiple inheritance :-)
3. Run Time Type Information
Run Time Type Information (RTTI) can add program size overhead because type information has to be stored for the polymorphic classes. It pays off only in a program that is heavily polymorphic and adds nothing to a program that is not. But most compilers, GCC included (-fno-rtti), have a flag to enable or disable RTTI. Any novice programmer can switch it off, so why omit the feature from the language when a guideline not to use RTTI would do?
4. New style casts
Since EC++ omits RTTI, a checked dynamic_cast cannot work. The other new-style casts, such as static_cast, do not depend on RTTI and would remain perfectly valid, yet EC++ does not support them either. This omission is either useless or an artifact of disabling RTTI.
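The difference between the two casts can be sketched as follows. Shape, Triangle, Square and the two helper functions are hypothetical names for this example; the block is compiled as ordinary ISO C++ with RTTI enabled.

```cpp
struct Shape {
    virtual ~Shape() {}
    virtual int sides() const = 0;
};
struct Triangle : Shape { int sides() const { return 3; } };
struct Square : Shape { int sides() const { return 4; } };

// static_cast is resolved entirely at compile time and needs no type
// information at run time. The caller must already know that `shape`
// really refers to a Square.
Square& as_square_unchecked(Shape& shape) {
    return static_cast<Square&>(shape);
}

// dynamic_cast inspects the dynamic type at run time, which is why it
// depends on RTTI. It yields a null pointer if `shape` is not a Square.
Square* as_square_checked(Shape* shape) {
    return dynamic_cast<Square*>(shape);
}
```

With RTTI switched off (for example with GCC's -fno-rtti), the checked version is typically rejected by the compiler, while the static_cast version is unaffected.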
5. The mutable type qualifier
Rather than taking the feature out, there could simply be a guideline not to use mutable when the object is declared const. One of the best examples of the power of guidelines is goto: there are hardly any goto statements even in programs written by novices. Omitting mutable adds no advantage to a language for embedded systems.
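For readers who have not met the keyword, here is a minimal sketch of what mutable allows. Sensor and its members are hypothetical names used only for illustration.

```cpp
class Sensor {
    int raw_;
    mutable int read_count_;  // bookkeeping only, not part of the logical state
public:
    explicit Sensor(int raw) : raw_(raw), read_count_(0) {}

    // A const member function may still update a mutable member.
    int value() const {
        ++read_count_;
        return raw_;
    }
    int read_count() const { return read_count_; }
};
```

A const Sensor changes its read_count_ every time value() is called, so such an object cannot be treated as truly read-only data. That is exactly the kind of combination a guideline could rule out without removing the keyword.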
6. Namespaces
This is one of the weirdest omissions from the C++ language. The reason given for omitting namespaces is pretty simple: “the EC++ committee thinks it is not essential to have namespaces”.
7. Exceptions
There are two major reasons given for avoiding exception handling in EC++.
1. It is difficult to estimate the time between the point where an exception is thrown and the point where control passes to the corresponding exception handler.
2. It is difficult to estimate the memory consumption of exception handling.
Even in a real-time environment, how much does the time between throwing and catching an exception really matter? Even though the timing of exception handling is unpredictable, handling the exception is always better than letting the program crash or writing your own exception handling mechanism. And if exception handling must not be used in a particular program or environment, that should be a guideline rather than a frozen-out language feature.
8. Templates
The disadvantage of templates is that they can introduce code bloat and so increase program size. But templates make development and code maintenance far better. Code bloat can be kept in check by template specialization or by limiting each class template to the types that are really needed. Taking templates out altogether is not the proper way to prevent their careless use.
Conclusion
It is always better to make guidelines than to make a new language out of a subset of an existing one. Stroustrup's comments on Embedded C++ can be found in Stroustrup's FAQ.
Wednesday, May 14, 2008
Wednesday, April 23, 2008
Virtual destructors (What? When? Why?)
There are some questions about virtual destructors that I am asked very frequently. The answers to all of them are related, so I am making a common note on them.
• What is the use of a virtual destructor?
• When should the base class destructor be made virtual?
• Is there any overhead for a virtual destructor?
• Why don’t we make all destructors virtual?
• Will the delete operator be overridden?
• Will the delete[] operator be overridden?
What is the use of a virtual destructor?
The virtual destructor comes into play when the class is a base class. If a derived class object is deleted through a base class pointer and only the base class destructor is called, the derived part of the object is never destroyed properly. To put it simply: if the base class destructor is not virtual and we delete a derived class object through a base class pointer, the derived class destructor will not be called (formally, the behavior is undefined). Destruction proceeds from derived to base, so when the derived class destructor is called it goes on to call the base class destructor and the object is destroyed properly. In short, making the base class destructor virtual allows derived classes to override it.
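The destruction order with a virtual destructor can be observed with a small sketch. Base, Derived, destruction_log and destroy_through_base_pointer are hypothetical names; the log exists only to make the order visible.

```cpp
#include <string>
#include <vector>

// Records the order in which destructors run (helper for this sketch only).
std::vector<std::string>& destruction_log() {
    static std::vector<std::string> log;
    return log;
}

class Base {
public:
    virtual ~Base() { destruction_log().push_back("Base"); }
};

class Derived : public Base {
public:
    ~Derived() { destruction_log().push_back("Derived"); }
};

void destroy_through_base_pointer() {
    Base* p = new Derived;
    delete p;  // virtual destructor: ~Derived() runs first, then ~Base()
}
```

After one call to destroy_through_base_pointer() the log holds "Derived" followed by "Base". The non-virtual case is deliberately not shown as running code, because deleting a Derived through a Base pointer without a virtual destructor is undefined behavior.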
When should the base class destructor be made virtual?
Make the destructor of a class virtual if the class has any virtual function. This rule is safe for two reasons.
1. Most real-world base classes have at least one virtual function.
2. In real-world problems, if the base class has no virtual function there is no specific advantage in using a base class pointer.
Is there any overhead for a virtual destructor?
The per-object space cost is nothing, because we only make the destructor virtual when the base class already has at least one virtual method. The per-object space cost is paid when the first virtual method is added to the class, which makes it free to make the destructor virtual as well.
Why don’t we make all destructors virtual?
If the class has no virtual function and you make the destructor virtual, the per-object space cost increases for no added advantage. It is safe not to make the destructor virtual if your class
1. is not a base class
2. or does not have any virtual function
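The space argument in the last two answers can be checked with sizeof. This is a sketch that assumes the usual implementation, in which each object of a class with virtual functions carries one hidden table pointer; the standard does not mandate these sizes. All type and function names are hypothetical.

```cpp
// No virtual functions: the object holds only its data.
struct PlainPoint { int x; int y; };

// The virtual destructor is the first virtual function, so it adds the
// hidden table pointer to every object.
struct PointWithVirtualDtor {
    int x;
    int y;
    virtual ~PointWithVirtualDtor() {}
};

// Already has a virtual function, so the pointer is already paid for.
struct Shape {
    int id;
    virtual void draw() {}
};

// Adding a virtual destructor on top of that adds only a table entry,
// not any per-object space.
struct ShapeWithVirtualDtor {
    int id;
    virtual void draw() {}
    virtual ~ShapeWithVirtualDtor() {}
};

bool first_virtual_costs_space() {
    return sizeof(PointWithVirtualDtor) > sizeof(PlainPoint);
}

bool extra_virtual_is_free() {
    return sizeof(ShapeWithVirtualDtor) == sizeof(Shape);
}
```

On mainstream compilers both functions return true, which matches the rule of thumb above.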
Will the delete operator be overridden?
Yes. The standard states that when the destructor is virtual, deleting through a base class pointer uses the derived class's operator delete to release the derived class object.
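A minimal sketch of this behavior, using class-specific operator new and operator delete. The names last_deleter, allocate_or_throw and delete_through_base are hypothetical and exist only to make the selected operator visible.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <string>

// Remembers which class's operator delete ran last (bookkeeping for this sketch).
std::string last_deleter;

// Shared allocation helper so both classes honor the operator new contract.
void* allocate_or_throw(std::size_t size) {
    void* p = std::malloc(size);
    if (p == 0) throw std::bad_alloc();
    return p;
}

class Base {
public:
    virtual ~Base() {}
    static void* operator new(std::size_t size) { return allocate_or_throw(size); }
    static void operator delete(void* p) {
        last_deleter = "Base";
        std::free(p);
    }
};

class Derived : public Base {
public:
    static void* operator new(std::size_t size) { return allocate_or_throw(size); }
    static void operator delete(void* p) {
        last_deleter = "Derived";
        std::free(p);
    }
};

// The static type here is always Base*, yet the operator delete that runs
// is chosen by the dynamic type because the destructor is virtual.
void delete_through_base(Base* p) { delete p; }
```

Calling delete_through_base(new Derived) leaves last_deleter set to "Derived", while delete_through_base(new Base) sets it to "Base".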
Will the delete[] operator be overridden?
No. operator delete[] (used for deleting an array of objects) is not dispatched in this way; the operator corresponding to the static type of the pointer is used. The standard goes further: deleting an array of derived class objects through a base class pointer has undefined behavior.
For a better understanding of the delete and delete[] operators with virtual destructors, please read the post Virtual Destructor delete and delete[].
Friday, April 11, 2008
The development of NULL - A history
Introduction
The null pointer has changed and advanced throughout the development of C and C++. Every implementation has had its own drawbacks, and now a perfect design is about to arrive in C++0x. Let us analyze the NULL pointer in C, C++ and C++0x.
K&R style (C style)
In C, NULL is a macro for a null pointer constant: a constant expression that evaluates to either 0 or ((void*) 0). There are machines which use different internal representations for pointers to different types. Even in such cases the standard guarantees that a 0 cast to void* is assignable to every pointer type. For example, if you assign ((void*) 0) to a pointer of type FILE*, it is guaranteed to be initialized to a null pointer without any error.
C++ style (C++98)
A null pointer constant is an integral constant expression rvalue of integer type that evaluates to zero. A null pointer constant can be converted to a pointer type; the result is the null pointer value of that type and is distinguishable from every other value of pointer to object or pointer to function type. So the macro NULL is equivalent to an integer 0, and therefore it is better to avoid the NULL macro and use 0 directly.
C++0x style
In all the standards till now the constant 0 has had a double role: the integer constant 0 and the null pointer. This behavior has existed throughout the development of C and C++. In C, NULL is a macro which expands to either 0 or ((void*) 0). In C++, NULL cannot be ((void*) 0), because void* does not convert implicitly to other pointer types, so it is always an integer constant such as 0. But even using 0 has its own drawback when overloading. For example, suppose we have two declarations,
void foo( char* p );
void foo( int p );
and then call foo(NULL). Where NULL is defined as plain 0, this calls the integer overload of foo, although the programmer would normally intend to call the char* overload with a null pointer.
To get rid of this issue the new standard will include an additional keyword, ‘nullptr’, which is only assignable to and comparable with pointer types. The existing 0 will have to keep its double role for backward compatibility. But sooner or later the C++ committee may deprecate the use of 0 and NULL as null pointers, and eventually do away with this double role.
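The overload problem and its fix can be sketched as follows, assuming a compiler that supports C++11, the standard in which nullptr was eventually published. which_foo is a hypothetical stand-in for foo that returns a label so the selected overload is visible.

```cpp
#include <string>

// Same parameter types as the two foo declarations, but each overload
// reports which one was chosen.
std::string which_foo(char*) { return "char*"; }
std::string which_foo(int) { return "int"; }
```

Here which_foo(0) selects the int overload, because 0 is first of all an integer, while which_foo(nullptr) selects the char* overload, because nullptr converts to pointer types but not to int.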
Thursday, February 21, 2008
A volatile reference and const reference
A volatile reference
void foo( volatile double& bar )
{
    std::cout << bar << std::endl;
}
The above function accepts a volatile reference to double. What happens if you call it like this?
int nVal = 0;
foo( nVal );
The compiler generates an error saying that the conversion from ‘int’ to ‘volatile double &’ is not possible.
Why shouldn’t it convert?
A conversion has to be done when an integer is passed to a function that takes a double. The standard conversion from int to double creates a temporary object through implicit conversion. A function that takes a volatile (non-const) reference parameter can change the value of its argument. Now let us put these two things together. If the temporary object created by the implicit conversion were passed to a function that accepts a volatile reference, what would happen? The function could modify the temporary object, which would not affect the original one. For example,
void foo( volatile double& bar )
{
    bar = 0;
}
int nVal = 10;
foo( nVal );
std::cout << nVal;
What do we expect as the output? Here we intended to set the value of nVal to 0. But what would happen if the above code were allowed to run? An implicit conversion would have to be done from int to double, and the resulting temporary would be passed to foo. So foo would set the value of the temporary object instead of nVal. Now it is clear why a function taking a volatile reference to double does not compile when it is called with an int.
A constant reference
void foo( const double& bar )
{
    std::cout << bar << std::endl;
}
Now what should happen? Function foo accepts a const reference to double, which in turn guarantees that the parameter bar is not updated. In this case it is safe to pass a temporary object, so the compiler allows the following call.
int nVal = 0;
foo( nVal );
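Both binding rules can be checked at compile time. The following is a minimal sketch assuming a C++11 compiler (for static_assert and the type_traits header); reset, read and demo are hypothetical names, and the explicit copy in demo only imitates what the implicit conversion would have produced.

```cpp
#include <type_traits>

// An int argument needs a conversion to double, which produces a temporary.
// A temporary may bind to a const reference...
static_assert(std::is_convertible<int, const double&>::value,
              "an int can be passed where a const double& is expected");
// ...but not to a volatile (non-const) reference.
static_assert(!std::is_convertible<int&, volatile double&>::value,
              "an int cannot be passed where a volatile double& is expected");

void reset(volatile double& bar) { bar = 0; }
double read(const double& bar) { return bar; }

int demo() {
    int nVal = 10;
    double copy = nVal;  // stands in for the temporary made by the conversion
    reset(copy);         // only the copy is set to 0
    // reset(nVal);      // does not compile: int cannot bind to volatile double&
    return nVal;         // the original is untouched
}
```

demo() returns 10, which is the silent surprise the compiler prevents by rejecting the direct call, while read(7) is accepted and returns 7.0.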
Conclusion
If a temporary object could be passed to a function taking a volatile reference, there would be many hard-to-find bugs.
Tuesday, February 12, 2008
Compromising quality for performance
It has always been a big deal to compromise accuracy for performance. In most cases, highly complex and time-consuming applications need the highest accuracy. Accuracy has always been the problem in using the GPU to improve the performance of such algorithms. Since GPUs don’t support double precision arithmetic, it looks hard to achieve high precision with them. The CPU does floating point division using double precision arithmetic, but even the latest GPU from NVIDIA (the 8800) performs division as a reciprocal multiplication in single precision.
There will be situations when you have to deal with very convoluted shapes. In such cases it is hard to settle for single precision floating point accuracy. On the other hand, if the CPU is used it might be impractical to get the algorithm working in real time. In such cases the tough question comes up: can accuracy be compromised for performance?
If the answer is Yes!
If the highest accuracy is not so important, we can of course go for the GPU. Its massive computation power can be used to get the algorithm executing in real time. Allowing a percentage of tolerance in the output becomes an easy way of optimizing your algorithm. In this case you must be sure that the tolerance stays within a range that keeps the algorithm usable.
If it is a Big No!
Here comes the problem. You have an algorithm which cannot execute in real time because the available resources do not have enough computation power. You must settle for single precision arithmetic on the GPU. Now what can we do to improve performance while keeping the highest accuracy? One idea is to use both the CPU and the GPU for the execution. First run all the parallel code on the GPU and calculate the output. Then calculate the tolerance using a CPU version of the code, which gives the highest accuracy. Finally, run a small number of iterations of your algorithm on the CPU to find the best value.
A case study
Suppose you have to do registration between two 3D surfaces. First calculate the parameters needed for registering the two surfaces using the GPU. It may take a lot of iterations to find the rotation, translation, scaling and shearing parameters. Once approximate registration parameters have been found on the GPU, do some iterations on the CPU to reach the best convergence. You get the performance improvement by doing the larger number of iterations on the GPU. The accuracy is also good, since a CPU-based calculation is done at the end, starting from the approximate parameters calculated on the GPU. This strategy can be used in most cases where the highest accuracy is needed.
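The coarse-then-refine strategy can be sketched on a much smaller problem than surface registration. In this hypothetical example a Newton iteration for a square root stands in for the real algorithm, and plain float arithmetic on the CPU stands in for the single precision GPU stage; none of this is GPU code.

```cpp
// Stand-in for the fast, low-precision stage: many cheap iterations in float.
// Assumes a > 0.
float coarse_sqrt(float a, int iterations) {
    float x = a > 1.0f ? a : 1.0f;  // simple starting guess
    for (int i = 0; i < iterations; ++i) {
        x = 0.5f * (x + a / x);     // Newton step in single precision
    }
    return x;
}

// High-precision stage: a few iterations in double, started from the
// approximate answer instead of from scratch.
double refined_sqrt(double a, double start, int iterations) {
    double x = start;
    for (int i = 0; i < iterations; ++i) {
        x = 0.5 * (x + a / x);      // Newton step in double precision
    }
    return x;
}

// Most of the work is done in single precision; double precision is used
// only for the last two steps.
double hybrid_sqrt(double a) {
    const float coarse = coarse_sqrt(static_cast<float>(a), 30);
    return refined_sqrt(a, coarse, 2);
}
```

The design choice is the same as in the case study: the expensive search for a good neighborhood runs on the cheap arithmetic, and the accurate arithmetic is spent only on the final convergence.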
Monday, February 11, 2008
Did Herb Sutter fight Amdahl’s law?
Understanding Amdahl’s law
No matter how much speedup you get for the parallel code, it is impossible to reach a speedup of 2x if 75% of the algorithm cannot be parallelized. Even if you did the other 25% in 0 seconds, the gain would only be about 1.33x. To quantify: if your algorithm takes 1 second and 0.75 second of it is non-improvable, then no matter how much you improve the other 0.25 second, you can’t get the total time scaled down to half.
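The numbers above follow from the usual statement of Amdahl's law, speedup = 1 / ((1 - p) + p / s), where p is the improvable fraction of the run time and s is the factor by which that fraction is sped up. A small sketch, with hypothetical function names:

```cpp
// Overall speedup when a fraction `improvable_fraction` of the run time
// is made `factor` times faster and the rest is left unchanged.
double amdahl_speedup(double improvable_fraction, double factor) {
    return 1.0 / ((1.0 - improvable_fraction) + improvable_fraction / factor);
}

// Upper bound on the speedup as `factor` grows without limit.
double amdahl_limit(double improvable_fraction) {
    return 1.0 / (1.0 - improvable_fraction);
}
```

With an improvable fraction of 0.25, amdahl_limit gives 1 / 0.75, about 1.33, so 2x is out of reach however large the factor becomes.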
Understanding Gustafson’s law
Rather than speedup on a fixed problem, Gustafson takes the amount of work into consideration. If your algorithm takes 1 second to complete and 0.75 second of it is non-improvable, you can still get a 2x speedup if you add a new feature which takes another 1 second, of which 0.75 second is improvable. Of the total 2 seconds, 1 second (0.75+0.25) is then non-improvable and the other half is improvable. Suppose you get infinite improvement, which in turn reduces the improvable 0.25+0.75 second to 0 seconds; the total time of the algorithm goes down to 1 second. It means you got a 2x speedup by adding more work.
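The arithmetic of this example can be checked with a short sketch. It models only the worked numbers in the paragraph above, not Gustafson's formula in its general form, and Workload, combine, run_time and speedup are hypothetical names.

```cpp
struct Workload {
    double fixed_seconds;       // part that cannot be improved
    double improvable_seconds;  // part that parallel execution can improve
};

// Total work after a new feature is added to an existing program.
Workload combine(const Workload& a, const Workload& b) {
    Workload total = { a.fixed_seconds + b.fixed_seconds,
                       a.improvable_seconds + b.improvable_seconds };
    return total;
}

// Run time when the improvable part is made `factor` times faster.
double run_time(const Workload& w, double factor) {
    return w.fixed_seconds + w.improvable_seconds / factor;
}

// Speedup relative to running the same workload with no improvement.
double speedup(const Workload& w, double factor) {
    return run_time(w, 1.0) / run_time(w, factor);
}
```

Combining the original program {0.75, 0.25} with the new feature {0.25, 0.75} gives {1.0, 1.0}, and for a very large factor the speedup of the combined workload approaches 2, while the original program alone stays near 1.33.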
So did Herb fight Amdahl’s law and win?
No. Herb didn’t even try to fight Amdahl’s law. Herb only shows that it is better to take Gustafson’s law into consideration while deciding whether or not to go for parallelization. It is Amdahl’s law which must be used for calculating the speedup that can be achieved by increasing the number of cores. But the important thing is not to be discouraged by the results of an Amdahl’s law calculation. It is much more practical to understand that your application will have more features added in future, and those may be drastically improvable by executing tasks in parallel. It also means thinking in terms of work that grows with the number of cores rather than in terms of fixed-size problems.
Conclusion
Break Amdahl’s law: it is the correct attitude. You must not get stuck just because an Amdahl’s law calculation says the algorithm is not much improvable.
“Herb fought the law—Amdahl's Law, that is—and Herb won-DDJ”- It is not a good title. If someone has ever tried to break the Amdahl’s law it is Gustafson. And even Gustafson named his paper “Reevaluating Amdahl's Law”
Friday, January 18, 2008
Limiting the possible data types given for a template class during instantiation
How to limit the data types that can be used for instantiating a template class?
C++ has no standard mechanism dedicated to limiting the data types that a template can be instantiated with. For example,
template < class T>
class ATemplate
{
};
The above class (ATemplate) can be instantiated with any data type, so ‘T’ can be any type (e.g. basic data types, structures and classes). But what if you want to accept only selected data types, for example a template which only supports char, int and double? C++ doesn’t have a standard method for this, whereas Microsoft has put some effort into .NET generics (something like C++ templates) so that the supported data types can be limited. Even though C++ doesn’t support it directly, it is possible to limit the data types with a small trick. Let us see an example.
#include <iostream>

template < class T >
class MyTemplate
{
    T m_x;
    // One private overload per supported type. None of them does anything;
    // they exist only so that overload resolution fails for other types.
    void AllowThisType( int& obj ){}
    void AllowThisType( char& obj ){}
    void AllowThisType( double& obj ){}
public:
    MyTemplate()
    {
        T tmp;
        AllowThisType( tmp ); // compiles only if T is int, char or double
    }
    void SetX( T val )
    {
        m_x = val;
    }
    void print()
    {
        std::cout << m_x << std::endl;
    }
};
int main()
{
    MyTemplate < int > objint;
    MyTemplate < char > objchar;
    MyTemplate < double > objdouble;
    MyTemplate < float > objfloat; //<< Error when you create this
}
How does the above program limit the data types that can be used for creating a template class object?
The constructor of class MyTemplate calls an overloaded function, AllowThisType. This function has 3 overloads, taking an int reference, a char reference and a double reference. So when you make an object with int, char or double there is an appropriate AllowThisType overload that the compiler can find and match. But when you create an object with any other data type, the compiler cannot find a matching AllowThisType overload and hence the compilation fails. Let us see an example.
MyTemplate < float > objfloat;
When you make an object of MyTemplate with T as float, the compiler looks for an AllowThisType( float& ). Since we did not write such an overload, the compilation fails. So it is clear that the data types which can be given are limited here. Whenever an additional type needs to be supported, a new AllowThisType overload must be added for that data type.
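On compilers that support C++11, which arrived after this post was written, the same restriction can be expressed more directly with static_assert. This is a different technique from the overload trick, shown here only as a minimal sketch; IsAllowedType, CheckedTemplate and GetX are hypothetical names.

```cpp
#include <type_traits>

// True only for the types the template is meant to support.
template <class T>
struct IsAllowedType
    : std::integral_constant<bool,
          std::is_same<T, int>::value ||
          std::is_same<T, char>::value ||
          std::is_same<T, double>::value> {};

template <class T>
class CheckedTemplate {
    // Instantiating the class with any other type stops compilation
    // with this message.
    static_assert(IsAllowedType<T>::value,
                  "CheckedTemplate supports only int, char and double");
    T m_x;
public:
    CheckedTemplate() : m_x() {}
    void SetX(T val) { m_x = val; }
    T GetX() const { return m_x; }
};
```

CheckedTemplate<int> compiles and works as before, while CheckedTemplate<float> is rejected with the custom message. Supporting a new type means adding one more std::is_same term to IsAllowedType.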
Monday, January 14, 2008
Teraflop processors are not far off. Are we ready to use them?
“The prototype 80-core Polaris processor on a single chip delivered the super computer like performance of a trillion floating-point operations per second (one teraflop) while consuming less than 62 watts – Intel.”
It will take less than 10 years from now for the common man to have a PC running on a teraflop processor. The prototype teraflop processor has 80 cores which can execute in parallel. So to get the most out of 80 cores we need to run at least 80 threads in parallel. Hence it’s clear that a program must be heavily threaded to make use of all the cores.
The question that comes up is, “It can deliver up to a teraflop, but how are we going to get the most out of it?”
These are the days when programmers are trying hard to get the most out of a dual core or quad core processor. These processors can give a lot, but it is up to the programmer to make use of it.
Threading for parallelism
Most programmers used threads only to separate the user interface from the time-consuming operations triggered by the user. But those days are gone. Now a thread is not just a way to do one thing without blocking another. It is all about performance. Threading for performance is the key now.
The hard part
Plenty of tools are available to help analyze, debug and optimize threads. It’s not hard to detect a synchronization problem or a thread overrun. It’s easy these days to debug a chunk of code running in different threads. So why are all algorithms not yet threaded? What is the big deal? It is discovering parallelism! The hardest thing in optimization is finding a parallel way to do the most time-consuming part of the algorithm. Almost every time we look at the code of an algorithm, the most time-consuming part is entirely sequential. It looks like something that can never be parallelized. That is where it gets tricky. Only further innovation can get such things parallelized. It’s not about parallelizing the code of the algorithm; it’s about changing the algorithm into a parallel one!
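As a small illustration of changing the algorithm rather than just its code, here is a sketch of a sum. Written sequentially, every addition depends on the previous one. Restated as independent partial sums that are combined at the end, it can use several cores. The function name ParallelSum is hypothetical, and the sketch assumes a compiler with C++11 std::thread support, which is newer than this post.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` using `threadCount` worker threads.
long long ParallelSum( const std::vector<int>& data, unsigned threadCount )
{
    if( threadCount == 0 ) threadCount = 1;

    // One slot per thread, so no two threads write the same element.
    std::vector<long long> partial( threadCount, 0 );
    std::vector<std::thread> workers;
    const std::size_t chunk = ( data.size() + threadCount - 1 ) / threadCount;

    for( unsigned t = 0; t < threadCount; ++t )
    {
        const std::size_t begin = std::min( data.size(), t * chunk );
        const std::size_t end = std::min( data.size(), begin + chunk );
        workers.emplace_back( [&data, &partial, t, begin, end]
        {
            // Each thread sums only its own slice of the input.
            partial[t] = std::accumulate( data.begin() + begin,
                                          data.begin() + end, 0LL );
        } );
    }

    for( std::thread& w : workers ) w.join();

    // The combine step is sequential, but it touches only threadCount values.
    return std::accumulate( partial.begin(), partial.end(), 0LL );
}
```

A sum is an easy case because addition can be regrouped freely. The harder problems are the ones where no such regrouping is obvious and the algorithm itself has to be redesigned.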
Monday, December 10, 2007
What if someone comes up with a compiler capable of parallelizing algorithms?
In many training sessions I have heard a common question from businessmen: "If I train all my employees in parallel computing, can you guarantee me that no one will come up with an automatically parallelizing compiler tomorrow? Won't all the money I spent on the training be wasted?"
Is it a big deal?
This has been a very serious question all this time. Maybe most companies are still thinking that something will happen and performance will improve automatically. But it is quite easy to show that automatic parallelization of an algorithm is not going to work out.
How easy is it to prove?
Maybe someone can come up with a compiler that does some amount of parallelization. Such compilers even exist already! But it definitely cannot parallelize the logic of your algorithm, can it? The core logic cannot be parallelized without a skilled programmer and an algorithm expert. A compiler can do some amount of logic-less parallelization, such as splitting a loop whose iterations are independent, which is not going to increase the performance drastically. For a drastic improvement almost all algorithm implementations need to be rewritten in a parallel way. When those algorithms were first implemented there was no parallel computing on a personal computer; only supercomputers had parallel processors. Now times have changed. Most desktop users can afford a multi-core machine, and many have a high-end graphics card inside as well. So if you get skilled in parallel programming, it is not going to be a waste. No compiler can be as intelligent as you at finding parallelism. Only a skilled human can do the parallel programming. Get yourself trained or leave your job to others!
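The difference between the two kinds of work can be seen in two small loops. This is an illustrative sketch only: the function names are hypothetical, and whether a particular compiler actually parallelizes or vectorizes the first loop depends on the compiler and its settings.

```cpp
#include <cstddef>
#include <vector>

// Every iteration is independent of the others, so the iterations can
// run in any order or at the same time. This is the kind of loop a
// compiler has a chance of parallelizing on its own.
std::vector<float> ScaleAll( const std::vector<float>& in, float factor )
{
    std::vector<float> out( in.size() );
    for( std::size_t i = 0; i < in.size(); ++i )
    {
        out[i] = in[i] * factor;
    }
    return out;
}

// Iteration i needs the total produced by iteration i - 1. As written,
// the loop is sequential. Making it parallel means restating the
// algorithm, which is a job for the programmer.
std::vector<int> RunningTotal( const std::vector<int>& in )
{
    std::vector<int> out( in.size() );
    int total = 0;
    for( std::size_t i = 0; i < in.size(); ++i )
    {
        total += in[i];
        out[i] = total;
    }
    return out;
}
```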
Wednesday, December 5, 2007
Think parallel, save your product
I always wonder why we all speak about performance. We use a lot of jargon: performance, optimization, throughput, scalability, parallel thinking, hybrid computing, threads, locks, semaphores, synchronization, penalty, data parallelism, instruction parallelism and much more.
Why are we aiming at performance? What happened to the software industry in such a short span? Instead of just making more and more new software, why is everyone thinking about increasing the performance of the existing ones?
What is the motivation?
Motivation number 1... Customer satisfaction...
I never want to know how complex the algorithm is, I want to get it done in a fraction of a second...
I paid $10000 for your product and now there are 100 other products which take a second to do what yours gets done in a day... Did you cheat me?
My system has a quad core processor and you take 100 seconds to get things done. When I look at the CPU usage it is just 25%. What are you doing inside your program? Why did I spend all my money on a quad core machine?
Motivation number 2... A stitch in time saves nine...
What if your software takes an hour to detect bleeding inside the brain? The patient may die before you detect the problem.
What if it takes an hour to diagnose your vehicle's electrical problem? Don't you have a tight schedule?
Motivation number 3... There is no point in watering a dead plant...
What is the use if you detect a tsunami after it has hit the shores?
What is the use if you predict tomorrow's weather 2 days later?
What are you going to achieve if you detect a fire in a chamber and the alarm gets triggered 10 seconds late?
I conclude it like this.
So it is sure that time matters. If your competitor does something much faster than you, what happens? The answer is clear. You can’t compete anymore.
Parallel Extension to .NET Framework
Something for .NET programmers to hope for!
Microsoft has come out with a parallel extension to the .NET Framework (managed code). This may be a revolution in making high performance programs using .NET. But I wonder how many HPC applications can be done using .NET, given its lack of speed compared to a C/C++ program. I have written another article about the performance difference between C++ and C# at http://amalp.blogspot.com/2007/10/performance-analysis-c-vs-c.html. To make use of all the cores in a multicore environment we definitely need threading, so using the parallel extensions can make a program run considerably faster on a system with more than one core. Even with the parallel extensions, managed code is unlikely to run as fast as unmanaged code. So for performance either C or C++ is the best. For more information about the parallel extensions visit the blog http://blogs.msdn.com/somasegar/archive/2007/11/29/parallel-extensions-to-the-net-fx-ctp.aspx
Does C# still lag in performance?
C# increases productivity by compromising performance. But if someone else makes the same application in C++ and it takes half the time to run, what is the use of that productivity? Maybe you can start selling earlier and stop selling earlier.
Monday, December 3, 2007
High performance computing
Why?
Being greedy, we human beings never settle for what we have got. We always look for more. That is exactly what is happening in the high performance computing industry. Long ago we had very slow processors which took seconds to sort a small chunk of data. Now we want to predict the climate of each and every location in the world within seconds. There are lots of medical imaging algorithms waiting on the shore for more computation power. There are lots of generic algorithms which would solve many real-world problems but need far more computational power. That is why the high performance computing industry is so hot.
The business
The first name that comes up when looking into high performance computing is Intel. They are coming up with processors with more and more cores. Then there is AMD, with both CPUs and GPUs (ATI is now AMD's). But the leader in GPUs is still NVIDIA, whose latest GPU, the GeForce 8800 Ultra, has 128 stream processors. There is a brand new architecture from IBM called Cell, and Intel is going to release Larrabee next year.
On the software side, Intel has its own compiler, which compiles high-level code to machine code highly optimized for the CPU. NVIDIA has CUDA for running general-purpose programs on the GPU. There are DirectX and HLSL from Microsoft for the GPU, Cg and CgFX from NVIDIA, and OpenGL from the ARB, and RapidMind has its own stream programming libraries. And we no longer hear the brand name PeakStream, because Google bought it.
The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software - Herb Sutter
Until now programmers didn’t need to think much about performance optimization. The hardware vendors kept improving their hardware, so performance improved with no change to the software. But that has reached its limit. The clock speed can’t be increased much further; the power can’t be increased because of heat dissipation; physics is catching up. The free lunch is over. Now it is multicore. It is parallel thinking that can improve performance.
Think parallel or perish - Intel
Intel says either think parallel or perish. If you keep on writing single-threaded serial code, your software will become outdated. Do you think anyone is going to buy your software when another product can do the same thing in one tenth of the time? Only parallelism can improve the performance now.
The GPGPU
What is this extra GP in front of GPU? Looks weird? It stands for general purpose, and it is a reality. You can write heavily multithreaded applications in which the GPU does general-purpose tasks instead of the usual graphics tasks. CUDA from NVIDIA is the best way to write a GPGPU program.
Conclusion
If your algorithm is not parallel; if your application is still running on a single thread; if you keep on thinking someone else will speed up your algorithm; you will perish. Your algorithm will cease to exist. Better late than never!
Saturday, December 1, 2007
CUDA - Compute Unified Device Architecture
Compute Unified Device Architecture (CUDA) is an easy way to use the GPU for general-purpose programming. No graphics knowledge is required to write a GPU program with CUDA. A CUDA program is almost the same as a C program but has some additional features. In CUDA a function can be run by many threads by giving an execution configuration when calling it. There are 3 kinds of functions in CUDA: a device function, which can only be called from and executed on the device (GPU); a global function, which is called from the host (CPU) with an execution configuration and executes on the device; and a host function, which executes on the CPU only.
There are some additional specifiers to distinguish the function type.
__device__ - if a function is declared with the __device__ qualifier it becomes a device function, which executes on the device and can only be called from a function that executes on the device.
__host__ - these functions are the normal C functions that execute on the host (CPU). A function with no qualifier is a host function by default.
__global__ - a function declared with __global__ can be called from the CPU and executes on the device. When calling it, the execution configuration must be given. The execution configuration decides how many blocks, and how many threads per block, are created to execute the function.
Also there are different kinds of memory,
__shared__ - a variable declared with this qualifier lives in shared memory and is shared by the threads of the same block. Being on-chip, it is much faster than global memory.
__constant__ - memory that the device can only read; it is written from the host.
__device__ - memory in the device's global memory. The device can read and write it directly, and the host can access it through the CUDA runtime.
Each thread has a thread ID and a block ID, from which it works out which part of the data it has to process. This is the tricky area where all the performance improvement lies.
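To make the thread ID and block ID idea concrete, here is a plain C++ emulation of the usual index calculation. It is not CUDA code, and the function names are hypothetical. In a real kernel the three values come from the built-in variables blockIdx.x, blockDim.x and threadIdx.x, and the two loops below are replaced by the hardware running the threads in parallel.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the body of a __global__ function:
// one logical thread processes one element.
void SquareElement( std::vector<float>& data, unsigned blockId,
                    unsigned threadsPerBlock, unsigned threadId )
{
    // Global element index = block ID * threads per block + thread ID.
    const std::size_t index =
        static_cast<std::size_t>( blockId ) * threadsPerBlock + threadId;

    // The last block may have more threads than remaining elements.
    if( index < data.size() )
    {
        data[index] *= data[index];
    }
}

// Stand-in for a launch with an execution configuration of
// (blocks, threadsPerBlock). Here the "threads" simply run one by one.
void LaunchSquare( std::vector<float>& data, unsigned threadsPerBlock )
{
    const unsigned blocks = static_cast<unsigned>(
        ( data.size() + threadsPerBlock - 1 ) / threadsPerBlock );

    for( unsigned blockId = 0; blockId < blocks; ++blockId )
    {
        for( unsigned threadId = 0; threadId < threadsPerBlock; ++threadId )
        {
            SquareElement( data, blockId, threadsPerBlock, threadId );
        }
    }
}
```

With 5 elements and 2 threads per block, 3 blocks are created and the sixth logical thread does nothing, which is why the bounds check is needed.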
Thursday, November 1, 2007
Singleton design pattern
What is a singleton?
It means nothing more than what the word says. From a C++ point of view, a singleton is a class that can have only one object. So,
A singleton class is a class that will have only one instance at any time.
When to go for singleton?
You should use the singleton pattern if your class must have only one instance. For example,
* If your class logs to a console, you may need all the information to go to the same console. Then go for a singleton and allow only one instance of the class to be created.
* If your class uses a device which must be created/initialized only once, and multiple objects should not access it, go for a singleton.
* If you want to use the same instance of a class from more than one place, you can definitely think about using a singleton.
How to make a class singleton?
There are different ways of making a singleton class, but some things are common to all of them. They are,
* Make the constructor of the class private
* Make the copy constructor of the class private
* Declare the assignment operator (=) as private
* Write a static member function which returns the static object created inside class/function.
Implementation
#include <iostream>
using namespace std;
class SingleTon
{
static SingleTon m_OneAndOnlyObject;
int m_nValue;
private:
// Constructor is declared as private
SingleTon( int nVal ):m_nValue( nVal ){;}
// Copy constructor is declared as private
SingleTon( const SingleTon& );
// = operator is declared as private
void operator = ( SingleTon& );
public:
static SingleTon& getInstance()
{
// The same instance will be always returned
return m_OneAndOnlyObject;
}
void SetValue( int nVal ){ m_nValue = nVal; }
int GetValue( void ){ return m_nValue; }
};
// Initializing the static object
SingleTon SingleTon::m_OneAndOnlyObject( 10 );
int main(int argc, char* argv[])
{
// Getting a reference
SingleTon& Obj1 = SingleTon::getInstance();
// Printing the value using Obj1-
// and setting the value using Obj1
cout << "Obj1.m_nValue: " << Obj1.GetValue() << endl;
Obj1.SetValue( 20 );
// Making second reference and printing the value
SingleTon& Obj2 = SingleTon::getInstance();
cout << "Obj2.m_nValue: " << Obj2.GetValue() << endl;
// Changing the value of m_nValue using Obj2-
// and Printing using Obj1
Obj2.SetValue( 30 );
cout << "Obj1.m_nValue: " << Obj1.GetValue() << endl;
return 0;
}
In the above program we have created 2 references, not 2 objects. If we set the value of m_nValue through either reference, the change is seen through the other as well, because there is only one object.
Variants:
There are different methods for making a singleton class
1. By creating static object inside the getInstance function.
Code:
static SingleTon& getInstance()
{
// The same instance will be always returned
static SingleTon OneAndOnlyObject(10);
return OneAndOnlyObject;
}
2. By creating the object on the heap.
Code:
static SingleTon* m_OneAndOnlyObjectHeap; // Change the static object to a pointer
static SingleTon* getInstance()
{
// The same instance will be always returned
if( !m_OneAndOnlyObjectHeap )
{
m_OneAndOnlyObjectHeap = new SingleTon( 10 );
}
return m_OneAndOnlyObjectHeap;
}
SingleTon* SingleTon::m_OneAndOnlyObjectHeap( 0 );//Static member initialization
The variants differ in creation of object. Everything else is the same.
Wednesday, October 31, 2007
What is a protected abstract virtual base pure virtual private destructor?
A protected abstract virtual base pure virtual private destructor.
This is one of those funny questions that is very rarely answered. It may be a very long phrase, but the code needed to make a protected abstract virtual base pure virtual private destructor is quite simple.
The code below makes a protected abstract virtual base pure virtual private destructor.
Program:
class BaseClass // An abstract class
{
public:
virtual void MakeAbstract() = 0;
};
// A class derived from a protected abstract virtual base
class AbstractBase : virtual protected BaseClass
{
private:
void MakeAbstract(){;}
// A pure virtual private destructor
virtual ~AbstractBase() = 0;
friend class Derived;
};
AbstractBase::~AbstractBase()
{
}
class Derived : protected AbstractBase
{
};
int main(int argc, char* argv[])
{
// You can definitely make an object of class Derived
Derived obj;
return 0;
}
Explanation:
In the above program AbstractBase::~AbstractBase() can be called a protected abstract virtual base pure virtual private destructor. Let us see why.
1. The class AbstractBase is derived as "virtual protected" from an "abstract base" class. So we can say that class AbstractBase has a "protected abstract virtual base".
2. Now let us check the destructor of class AbstractBase. It is declared as "pure virtual" and "private". So we can call it a "pure virtual private destructor".
3. Now combining both, we can call the destructor of AbstractBase a "protected abstract virtual base pure virtual private destructor".
Use:
This question can measure knowledge of C++. Practically it doesn't have much use from an implementation point of view. The question was made up to prove that C++ is too complex and weird. But when we see the code for such a long definition, it serves the opposite purpose: it is quite easy to express long sentences in very few lines of code.
RapidMind - Stream programming
Why RapidMind?
RapidMind helps us introduce data parallelism into our programs. Data parallelism can improve program speed to a large extent. RapidMind provides data parallelism through stream computing.
Stream computing and stream processors are nothing new to most of us by now. They let us execute kernels (functions) on multiple data elements. Intel SSE and GPUs are examples of stream computing. As of this writing RapidMind supports both the Cell BE (IBM's Cell architecture) and GPUs (such as NVIDIA and ATI), and will support the x86 architecture in the near future.
It is quite easy to convert a serial program to a stream program using RapidMind. RapidMind is implemented purely in C++, and everything is wrapped in the namespace rapidmind, so development is easy as well.
Now I will give an example of converting a normal serial program to a RapidMind program.
The programs below work on elements of 4 floating point values each, so every operation covers 16 bytes. Let us see how the implementation differs between a RapidMind program and a normal program.
Normal program
float SquareofIndividualSquare( float a, float b )
{
return(( a*a + b*b ) * ( a*a + b*b ));
}
int _tmain(int argc, _TCHAR* argv[])
{
float* fFirstElement = new float[2048*2048*4];
float* fSecondElement = new float[2048*2048*4];
int nIndex = 0;
for( int i = 0; i < 2048; i++ )
{
for ( int j = 0; j < 2048; j++ )
{
for ( int floatnum = 0; floatnum < 4; floatnum++, nIndex++ )
{
fFirstElement[nIndex] = float(floatnum);
fSecondElement[nIndex] = float(floatnum);
}
}
}
nIndex = 0;
for( int i = 0; i < 2048; i++ )
{
for ( int j = 0; j < 2048; j++ )
{
for ( int floatnum = 0; floatnum < 4; floatnum++,nIndex++ )
{
fFirstElement[nIndex] = SquareofIndividualSquare( fFirstElement[nIndex],fSecondElement[nIndex] );
}
}
}
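// Release the buffers allocated with new[]
delete[] fFirstElement;
delete[] fSecondElement;
return 0;
}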
RapidMind Program
#include
using namespace rapidmind;
int main()
{
// Do the initialization of rapid mind platform
rapidmind::init();
// Since GPU is used set the backend as GLSL( OpenGL shader )
use_backend("glsl");
// Array is template class.
// Value4f means 4 floats per each element
Array<2,Value4f> a(2048,2048);
Array<2,Value4f> b(2048,2048);
// This is how we get access to the actual array location.
// Now we can use these pointer to -
// manipulate internal data using CPU.
float* fFirstElement = a.write_data();
float* fSecondElement = b.write_data();
// Fill the input arrays
int nIndex = 0;
for( int i = 0; i < 2048; i++ )
{
for ( int j = 0; j < 2048; j++ )
{
for ( int floatnum = 0; floatnum < 4; floatnum++, nIndex++ )
{
fFirstElement[nIndex] = float(floatnum);
fSecondElement[nIndex] = float(floatnum);
}
}
}
// This array can get the output data. A normal array.
Array<2,Value4f> output;
// The stream program that will be executed on the data
// This will be executed on GPU.
Program prg = RM_BEGIN {
In<Value4f> a; // First input
In<Value4f> b; // Second input
Out<Value4f> c; // Output
c = (a*a + b*b)*(a*a+b*b); // Data manipulation
} RM_END;
// Execute the stream program
// The output will be available in output array.
output = prg(a, b);
}
Description:
We can see that in the RapidMind program the innermost for loop is no longer needed. The computation on all 4 floats is done in one line. This is the advantage of stream computing: you can process more than one data element at a time.
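To make the idea concrete without the RapidMind library, here is a minimal plain C++ sketch of the same kernel written against a 4-float element type. The names Float4, SquareOfSumOfSquares and RunKernel are my own and hypothetical; they are not part of RapidMind, and this version runs serially on the CPU. It only shows how the per-float loop disappears from the kernel once the element type carries all four lanes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 4-float element (not a RapidMind type).
struct Float4 {
    float v[4];
};

// Element-wise addition: handles all four lanes of the element.
inline Float4 operator+(const Float4& x, const Float4& y) {
    Float4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = x.v[i] + y.v[i];
    return r;
}

// Element-wise multiplication: handles all four lanes of the element.
inline Float4 operator*(const Float4& x, const Float4& y) {
    Float4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = x.v[i] * y.v[i];
    return r;
}

// The kernel: the same expression as in the stream program, with no
// explicit loop over the individual floats.
inline Float4 SquareOfSumOfSquares(const Float4& a, const Float4& b) {
    const Float4 s = a * a + b * b;
    return s * s;
}

// Applies the kernel to every element of two equally sized inputs.
std::vector<Float4> RunKernel(const std::vector<Float4>& a,
                              const std::vector<Float4>& b) {
    assert(a.size() == b.size());
    std::vector<Float4> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = SquareOfSumOfSquares(a[i], b[i]);
    }
    return out;
}
```

A stream platform would run the body of RunKernel in parallel across the elements; here it is an ordinary loop, which is enough to compare the shape of the code.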
Important:
When we check the performance of the above programs, we will find that the CPU version performs better. But this will not be the case if we do a lot of processing. The CPU gives better performance for programs that do very little processing on the data, because of CPU caching and memory speed. But if we have a large chunk of data and need to do a lot of processing on it, RapidMind will be the better option. We are also expecting an x86 version of RapidMind. If an x86 version becomes available, it may remove this problem as well.
Saturday, October 27, 2007
Performance analysis C++ vs C#
Description
C++ or C#, which is the best language?
This question has a clear answer from a performance point of view. In the performance area, C++ zigs where C# zags.
Let us take a small example.
Here I am doing some matrix operations in C++ and C#. Both execute the same algorithm.
C++ Program
int nRetCode = 0;
const int nSize = 500;
int* nMatrix1 = new int[nSize*nSize];
int* nMatrix2 = new int[nSize*nSize];
int* nMultipliedMatrix = new int[nSize*nSize];
for (int i = 0; i < nSize; i++)
{
for (int j = 0; j < nSize; j++)
{
nMatrix1[i*nSize+j] = i;
nMatrix2[i*nSize+j] = j;
}
}
int nElapsed = 0;
int nLoopCount = 5;
for( int nVal = 0; nVal < nLoopCount; nVal++ )
{
int nStart = GetTickCount();
for (int i = 0; i < nSize; i++)
{
for (int j = 0; j < nSize; j++)
{
for (int k = 0; k < nSize; k++)
{
nMultipliedMatrix[i*nSize+j] = nMatrix1[i*nSize+k] + nMatrix2[k*nSize+j];
}
}
}
nElapsed += GetTickCount()- nStart;
}
delete[] nMatrix1;
delete[] nMatrix2;
delete[] nMultipliedMatrix;
std::cout << nElapsed / nLoopCount;
return nRetCode;
C# Program
class Program
{
public const int nSize = 500;
static void Main(string[] args)
{
int[] nMatrix1 = new int[nSize * nSize];
int[] nMatrix2 = new int[nSize * nSize];
int[] nMultipliedMatrix = new int[nSize * nSize];
for (int i = 0; i < nSize; i++)
{
for (int j = 0; j < nSize; j++)
{
nMatrix1[i*nSize + j] = i;
nMatrix2[i*nSize + j] = j;
}
}
int nLoopCount = 5;
int nElapsed = 0;
for (int nVal = 0; nVal < nLoopCount; nVal++)
{
int nStart = Environment.TickCount;
for (int i = 0; i < nSize; i++)
{
for (int j = 0; j < nSize; j++)
{
for (int k = 0; k < nSize; k++)
{
nMultipliedMatrix[i * nSize + j] = nMatrix1[i * nSize + k] + nMatrix2[k * nSize + j];
}
}
}
nElapsed += Environment.TickCount - nStart;
}
Console.WriteLine(nElapsed/nLoopCount);
}
}
The above programs do some basic matrix operations. Comparing the same algorithm implemented in C++ and C#, we can see that C++ gives much better performance.
When these programs were run on an Intel Pentium 4 3.2 GHz machine with 1 GB RAM, the times taken were as follows.
C++ code (average of 5 executions) = 785 ms.
C# code (average of 5 executions) = 1465 ms.
So we can clearly see that C++ outperforms C# here. Even for a basic algorithm that uses no OOP features such as overloading or runtime polymorphism, the C# program shows this much performance loss.
C# has many other advantages, such as maintainability and understandability, but these come at the cost of performance.
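The C++ measurement above uses the Windows-only GetTickCount(). As a portable alternative, here is a minimal sketch of the same "average over several runs" measurement using std::chrono. The helper names RunMatrixKernel and AverageMilliseconds are my own and hypothetical, and the numbers it produces depend on the compiler, the optimization flags and the machine, so they are not comparable with the figures quoted above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Same inner operation as the benchmark above, on an n x n matrix pair.
// The result is returned so the work cannot simply be discarded.
std::vector<int> RunMatrixKernel(int n) {
    const std::size_t count = static_cast<std::size_t>(n) * n;
    std::vector<int> m1(count), m2(count), result(count);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            m1[i * n + j] = i;
            m2[i * n + j] = j;
        }
    }
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                result[i * n + j] = m1[i * n + k] + m2[k * n + j];
            }
        }
    }
    return result;
}

// Calls work() `runs` times and returns the average elapsed time in
// milliseconds, measured with a monotonic clock.
template <typename F>
double AverageMilliseconds(F&& work, int runs) {
    using Clock = std::chrono::steady_clock;
    std::chrono::duration<double, std::milli> total(0.0);
    for (int r = 0; r < runs; r++) {
        const Clock::time_point start = Clock::now();
        work();
        total += Clock::now() - start;
    }
    return total.count() / runs;
}
```

A typical call would be AverageMilliseconds([] { RunMatrixKernel(500); }, 5), which mirrors the five-run average used in the post.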
SIMD - Intel SSE( Streaming SIMD Extension )
What is SSE?
SSE is an instruction set family with four generations: SSE, SSE2, SSE3 and SSE4. These instructions work on 128-bit registers called XMM registers, so one instruction can process four 32-bit floats at once and an application can run up to 4 times faster. The instruction set reference can be downloaded directly from the Intel website. Two SSE concepts help when reading and writing blocks of data in memory.
Prefetching: Prefetching lets you bring data from memory into the cache before it is needed. You can select which cache level the data is brought into.
Non-temporal storing: Non-temporal storing lets you write data to memory while bypassing the cache. This helps you avoid cache pollution.
Example:
The example below copies a block of memory. The ordinary memcpy function copies data in 4-byte blocks in the best case, but with SSE instructions we can copy 16 bytes at a time. A normal memory copy also pollutes the cache, whereas the SSE store can bypass it. The following code is just a stub.
Normal memory copy:
mov ecx, count
shr ecx, 2 // copying 4 bytes at a time
mov esi, source
mov edi, destination
rep movsd // moves the data from source to destination; ecx holds the number of 4-byte blocks (dwords)
SSE memory copy:
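mov ecx, count
shr ecx, 4 // copying 16 bytes at a time
mov esi, source
prefetchnta [esi] // prefetching the source data into the cache with a non-temporal hint
mov edi, destination
cmp ecx, 0
jz END
NEXT:
movdqa xmm0, [esi] // reading the data from memory, expecting the source to be 16-byte aligned
movntdq [edi], xmm0 // writing data directly to memory bypassing the cache, expecting the destination to be 16-byte aligned
add esi, 16 // advance the source to the next 16-byte block
add edi, 16 // advance the destination to the next 16-byte block
dec ecx // one block fewer to copy; sets the zero flag when done
jnz NEXT
sfence // make the non-temporal stores visible before continuing
END: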
I have written a small code block that copies memory both with and without SSE. The code with SSE runs much faster than the one without it.
So if a program can be data parallelized, SSE instructions can improve its performance considerably.
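The same copy can also be written with compiler intrinsics instead of inline assembly. The following is a minimal sketch, assuming an x86 target with SSE2 and that both buffers are 16-byte aligned with a size that is a multiple of 16; the function name StreamCopy16 is my own and hypothetical. The intrinsics used (_mm_prefetch, _mm_load_si128, _mm_stream_si128, _mm_sfence) correspond to prefetchnta, movdqa, movntdq and sfence.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_stream_si128
#include <xmmintrin.h>  // SSE: _mm_prefetch, _mm_sfence

// Copies `bytes` bytes from source to destination in 16-byte blocks.
// Assumption: both pointers are 16-byte aligned and `bytes` is a
// multiple of 16. Any remaining tail bytes are not copied.
void StreamCopy16(void* destination, const void* source, std::size_t bytes) {
    const __m128i* src = static_cast<const __m128i*>(source);
    __m128i* dst = static_cast<__m128i*>(destination);

    // Hint that the source will be read soon (non-temporal hint).
    _mm_prefetch(static_cast<const char*>(source), _MM_HINT_NTA);

    const std::size_t blocks = bytes / 16;
    for (std::size_t i = 0; i < blocks; ++i) {
        // Aligned 16-byte load followed by a store that bypasses the cache.
        _mm_stream_si128(dst + i, _mm_load_si128(src + i));
    }

    // Make the non-temporal stores globally visible before returning.
    _mm_sfence();
}
```

Whether this beats the library memcpy depends heavily on the buffer size and the processor, so it is worth measuring before adopting it.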
Friday, July 27, 2007
typedef name as identifier of constructor/destructor
Description: As per C++98, a constructor declarator can be a typedef name if the declaration is inside the class member specification. It is not allowed if the declaration appears outside the class member specification.
But GCC and VC++ differ in their behaviour. VC++ has the correct implementation compared to gcc 3.4.2. Consider the following example.
Example:
class Alpha;
typedef Alpha Constructor;
class Alpha
{
public:
Constructor::Constructor(){}
};
int main()
{
Alpha obj;
}
Explanation:
If you compile the above program with VC++, it compiles. But with gcc 3.4.2 it does not compile.
I think this is a mistake in GCC, because it does not take this point of the ISO C++98 standard into account. It may also be an implementation artifact of Visual C++ that gets this code compiled.
Friday, July 20, 2007
Calling a virtual function of a class from its constructor/destructor
Description:
As you know, construction proceeds from base to derived and destruction in the opposite order, so you should avoid calling a virtual function from either a constructor or a destructor. When you call a virtual function from a constructor, the derived class is not yet constructed, and when you call it from a destructor, the derived class is already destroyed. Hence the compiler arranges for the local class's (i.e. the Base class's) version of the virtual function to be called.
If you call a pure virtual function from your constructor/destructor, directly or indirectly, the behavior is undefined. In practice you get either a compile-time or link-time error, or a runtime error such as "Pure virtual function called" (the runtime error normally occurs for an indirect call, i.e. the constructor calls a local non-virtual member function which in turn calls the pure virtual function).
Example 1
#include <iostream>
class Base
{
public:
Base()
{
CallPureVirtual(); // Calling pure virtual indirectly from constructor
}
private:
virtual void PureVirtual() = 0;
void CallPureVirtual(){ PureVirtual(); };
};
class Derived : public Base
{
public:
Derived()
{
}
void PureVirtual()
{
std::cout << "Derived::PureVirtual()" << std::endl;
}
};
int main()
{
Derived obj;
return 0;
}
Output and explanation
If the above program is compiled using gcc/VC++2005/VC6, it will typically fail with a runtime error, because a pure virtual function is called while the Base part is still being constructed.
Example 2
#include <iostream>
class Base
{
public:
Base()
{
std::cout << "From Base Constructor" << std::endl;
VirtualFun();
}
~Base()
{
std::cout << "From Base Destructor" << std::endl;
VirtualFun();
}
void CallVirtual()
{
std::cout << "From Base CallVirtual" << std::endl;
VirtualFun();
}
private:
virtual void VirtualFun()
{
std::cout << "Base::VirtualFun()" << std::endl;
}
};
class Derived : public Base
{
public:
Derived()
{
}
void VirtualFun()
{
std::cout << "Derived::PureVirtual()" << std::endl;
}
};
int main()
{
Derived obj;
obj.CallVirtual();
return 0;
}
Output and explanation
The above program will output,
From Base Constructor
Base::VirtualFun()
From Base CallVirtual
Derived::PureVirtual()
From Base Destructor
Base::VirtualFun()
Here we can see that from the constructor and destructor of the base class, the local (Base) virtual function is called, and in all other cases the virtual function of the actual object is called.
So it is not that you can't call a virtual function from a constructor or destructor, but it may not give you what you wanted.
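If you really need derived behaviour during setup, one common workaround is to finish construction first and only then make the virtual call. Below is a minimal sketch of that idea using a factory function; the names Create, Initialize, VirtualName and Name are my own and hypothetical, not part of the examples above.

```cpp
#include <cassert>
#include <memory>
#include <string>

class Base
{
public:
    virtual ~Base() {}

    const std::string& Name() const { return name_; }

    // Factory: constructs the complete object first, then runs the
    // virtual-dependent initialization, so the override of T is used.
    template <typename T>
    static std::unique_ptr<T> Create()
    {
        std::unique_ptr<T> obj(new T());
        Base& base = *obj;
        base.Initialize();
        return obj;
    }

protected:
    Base() {}

private:
    // Called only after construction has finished.
    void Initialize() { name_ = VirtualName(); }

    virtual std::string VirtualName() const { return "Base"; }

    std::string name_;
};

class Derived : public Base
{
public:
    Derived() {}

private:
    std::string VirtualName() const override { return "Derived"; }
};
```

With this arrangement, Base::Create<Derived>()->Name() yields "Derived", because the virtual call happens after the Derived part of the object exists. The cost is that objects should be created through the factory rather than directly.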