A volatile reference

void foo(volatile double& bar )

{

cout << bar << endl;

}

The function above accepts a volatile reference to double. What happens if you call it with an int?

int nVal = 0;

foo( nVal );

The compiler reports an error saying that the conversion from ‘ int ’ to ‘ volatile double & ’ is not possible.

Why shouldn’t it convert?

A conversion is needed whenever an int is passed to a function that takes a double. The standard conversion from int to double implicitly creates a temporary object. A function that takes a non-const reference parameter, such as a volatile reference, can change the value of its argument. Now put these two facts together. If the temporary created by the implicit conversion were passed to a function that accepts a volatile reference, the function could modify the temporary, and that change would never reach the original variable. For example,

void foo(volatile double& bar )

{

bar = 0;

}

int nVal = 10;

foo(nVal);

cout << nVal;

What do we expect as output? The intent was to set nVal to 0. But what would happen if the code above were allowed to run? An implicit conversion from int to double would be needed, and the resulting temporary would be passed to foo. So foo would set the temporary to 0 instead of nVal, and nVal would still be 10. This is why a function taking a volatile reference to double does not compile when called with an int.
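To see the behaviour the compiler is guarding against, the conversion can be written out by hand. This is only an illustrative sketch: `Result` and `simulate_rejected_call` are hypothetical names that do not appear in the original code, and the explicit `double temp` stands in for the temporary that an implicit conversion would create.

```cpp
// Same signature as in the post: a non-const (volatile) reference parameter.
void foo(volatile double& bar)
{
    bar = 0;
}

// Hypothetical helper type: holds both values after the call.
struct Result
{
    int original;      // the caller's int
    double temporary;  // the converted copy that foo actually received
};

// Hypothetical helper: performs by hand the int -> double conversion that
// the compiler refuses to perform implicitly for foo(nVal).
Result simulate_rejected_call(int nVal)
{
    double temp = nVal;  // the conversion result is a separate object
    foo(temp);           // foo zeroes temp, not nVal
    return Result{nVal, temp};
}
```

Calling `simulate_rejected_call(10)` leaves `original` at 10 while `temporary` becomes 0, which is exactly the silent mismatch the compile error prevents.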

A constant reference

void foo(const double& bar )

{

cout << bar << endl;

}

Now what should happen? Function foo accepts a const reference to double, which in turn guarantees that foo does not update the parameter bar. In this case it is safe to pass a temporary object, so the compiler allows the following call.

int nVal = 0;

foo( nVal );

Conclusion

If a temporary object could be passed to a function that takes a volatile reference, the result would be many hard-to-find bugs.

## Thursday, February 21, 2008

## Tuesday, February 12, 2008

### Compromising quality for performance

Trading accuracy for performance has always been a hard decision. In most cases, highly complex and time-consuming applications need the highest accuracy. Accuracy has always been the problem when using a GPU to speed up such algorithms. Since GPUs don’t support double-precision arithmetic, it looks hard to achieve high precision with them. The CPU does floating-point division using double-precision arithmetic, but even the latest GPU from nVIDIA (the 8800) implements division as single-precision multiplication by the reciprocal.

There will be situations when you have to deal with very convoluted shapes. In such cases it is hard to settle for single-precision floating-point accuracy. On the other hand, if the CPU is used it may be impractical to run the algorithm in real time. That raises the tough question: can accuracy be compromised for performance?

If the answer is Yes!

If the highest accuracy is not that important, we can of course go for the GPU. Its massive computational power can be used to run the algorithm in real time. Allowing a percentage of tolerance in the output becomes an easy way of optimizing your algorithm. In this case you must make sure the tolerance stays within a range that keeps the algorithm usable.

If it is a Big No!

Here comes the problem. You have an algorithm that cannot run in real time because the available hardware lacks the computational power, so you must settle for single-precision arithmetic on the GPU. What can be done to improve performance while keeping the highest accuracy? One idea is to use both the CPU and the GPU. First, run all the parallel code on the GPU and calculate the output. Then measure the error of that output using a CPU version of the code, which gives the highest accuracy. Finally, run a small number of iterations of your algorithm on the CPU to find the best value.

A case study

Suppose you have to register two 3D surfaces. First, calculate the parameters needed to register the two surfaces on the GPU. It may take many iterations to find the rotation, translation, scaling and shearing parameters. Once approximate registration parameters have been found on the GPU, run a few iterations on the CPU to reach the best convergence. You gain performance by doing most of the iterations on the GPU, and the accuracy is also good because the final calculation is done on the CPU, starting from the approximate parameters the GPU produced. This strategy can be used in most cases where the highest accuracy is needed.

## Monday, February 11, 2008

### Did Herb Sutter fight with the Amdahl’s law?

Understanding Amdahl’s law

No matter how much speedup you get for the parallel code, it is impossible to reach a 2x speedup if 75% of the algorithm cannot be parallelized. Even if you ran the other 25% in 0 seconds, the gain would be at most about 1.33x. To quantify: if your algorithm takes 1 second and 0.75 second of it cannot be improved, then no matter how much you improve the other 0.25 second, you can’t cut the total time in half.

Understanding Gustafson’s law

Rather than speedup, Gustafson takes work into consideration. It states that if your algorithm takes 1 second to complete and 0.75 second of it cannot be improved, you can still get a 2x speedup if you add a new feature that takes another 1 second, of which 0.75 second is improvable. In the total of 2 seconds, 1 second (0.75+0.25) of the code is then non-improvable and the other half is improvable. Suppose you get infinite improvement, which in turn reduces the improvable 0.25+0.75 second to 0 seconds: the total time of the algorithm goes down to 1 second. You have reached a 2x speedup by adding more work.

So did Herb fight Amdahl’s law and win?

No. Herb didn’t even try to fight Amdahl’s law. Herb only shows that it is better to take Gustafson’s law into consideration when deciding whether to parallelize. Amdahl’s law is still what you must use to calculate the speedup that can be achieved by increasing the number of cores. The important point is that you should not be discouraged by the result of an Amdahl’s law calculation. It is much more practical to recognize that your application will have more features added in the future, and that those may improve drastically when tasks are executed in parallel. It also means thinking in terms of the number of cores rather than fixed-size problems.
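The numbers in this example can be checked with one small function. Note that this follows the example’s own seconds-based reasoning rather than Gustafson’s scaled-speedup formula; the name `speedup_with_improvement` and its parameters are hypothetical.

```cpp
// Overall speedup when `improvable` seconds of the run are accelerated by
// `factor` and `fixed` seconds are left untouched.
double speedup_with_improvement(double fixed, double improvable, double factor)
{
    double before = fixed + improvable;
    double after = fixed + improvable / factor;
    return before / after;
}
```

For the original program, `speedup_with_improvement(0.75, 0.25, factor)` stays below about 1.33 however large `factor` becomes. After the new feature is added, the totals are 1 second fixed and 1 second improvable, and `speedup_with_improvement(1.0, 1.0, factor)` approaches 2 as `factor` grows.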

Conclusion

Break Amdahl’s law: this is the correct attitude. You must not get stuck just because Amdahl’s law says the algorithm cannot be improved much.

“Herb fought the law—Amdahl's Law, that is—and Herb won” (DDJ) is not a good title. If anyone has ever tried to break Amdahl’s law, it is Gustafson, and even Gustafson named his paper “Reevaluating Amdahl's Law”.
