Monday, December 10, 2007

What if someone comes up with a compiler capable of parallelizing algorithms?

In many training sessions I have heard the same question from business people: "If I train all my employees in parallel computing, can you guarantee that no one will come up with an automatically parallelizing compiler tomorrow? Won't all the money I spent on the training be wasted?"

Is it a big deal?

This has been a very serious question for a long time. Maybe most companies are still hoping that something will happen and performance will improve automatically. But it is quite easy to show that automatic parallelization of an algorithm is not going to work out.

How easy is it to prove?

Maybe someone can come up with a compiler that does some amount of parallelization. Such compilers already exist!!! But they definitely cannot parallelize the logic of your algorithm, can they? The core logic cannot be parallelized without a skilled programmer and an algorithm expert. A compiler can do some amount of logic-less parallelization, such as splitting up simple independent loops, which is not going to increase performance drastically. For a drastic improvement, almost all algorithm implementations need to be rewritten in a parallel way.

Back when these algorithms were first implemented, there was no parallel computing on a personal computer. Only supercomputers had parallel processors. Now times have changed. Almost every desktop user can afford a multi-core machine, and many have a high-end graphics card inside as well. So if you become skilled in parallel programming, it is certainly not going to be a waste. No compiler can be as intelligent as you at parallelizing. Only a skilled human can do parallel programming. Get yourself trained or leave your job to others!!!
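To make the point concrete, here is a minimal sketch in C++ of what "rewriting the algorithm in a parallel way" can look like. It uses a running total (prefix sum) as the example, which is my own choice and not taken from any particular product. In the serial loop every step depends on the previous one, so there is nothing a compiler can simply split up. The parallel version works only because the algorithm itself is restructured into three passes. The function names `serial_scan` and `blocked_scan` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Serial running total: out[i] depends on out[i - 1], a loop-carried
// dependence, so the iterations cannot simply be handed to different cores.
std::vector<long long> serial_scan(const std::vector<int>& in) {
    std::vector<long long> out(in.size());
    long long running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += in[i];
        out[i] = running;
    }
    return out;
}

// Hand-restructured version (illustrative helper, not a library function):
// scan each chunk independently, turn the chunk totals into offsets,
// then add each chunk's offset to its elements.
std::vector<long long> blocked_scan(const std::vector<int>& in,
                                    unsigned num_threads) {
    assert(num_threads > 0);
    const std::size_t n = in.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<long long> out(n);
    std::vector<long long> block_total(num_threads, 0);

    // Half-open range [chunk_begin(t), chunk_end(t)) owned by thread t.
    auto chunk_begin = [&](unsigned t) { return std::min(n, t * chunk); };
    auto chunk_end = [&](unsigned t) {
        return std::min(n, chunk_begin(t) + chunk);
    };

    // Pass 1 (parallel): every thread scans only its own chunk.
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            long long running = 0;
            for (std::size_t i = chunk_begin(t); i < chunk_end(t); ++i) {
                running += in[i];
                out[i] = running;
            }
            block_total[t] = running;
        });
    }
    for (auto& w : workers) w.join();
    workers.clear();

    // Pass 2 (serial, one step per thread): starting offset of each chunk.
    std::vector<long long> offset(num_threads, 0);
    for (unsigned t = 1; t < num_threads; ++t) {
        offset[t] = offset[t - 1] + block_total[t - 1];
    }

    // Pass 3 (parallel): shift each chunk by the total of all earlier chunks.
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = chunk_begin(t); i < chunk_end(t); ++i) {
                out[i] += offset[t];
            }
        });
    }
    for (auto& w : workers) w.join();
    return out;
}
```

Calling `blocked_scan(data, 4)` gives the same result as `serial_scan(data)`. Note that the parallel version does more total work than the serial one, so whether it is actually faster depends on the data size and the number of cores. That trade-off is exactly the kind of decision a skilled programmer has to make.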

Wednesday, December 5, 2007

Think parallel, save your product

I always wonder why we all talk about performance. We use a lot of jargon like performance, optimization, throughput, scalability, parallel thinking, hybrid computing, threads, locks, semaphores, synchronization, penalty, data parallelism, instruction parallelism and much more.

Why are we aiming at performance? What happened to the software industry in such a short span? Instead of just making more and more new software, why is everyone thinking about increasing the performance of the existing software?

What is the motivation?

Motivation number 1... The customer satisfaction...

I never want to know how complex the algorithm is; I want it done in a fraction of a second...
I paid $10000 for your product and now there are 100 other products that take a second to do what yours gets done in a day... Did you cheat me?
My system has a quad-core processor and your program takes 100 seconds to get things done. When I look at the CPU usage it is just 25%. What are you doing inside your program? Why did I spend all my money on a quad-core machine?

Motivation number 2... A stitch in time saves nine...

What if your software takes an hour to detect bleeding inside the brain? The patient may die before you detect the problem.
What if it takes an hour to diagnose your vehicle's electrical problem? Don't you have a tight schedule?

Motivation number 3... There is no point in watering a dead plant...
What is the use of detecting a tsunami after it has hit the shore?
What is the use of predicting tomorrow's weather 2 days later?
What are you going to achieve if you detect a fire in a chamber and the alarm is triggered 10 seconds late?

I conclude it like this.
Time matters. If your competitor does the same thing much faster than you, what happens? The answer is clear. You can't compete anymore.

Parallel Extension to .NET Framework

Something for .NET programmers to hope for!!!

Microsoft has come up with a parallel extension to the .NET Framework (managed code). This may be a revolution in making high-performance programs with .NET. But I wonder how many HPC applications can be built with .NET, given its lack of speed compared to a C/C++ program. I have written another article about the performance difference between C++ and C#. To make use of all the cores in a multicore environment we definitely need threading, so using the parallel extensions can make a program run considerably faster on a system with more than one core. Even with the parallel extensions, managed code is unlikely to run as fast as unmanaged code. So for performance, either C or C++ is the best. For more information about the parallel extensions, visit the blog.

Does C# still lag in performance?

C# increases productivity by compromising performance. But if someone else builds the same application in C++ and it takes half the time, what is the use of that productivity? Maybe you can start selling earlier, and stop selling earlier too.

Monday, December 3, 2007

High performance computing

As greedy human beings, we never settle for what we have. We always look for more. That is exactly what is happening in the high performance computing industry. Long ago we had very slow processors that took seconds to sort a small chunk of data. Now we want to predict the climate of each and every location in the world within seconds. Lots of medical imaging algorithms are waiting on the shore for more computation power. Lots of generic algorithms could solve many real-world problems but need far more computational power. That is why the high performance computing industry is so hot.

The business

The first name that comes up when looking into high performance computing is Intel. They are coming out with processors with more and more cores. Then there is AMD, with both CPUs and GPUs (ATI is now AMD's). But the leader in GPUs is still NVIDIA, with their latest GPU, the GeForce 8800 Ultra, which has 128 stream processors. There is a brand new architecture from IBM called Cell, and Intel is going to release Larrabee next year.

Looking at software, Intel has its own compiler, which compiles high-level code to machine code highly optimized for the CPU. NVIDIA has CUDA for general-purpose programs on the GPU. There is DirectX and HLSL from Microsoft for the GPU, Cg and CgFX from NVIDIA, OpenGL from the ARB, and RapidMind has its own stream programming libraries. And we no longer hear about the PeakStream brand because Google bought it.

The Free Lunch Is Over, A Fundamental Turn Toward Concurrency in Software - Herb Sutter

Until now programmers didn't need to think much about performance optimization. The hardware vendors kept improving their hardware, so software got faster without any changes. But that has reached its limit. The clock speed can't be increased much further; the power can't be increased because of heat dissipation; the physics is catching up. The free lunch is over. Now it is multicore. It is parallel thinking that can improve performance.

Think parallel or perish - Intel

Intel says either think parallel or perish. If you keep writing single-threaded serial code, your software will become outdated. Do you think someone is going to buy your software when another product can do the same thing in one tenth of the time? Only parallelism can improve performance now.


What is this GP in front of GPU? Does GPGPU look weird? But it is reality. GPGPU means using the GPU for general-purpose tasks instead of the usual graphics tasks, and you can run a lot of heavily multithreaded work on it. CUDA from NVIDIA is the best way to write GPGPU programs.


If your algorithm is not parallel, if your application is still running on a single thread, if you keep thinking that someone else will speed up your algorithm, you will perish. Your algorithm will cease to exist. Better late than never!!!

Saturday, December 1, 2007

CUDA - Compute Unified Device Architecture

Compute Unified Device Architecture is an easy way to use the GPU for general-purpose programming. No graphics knowledge is required to write a GPU program with CUDA. A CUDA program is almost the same as a C program, with some additional features. In CUDA a function can be run by many threads by giving an execution configuration when calling it. There are 3 kinds of functions in CUDA: a device function, which can only be executed on the device and called from the device; a global function, which is called from the host (CPU) with an execution configuration and gets executed on the device; and a pure host function, which is executed on the CPU only.

There are some additional specifiers to distinguish the function type.

__device__ - a function prefixed with __device__ becomes a device function, which can only be executed on the device and can only be called from a function that executes on the device.
__host__ - these are the normal C functions that are executed on the host (CPU).
__global__ - a function prefixed with __global__ can be called from the CPU. To call it, the execution configuration must be given. The execution configuration decides how many blocks, and how many threads per block, are created to execute the function.

There are also different kinds of memory:

__shared__ - a variable prefixed with this lives in shared memory, which is shared across the threads of a block. This is the fastest memory that threads can share.
__constant__ - memory that can be written from the host only; the device can only read it.
__device__ - memory that can be written from both the device and the host.

Each thread has a thread ID and a block ID, which tell it which part of the data it has to process. This is the tricky area where all the performance improvement lies.
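The thread ID and block ID idea can be sketched on the CPU. What follows is plain C++ that emulates the indexing, not actual CUDA code. The struct `LaunchIndex` and the functions `scale_one_element`, `blocks_needed` and `emulate_launch` are hypothetical names of mine; the struct fields stand in for CUDA's built-in `blockIdx.x`, `blockDim.x` and `threadIdx.x`. In a real kernel the index is computed as `blockIdx.x * blockDim.x + threadIdx.x`, and the two nested loops below are replaced by the execution configuration given at the call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-ins (assumed names) for CUDA's blockIdx.x, blockDim.x, threadIdx.x.
struct LaunchIndex {
    unsigned block_idx;   // which block this thread belongs to
    unsigned block_dim;   // number of threads per block
    unsigned thread_idx;  // position of this thread inside its block
};

// The work of ONE thread: find my element from my IDs, then process it.
void scale_one_element(const LaunchIndex& idx, std::vector<float>& data,
                       float factor) {
    const std::size_t i =
        static_cast<std::size_t>(idx.block_idx) * idx.block_dim + idx.thread_idx;
    // The last block may have spare threads, so guard against running past
    // the end of the data.
    if (i < data.size()) {
        data[i] *= factor;
    }
}

// Blocks needed to cover n elements, rounding up.
unsigned blocks_needed(std::size_t n, unsigned threads_per_block) {
    assert(threads_per_block > 0);
    return static_cast<unsigned>((n + threads_per_block - 1) / threads_per_block);
}

// CPU emulation of a launch: on a GPU these iterations run concurrently,
// here they simply run one after another.
void emulate_launch(std::vector<float>& data, float factor,
                    unsigned threads_per_block) {
    const unsigned num_blocks = blocks_needed(data.size(), threads_per_block);
    for (unsigned b = 0; b < num_blocks; ++b) {
        for (unsigned t = 0; t < threads_per_block; ++t) {
            scale_one_element({b, threads_per_block, t}, data, factor);
        }
    }
}
```

With 10 elements and 4 threads per block, `blocks_needed` returns 3, so 12 threads are started and the last two do nothing because of the bounds check. Choosing how the data maps onto blocks and threads is where the tuning effort goes.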