


Is there any reason you are using OpenMP to cast your initial thread split but Intel TBB for the rest of the parallelization, rather than, say, using a TBB parallel_invoke to do the initial task splitting and then carrying on with Intel TBB through the rest of the parallel stack? The reason I ask is that while Intel's OpenMP and Intel TBB can coexist, and TBB will defer to OpenMP if oversubscription is occurring, there is extra overhead incurred by using both threading models (not to mention maintaining multiple thread pools). You might be able to get by with lower overhead by just using Intel TBB throughout. Not only would this eliminate the overhead of two parallelization libraries competing with each other, it would also remove that #pragma omp parallel from within the outer loop and some potential thrash in OpenMP overhead.

I would appreciate any suggestions about what this fork/dispatcher overhead is and how it can be reduced.
Histogram maker openmp windows
There are other threads executing at the same time. Windows Task Manager reports CPU usage around 85%. VTune says that there is no oversubscription; in fact, CPU usage is well below the target of 12 logical processors for my 6-core hyperthreaded i7.
Histogram maker openmp code
I already know that the function is worth parallelizing in this way, because my whole program performs much more poorly if I comment out the pragmas or insert a num_threads(1) clause. The two time-consuming sections of code take approximately the same amount of time, within about +/-20%. I have also tried changing the code to use two consecutive parallel regions, each containing one parallel for; that version performs significantly worse than the version shown.

My intention is to have two parallel-for loops running concurrently. According to VTune, I have written a function that incurs a tremendous amount of overhead in the fork call invoked on behalf of a particular parallel region of mine. That fork accounts for roughly a third of all CPU time in my program.
