Multi Core Processing

This page captures initial thoughts on multi-core processing. Currently, ParaView supports (essentially) only MPI-based parallel processing; the discussion here concerns shared-memory parallelism on the multi-core processors that are now commonplace.

Motivation

In some sense, the parallelization of ParaView is done: ParaView already supports efficient parallel processing with MPI, the "hard" version of parallel processing based on distributed memory. So far, we have made use of multi-core processors by running multiple processes on each node, treating the cores as if they had separate memory.

This works OK, but it is not the most efficient use of these shared-memory (SMP) machines. An even bigger problem is that this method only really works when running in client/server mode. A very valid mode of operation that is becoming commonplace is using a high-end workstation with a fair number of cores (4-8) to visualize moderately sized data sets (moderate in comparison to some of the large simulations done at Sandia). Right now, the only way to take advantage of the multiple cores is to launch a server locally, and that is a waste of resources on many levels.

Thus, it is becoming vital that we natively support multicore processors.

Approaches

There are several general approaches we can take.

Multiblock parallelism

Multiblock support within ParaView is getting better and better, and we are seeing more data sets that are multiblock. Since all of these blocks are in core anyway, one easy way to take advantage of this natural split of the data is to process each block in a different thread (many algorithms already run independently on each block).
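
To make the idea concrete, here is a rough sketch (not a working design) of how the leaf blocks of a vtkMultiBlockDataSet could be farmed out to threads using VTK's existing vtkMultiThreader. ProcessBlock is a hypothetical stand-in for whatever a filter would do to a single block.

    #include <vector>
    #include "vtkCompositeDataIterator.h"
    #include "vtkDataObject.h"
    #include "vtkMultiBlockDataSet.h"
    #include "vtkMultiThreader.h"

    // Hypothetical stand-in for a filter's per-block algorithm.
    static void ProcessBlock(vtkDataObject *block)
    {
      (void)block; // the real work on one leaf block would go here
    }

    struct BlockWork
    {
      std::vector<vtkDataObject *> Blocks; // flattened leaves of the input
    };

    static VTK_THREAD_RETURN_TYPE ProcessBlocksInThread(void *arg)
    {
      vtkMultiThreader::ThreadInfo *info =
        static_cast<vtkMultiThreader::ThreadInfo *>(arg);
      BlockWork *work = static_cast<BlockWork *>(info->UserData);
      // Round-robin assignment: thread i takes blocks i, i+N, i+2N, ...
      for (size_t b = static_cast<size_t>(info->ThreadID);
           b < work->Blocks.size(); b += info->NumberOfThreads)
      {
        ProcessBlock(work->Blocks[b]);
      }
      return VTK_THREAD_RETURN_VALUE;
    }

    void ProcessMultiBlock(vtkMultiBlockDataSet *input)
    {
      // Flatten the composite structure into a list of leaf blocks.
      BlockWork work;
      vtkCompositeDataIterator *iter = input->NewIterator();
      for (iter->InitTraversal(); !iter->IsDoneWithTraversal();
           iter->GoToNextItem())
      {
        work.Blocks.push_back(iter->GetCurrentDataObject());
      }
      iter->Delete();

      // Run ProcessBlocksInThread once per thread (one per core by default).
      vtkMultiThreader *threader = vtkMultiThreader::New();
      threader->SetSingleMethod(ProcessBlocksInThread, &work);
      threader->SingleMethodExecute();
      threader->Delete();
    }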

Pros:

  • Lots of parallelism for "free." We implement the parallelism once in the pipeline, and simple filters are automatically parallelized.
  • Encourages readers and filters to keep pieces separated, even when they belong to the same logical block (using vtkMultiPieceDataSet). For example, the Exodus reader spends a lot of time appending pieces from separate files together.
  • Changes to the executive to support keys like GEOMETRY_NOT_MODIFIED will continue to work, even when a block is part of a multi-block structure. This would allow the executive to spawn N threads to execute on the data, with each core/thread honouring the flag. Currently, because the filter loops over the data block by block, the geometry cannot be considered fixed. We would like to be able to tell the executive not to re-initialize the output for filters that can re-use the last result.

Cons:

  • Only works for multi-block data, and only at whatever granularity the blocks happen to be stored. That granularity does not guarantee efficiency.
  • The way the pipeline iterates over multiblocks is proving not to be the fastest way to handle multiblock data sets, especially when there are many blocks. Efficiency may dictate writing special filters to handle multiblock data anyway.
  • An issue with vtkMultiPieceDataSet (especially with unstructured data) is that proper handling will require ghost cells. We will need algorithms to generate them; these may have to run automatically, and they may incur a large computational overhead that may or may not parallelize well.
  • I do not expect very many filters to actually be thread safe. At the least, the initialization phase may read and write member variables. In the end, we may need to edit a lot of filters anyway.

Customizing Filters

Although there are an awful lot of filters in VTK, only a small subset is exposed in ParaView, and an even smaller subset is commonly used. We could parallelize ParaView functionality simply by optimizing filters for threading, one at a time, starting with the most common and compute-intensive filters and working down the list.
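
As a rough sketch of what this might look like inside a single filter, assume a hypothetical filter that computes one output scalar per point; each thread gets a contiguous, disjoint range of points, so the inner loop needs no locking:

    #include "vtkMultiThreader.h"
    #include "vtkType.h"

    struct PointWork
    {
      const float *Input;       // one scalar per point
      float *Output;            // preallocated by the filter
      vtkIdType NumberOfPoints;
    };

    static VTK_THREAD_RETURN_TYPE ComputeRange(void *arg)
    {
      vtkMultiThreader::ThreadInfo *info =
        static_cast<vtkMultiThreader::ThreadInfo *>(arg);
      PointWork *work = static_cast<PointWork *>(info->UserData);

      // Split the points into contiguous, disjoint ranges, one per thread.
      vtkIdType n = work->NumberOfPoints;
      vtkIdType begin = info->ThreadID * n / info->NumberOfThreads;
      vtkIdType end = (info->ThreadID + 1) * n / info->NumberOfThreads;

      for (vtkIdType i = begin; i < end; ++i)
      {
        // Stand-in for the filter's real per-point kernel.
        work->Output[i] = 2.0f * work->Input[i];
      }
      return VTK_THREAD_RETURN_VALUE;
    }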

Pros:

  • As each filter is independently optimized, we are likely to get the most efficient form of parallelism this way.
  • The granularity of the parallelism does not depend on how the data is partitioned.
  • Takes true advantage of the shared memory nature of the processors.

Cons:

  • More work (but probably easier work) to parallelize each filter one at a time.
  • Inevitably, some filters will remain non-parallel.

Threading Methods

A related but separate decision for multi-core support is which method to use for parallel programming.

VTK threads

VTK already has thread support, and the threads have been used to parallelize some processes (ray casting for volume rendering comes to mind). The interface is simple: spawn a function in a separate thread. VTK also has mutexes, and we have some semaphore code lying around, both available for synchronization.
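
A minimal sketch of that interface, assuming nothing beyond what vtkMultiThreader and vtkMutexLock already provide (the shared counter is a made-up example of state that needs guarding):

    #include "vtkMultiThreader.h"
    #include "vtkMutexLock.h"

    struct SharedState
    {
      vtkMutexLock *Lock;
      int Counter; // made-up shared state to demonstrate synchronization
    };

    static VTK_THREAD_RETURN_TYPE Worker(void *arg)
    {
      vtkMultiThreader::ThreadInfo *info =
        static_cast<vtkMultiThreader::ThreadInfo *>(arg);
      SharedState *state = static_cast<SharedState *>(info->UserData);

      state->Lock->Lock();   // serialize access to the shared counter
      ++state->Counter;
      state->Lock->Unlock();
      return VTK_THREAD_RETURN_VALUE;
    }

    int main()
    {
      SharedState state;
      state.Lock = vtkMutexLock::New();
      state.Counter = 0;

      vtkMultiThreader *threader = vtkMultiThreader::New();
      int id = threader->SpawnThread(Worker, &state); // run Worker in its own thread
      threader->TerminateThread(id);                  // wait for it to finish

      threader->Delete();
      state.Lock->Delete();
      return 0;
    }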

Pros:

  • Already integrated into VTK and ported to multiple platforms. Does not require yet another software dependency.

Cons:

  • A no-frills implementation of parallelism that does not provide much help beyond spawning threads. This may make it the hardest option to implement with.

OpenMP

OpenMP is a standard set of compiler directives (pragmas) for C/C++ that tell a supporting compiler how to unroll and parallelize loops. It provides a much easier way to perform SIMD-style parallelism, and it includes features for reducing the results at the end.
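
For instance, a sketch of the kind of loop OpenMP handles well: a per-element operation over a flat array, with a reduction to combine per-thread sums at the end (ScaleAndSum is just an illustrative function):

    // Scale every value and accumulate the total. The pragma asks the
    // compiler to split the loop across threads and to sum the partial
    // results from each thread when the loop finishes.
    double ScaleAndSum(float *data, int n, float factor)
    {
      double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < n; ++i)
      {
        data[i] *= factor;
        sum += data[i];
      }
      return sum;
    }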

Pros:

  • A simplified way of parallelizing an algorithm that is gaining popularity.

Cons:

  • From what I've seen (and I admit I haven't seen much), OpenMP works best on operations over simple arrays. It may not be easy to apply OpenMP to our more complicated structures (unstructured grids, poly data).
  • OpenMP requires a compiler that understands the special pragmas specifying how to parallelize the code (or at least an extension to an existing compiler). That will complicate just about any build and may make it impossible on many platforms.

Threading Building Blocks

Threading Building Blocks (TBB) is a runtime library that abstracts the low-level threading details (I just plagiarized that from the TBB site). TBB contains helpful code for parallelization and synchronization, provides concurrent containers, and offers simple parallel algorithms that look similar to the primitives in OpenMP.

To be honest, I know next to nothing about TBB. Someone should check it out and add more details here. There is a book available: http://shop.intel.com/shop/product.aspx?pid=SISW4001
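
With that caveat, here is what the OpenMP loop above appears to look like in TBB, based only on skimming its documentation (treat this as an unverified sketch):

    #include <cstddef>
    #include "tbb/blocked_range.h"
    #include "tbb/parallel_for.h"
    #include "tbb/task_scheduler_init.h"

    // Function object applied to sub-ranges of the loop; TBB decides
    // how to split the iteration space across threads.
    class ScaleBody
    {
    public:
      ScaleBody(float *data, float factor) : Data(data), Factor(factor) {}
      void operator()(const tbb::blocked_range<size_t> &range) const
      {
        for (size_t i = range.begin(); i != range.end(); ++i)
        {
          this->Data[i] *= this->Factor;
        }
      }
    private:
      float *Data;
      float Factor;
    };

    void ScaleArray(float *data, size_t n, float factor)
    {
      tbb::task_scheduler_init init; // initializes the TBB scheduler
      tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
                        ScaleBody(data, factor));
    }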

Pros:

  • Should make parallelizing easier than basic threading. It also seems to have at least as many facilities as OpenMP.

Cons:

  • The GPLv2-with-runtime-exception license means anyone can get the library for free and use it with VTK/ParaView without changing the latter's BSD license. However, it may preclude us from embedding the library inside VTK or ParaView, assuming we would want to do that in the first place. Thus we have the problem of yet another library to link against, and we will also have to maintain versions of each algorithm that work both with and without TBB.

Acknowledgments

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

SAND 2008-1529 P