[Paraview] Creating Large Datasets from CSV

Cory Quammen cquammen at cs.unc.edu
Wed Aug 15 15:04:03 EDT 2012


I'm CC'ing the list in case this helps others, or in case others have
some insight.

I'm accessing the Xdmf2 and H5 libraries through C++. The XML
descriptions generated by our translator are sometimes several
megabytes. They are large because there are sometimes many hundreds of
time steps and a dozen or so fields, and the data is replicated in
several grids in the various coordinate spaces that our collaborators
want to see.

I did some profiling of the Xdmf2 library, and the problem seems to be
the buffer reallocation strategy used while the XML text is generated.
Each time more room is needed, a new buffer is created with just enough
room to hold the newly serialized XML, and the text in the previous
buffer is copied into it. In other words, everything written so far is
copied again on nearly every append, so the total amount of copying
grows roughly quadratically with the size of the output. I modified my
local copy of the Xdmf2 library to double the size of the buffer any
time a reallocation is needed, and that sped up the serialization
tremendously. Building the Xdmf tree still took a long time, but I
haven't looked into why that is slow.
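
To illustrate the difference between the two strategies, here is a toy
sketch. It is not the actual Xdmf2 code; TextBuffer, Growth, and
serialize are names I made up for the example, and the 16-byte starting
capacity is an arbitrary assumption. It just counts how many bytes get
copied during reallocation under each strategy.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>
#include <utility>

// Hypothetical append-only text buffer (NOT part of Xdmf2). It records
// how much old data is copied whenever the buffer has to grow.
class TextBuffer {
public:
    enum class Growth { ExactFit, Doubling };

    explicit TextBuffer(Growth growth) : growth_(growth) {}

    void append(const std::string& text) {
        if (text.empty()) return;
        const std::size_t needed = size_ + text.size();
        if (needed > capacity_) {
            // ExactFit: allocate just enough room for the new total.
            std::size_t newCapacity = needed;
            if (growth_ == Growth::Doubling) {
                // Doubling: grow geometrically until the text fits.
                newCapacity = capacity_ ? capacity_ : 16;
                while (newCapacity < needed) newCapacity *= 2;
            }
            std::unique_ptr<char[]> bigger(new char[newCapacity]);
            if (size_ > 0) std::memcpy(bigger.get(), data_.get(), size_);
            bytesCopied_ += size_;  // old contents moved to the new buffer
            ++reallocations_;
            data_ = std::move(bigger);
            capacity_ = newCapacity;
        }
        std::memcpy(data_.get() + size_, text.data(), text.size());
        size_ = needed;
    }

    std::string str() const {
        return size_ ? std::string(data_.get(), size_) : std::string();
    }
    std::size_t size() const { return size_; }
    std::size_t bytesCopied() const { return bytesCopied_; }
    std::size_t reallocations() const { return reallocations_; }

private:
    Growth growth_;
    std::unique_ptr<char[]> data_;
    std::size_t size_ = 0;
    std::size_t capacity_ = 0;
    std::size_t bytesCopied_ = 0;
    std::size_t reallocations_ = 0;
};

// Stand-in for serializing a tree: append one small XML fragment per
// element. The fragment is a placeholder, not real Xdmf output.
TextBuffer serialize(TextBuffer::Growth growth, std::size_t elements) {
    TextBuffer buffer(growth);
    for (std::size_t i = 0; i < elements; ++i) {
        buffer.append("<DataItem/>\n");
    }
    return buffer;
}
```

Calling serialize() with Growth::ExactFit and with Growth::Doubling for
the same number of elements and comparing bytesCopied() shows the gap:
the exact-fit version reallocates on every append, while the doubling
version reallocates only a handful of times and copies an amount
proportional to the final size.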

I may be wrong, but Xdmf2 doesn't seem to be under active development,
and there doesn't seem to be much community support for it, which is a
shame because it is a nice format.


On Wed, Aug 15, 2012 at 2:50 PM, David Zemon <david.zemon at mst.edu> wrote:
> Cory,
> What takes so long in the XML description? For me, all of my time goes into
> translating the data, mostly in file I/O. Attached is an example of one Xdmf
> file I've been working with.
> Also, what language are you using?
> Cheers,
> David
> On 08/15/2012 01:36 PM, Cory Quammen wrote:
>> David,
>> I'm currently working on a translator that sounds very similar to
>> yours. It uses HDF5 for the heavy data and that part works fine.
>> For what it's worth, my experience is that the Xdmf library is
>> painfully slow when serializing a large tree. For some of the data
>> sets that I work with, writing the XML description of the data takes
>> far longer than writing the HDF5 files.
>> Cory
>> On Wed, Aug 15, 2012 at 2:29 PM, David Zemon <david.zemon at mst.edu> wrote:
>>> Cory,
>>> Unfortunately no, it isn't. Reading CSV is just a small stepping-stone in
>>> the overall goal of this project. I'm trying to make a reader that will
>>> convert any text-delimited file of any size (we have professors on campus
>>> with terabytes of data, necessitating that it run separately from
>>> ParaView). I also plan to give the user options like creating a
>>> difference field between a column in one file and another column in a
>>> different file.
>>> David
>>> On 08/15/2012 01:19 PM, Cory Quammen wrote:
>>>> David,
>>>> Just curious, is ParaView's CSV reader not sufficient for reading your
>>>> files?
>>>> Cory
>>>> On Wed, Aug 15, 2012 at 1:58 PM, David Zemon <david.zemon at mst.edu> wrote:
>>>>> Hello,
>>>>> I'm creating a reader to convert large datasets from CSV to a
>>>>> ParaView-readable format. XDMF was chosen because it seems simple to
>>>>> use and understand. This worked well while I was testing small
>>>>> datasets, but when I scaled up to larger data, I ran into a problem
>>>>> where the XML node was too large (it could not have 350,000 rows).
>>>>> I want to make sure now that I am on the right track. I've decided to
>>>>> start researching the HDF5 format and will place all of my data into
>>>>> an HDF5 file and then include that in the XDMF file. Does this seem
>>>>> reasonable? Is there a better way to do it?
>>>>> Thank you,
>>>>> David Zemon

Cory Quammen
Research Associate
Department of Computer Science
The University of North Carolina at Chapel Hill
