View Issue Details
ID: 0010530
Project: ParaView
Category: Bug
View Status: public
Date Submitted: 2010-04-09 13:38
Last Update: 2011-05-16 21:50
Reporter: Alan Scott
Assigned To: Ken Moreland
Priority: normal
Severity: minor
Reproducibility: always
Status: closed
Resolution: fixed
Summary: 0010530: ParaView does not scale well to huge numbers of cores
Description: ParaView seems to have a problem scaling to huge numbers of cores. I was getting out-of-resource errors from MPI, on random cores, when trying to pass the 5360-core limit (plus or minus a few dozen). For all practical purposes, passing this limit is not necessary at this time, but it will be before too long.
Tags: No tags attached.

 Relationships
related to 0010261 (closed): ParaView does not scale above 1024 processors well
related to 0010672 (closed, handled by Utkarsh Ayachit): Slow client side rendering due to communication with server

  Notes
(0020115)
Alan Scott (manager)
2010-04-09 17:12

Here is a copy of the output on the server side:

[rs181][[21896,1],0][connect/btl_openib_connect_oob.c:847:qp_create_one] error creating qp errno says Resource temporarily unavailable

[rs181][[21896,1],0][connect/btl_openib_connect_oob.c:1193:rml_recv_cb] error in endpoint reply start connect
[rs181:23477] [[21896,0],0]-[[21896,1],0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)


Also, note that the core that is failing seems to be random (so this probably isn't a localized hardware issue), and that the failure occurs right at the end of establishing the ParaView client/server link, at ParaView initialization time.
(0020182)
Ken Moreland (manager)
2010-04-14 13:40

I am pretty sure I have traced this problem to vtkPVProgressHandler::CleanupSatellites. Process 0 receives a message from a few nodes, and then everything (but those nodes it received from) locks up.

My unverified suspicion is that there are a bunch of unhandled asynchronous progress messages sent to process 0 that are filling up the MPI buffers and preventing this cleanup from finishing.

I also suspect that even when CleanupSatellites completes, there might be several unhandled messages left over. The method attempts to cancel the communication, but even a canceled communication can complete. I don't think that is ever checked.
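As a rough sketch of the pattern being described (this is not the actual CleanupSatellites code; the function name below is hypothetical), a canceled nonblocking receive still has to be completed, and only MPI_Test_cancelled on the resulting status says whether the cancel took effect or the message was delivered anyway:

// Hypothetical sketch: cancelling a pending nonblocking receive and checking
// whether the cancel succeeded or the message completed anyway.
#include <mpi.h>

void drain_pending_receive(MPI_Request* request)
{
  // Ask MPI to cancel the outstanding receive.
  MPI_Cancel(request);

  // The request must still be completed (MPI_Wait/MPI_Test) after MPI_Cancel;
  // a cancel is only a request, not a guarantee.
  MPI_Status status;
  MPI_Wait(request, &status);

  int cancelled = 0;
  MPI_Test_cancelled(&status, &cancelled);
  if (!cancelled)
  {
    // The message arrived before the cancel took effect, so it has to be
    // handled (or at least consumed) here rather than left in the queue.
  }
}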

I need to talk this over with Utkarsh.
(0020991)
Utkarsh Ayachit (administrator)
2010-06-11 11:18

Ken is absolutely right (no surprise there ;)). The issue is indeed my misinterpretation of MPI_Test(). From the MPI documentation for MPI_Test:

"For send operations, the only use of status is for MPI_Test_cancelled or in the case that there is an error, in which case the MPI_ERROR field of status will be set."

However, the satellites are using it to determine whether the message was received by the root, which is WRONG. This results in the satellites choking the MPI communication channels with progress events.
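To illustrate the distinction (a minimal sketch, not the actual progress-handler code; the helper name is hypothetical): for a nonblocking send, the flag returned by MPI_Test only indicates local completion, i.e. that the send buffer may be reused, never that the root has received or processed the message.

#include <mpi.h>

// Hypothetical helper: true once the nonblocking send has completed locally.
bool send_buffer_reusable(MPI_Request* sendRequest)
{
  int flag = 0;
  MPI_Status status;  // For sends, status is only meaningful for
                      // MPI_Test_cancelled or error reporting.
  MPI_Test(sendRequest, &flag, &status);
  return flag != 0;   // Local completion only, NOT "received by root".
}

// If a satellite actually needs to know that rank 0 received the message, it
// has to wait for an explicit acknowledgment (a matching receive posted for a
// reply from rank 0), not infer it from MPI_Test on the send request.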
(0020993)
Utkarsh Ayachit (administrator)
2010-06-11 12:03

commit 8d0c8aa5288b368e7d4193ad8424e19ea7a28104
Author: Utkarsh Ayachit <utkarsh.ayachit@kitware.com>
Date: Fri Jun 11 12:01:22 2010 -0400

    Performance improvement for BUG 0010530.
    
    Ensuring that progress events are not sent anywhere unless a 2-second timeout has passed.
    Reduces the frequency of progress events.
    
    There was a bug in vtkProcessModuleConnectionManager which was not initializing
    the self connection, consequently progress wasn't working in built-in mode.
    Fixed that as well.
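The throttling idea from the commit message could look roughly like the sketch below; the names ReportProgress and SendProgressToRoot are hypothetical, not the actual vtkPVProgressHandler API.

#include <mpi.h>
#include <cstdio>

// Hypothetical stand-in for whatever actually forwards progress to rank 0.
static void SendProgressToRoot(double progress)
{
  std::printf("progress: %g\n", progress);
}

void ReportProgress(double progress)
{
  static double lastSendTime = -1.0;
  const double minInterval = 2.0;  // seconds, per the commit message

  double now = MPI_Wtime();
  if (lastSendTime < 0.0 || now - lastSendTime >= minInterval)
  {
    lastSendTime = now;
    SendProgressToRoot(progress);  // send at most once every 2 seconds
  }
  // Otherwise the event is simply dropped, reducing the volume of progress
  // messages the satellites push toward process 0.
}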
(0026499)
Ken Moreland (manager)
2011-05-11 18:19

I believe we addressed the issue that caused this bug. We continue to perform scaling studies on interactive ParaView, but I do not think there is any further need to keep this bug open.
(0026513)
Alan Scott (manager)
2011-05-16 21:50

I agree with Ken. Either the bug reported here is a problem with MPI (for which I have a workaround) or with IceT, for which we also have a replacement.

 Issue History
Date Modified Username Field Change
2010-04-09 13:38 Alan Scott New Issue
2010-04-09 17:12 Alan Scott Note Added: 0020115
2010-04-14 13:40 Ken Moreland Note Added: 0020182
2010-04-14 13:40 Ken Moreland Status backlog => tabled
2010-04-14 13:40 Ken Moreland Assigned To => Ken Moreland
2010-04-14 13:41 Ken Moreland Relationship added related to 0010261
2010-06-11 11:12 Utkarsh Ayachit Relationship added related to 0010672
2010-06-11 11:18 Utkarsh Ayachit Note Added: 0020991
2010-06-11 12:03 Utkarsh Ayachit Note Added: 0020993
2010-09-01 11:27 Utkarsh Ayachit Target Version 4.0 => 3.10.shortlist
2011-05-11 18:19 Ken Moreland Note Added: 0026499
2011-05-11 18:19 Ken Moreland Status tabled => resolved
2011-05-11 18:19 Ken Moreland Resolution open => fixed
2011-05-16 21:50 Alan Scott Note Added: 0026513
2011-05-16 21:50 Alan Scott Status resolved => closed