View Issue Details | ||||||||
ID | Project | Category | View Status | Date Submitted | Last Update | ||||
0010530 | ParaView | Bug | public | 2010-04-09 13:38 | 2011-05-16 21:50 | ||||
Reporter | Alan Scott | ||||||||
Assigned To | Ken Moreland | ||||||||
Priority | normal | Severity | minor | Reproducibility | always | ||||
Status | closed | Resolution | fixed | ||||||
Platform | OS | OS Version | |||||||
Product Version | |||||||||
Target Version | Fixed in Version | ||||||||
Summary | 0010530: ParaView does not scale well to huge numbers of cores | ||||||||
Description | ParaView seems to have a problem scaling to huge numbers of cores. I was getting out-of-resource errors from MPI, on random cores, when trying to pass the 5360-core limit (± a few dozen). For all practical purposes, passing this limit is not necessary at this time, but it will be before too long. | ||||||||
Tags | No tags attached. | ||||||||
Project | |||||||||
Topic Name | |||||||||
Type | |||||||||
Attached Files | |||||||||
Relationships | |||||||||||
Notes | |
(0020115) Alan Scott (manager) 2010-04-09 17:12 |
Here is a copy of the output on the server side:
[rs181][[21896,1],0][connect/btl_openib_connect_oob.c:847:qp_create_one] error creating qp errno says Resource temporarily unavailable
[rs181][[21896,1],0][connect/btl_openib_connect_oob.c:1193:rml_recv_cb] error in endpoint reply start connect
[rs181:23477] [[21896,0],0]-[[21896,1],0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
Also, note that the core that is failing seems to be random (so it probably isn't a localized hardware issue), and that the failure occurs right at the end of establishing the ParaView client/server link, at ParaView initialization time. |
(0020182) Ken Moreland (manager) 2010-04-14 13:40 |
I am pretty sure I have traced this problem to vtkPVProgressHandler::CleanupSatellites. Process 0 receives a message from a few nodes, and then everything (but those nodes it received from) locks up. My unverified suspicion is that there is a bunch of unhandled asynchronous progress messages sent to process 0 that are filling up the MPI buffers and not allowing this Cleanup to finish. I also suspect that even when CleanupSatellites completes, there might be several unhandled messages left over. The method attempts to cancel the communication, but even a canceled communication can complete. I don't think that is ever checked. I need to talk this over with Utkarsh. |
(0020991) Utkarsh Ayachit (administrator) 2010-06-11 11:18 |
Ken is absolutely right (no surprise there ;)). The issue is indeed my interpretation of MPI_Test(). From the MPI documentation for MPI_Test: "For send operations, the only use of status is for MPI_Test_cancelled or in the case that there is an error, in which case the MPI_ERROR field of status will be set." However, the satellites are using it to determine if the message was received by the root, which is WRONG. This is resulting in the satellites choking the MPI communication channels with progress events. |
(0020993) Utkarsh Ayachit (administrator) 2010-06-11 12:03 |
commit 8d0c8aa5288b368e7d4193ad8424e19ea7a28104
Author: Utkarsh Ayachit <utkarsh.ayachit@kitware.com>
Date: Fri Jun 11 12:01:22 2010 -0400

Performance improvement for BUG 0010530. Ensuring that progresses are not sent anywhere unless a 2 sec timeout is passed. Reduces the frequency of progress events. There was a bug in vtkProcessModuleConnectionManager which was not initializing the self connection, consequently progress wasn't working in built-in mode. Fixed that as well. |
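For readers following the fix, the throttling idea in the commit message above can be sketched roughly as follows. This is not the code from the commit: `ProgressThrottle`, `ShouldSend`, and `CountSent` are hypothetical names made up for illustration, and the sketch assumes the timer starts when the throttle is created, so nothing is sent until the first full interval has elapsed. Time is passed in explicitly so the behavior can be checked without waiting on a real clock.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not a ParaView class): lets a progress event through
// only if at least `interval` has elapsed since the last event that was sent.
class ProgressThrottle {
public:
  using Clock = std::chrono::steady_clock;

  // Assumption: the timer starts at `start`, so events arriving before
  // `start + interval` are suppressed.
  ProgressThrottle(Clock::duration interval, Clock::time_point start)
    : Interval(interval), LastSent(start) {
    assert(interval > Clock::duration::zero());
  }

  // Returns true if a progress event observed at `now` should be sent.
  // When it returns true, the interval restarts from `now`.
  bool ShouldSend(Clock::time_point now) {
    if (now - LastSent < Interval) {
      return false;  // too soon: drop the event instead of queueing a send
    }
    LastSent = now;
    return true;
  }

private:
  Clock::duration Interval;
  Clock::time_point LastSent;
};

// Feeds `count` synthetic events, spaced `spacing` apart, through a throttle
// and returns how many of them would actually be sent.
int CountSent(int count, std::chrono::milliseconds spacing,
              std::chrono::milliseconds interval) {
  const ProgressThrottle::Clock::time_point start{};  // synthetic epoch
  ProgressThrottle throttle(interval, start);
  int sent = 0;
  for (int i = 1; i <= count; ++i) {
    if (throttle.ShouldSend(start + i * spacing)) {
      ++sent;
    }
  }
  return sent;
}
```

With a 2000 ms interval and events every 500 ms, `CountSent(10, ...)` lets 2 of the 10 events through. In a real satellite process, `ShouldSend(Clock::now())` would presumably guard the point where the progress message is posted, so that dropped events never reach the MPI layer at all.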
(0026499) Ken Moreland (manager) 2011-05-11 18:19 |
I believe we addressed the issue that caused this bug. We continue to perform scaling studies on interactive ParaView, but I do not think there is any further need for this bug. |
(0026513) Alan Scott (manager) 2011-05-16 21:50 |
I agree with Ken. Either the bug reported here is a problem with MPI (for which I have a workaround), or with IceT, for which we again have a replacement. |
Issue History | |||
Date Modified | Username | Field | Change |
2010-04-09 13:38 | Alan Scott | New Issue | |
2010-04-09 17:12 | Alan Scott | Note Added: 0020115 | |
2010-04-14 13:40 | Ken Moreland | Note Added: 0020182 | |
2010-04-14 13:40 | Ken Moreland | Status | backlog => tabled |
2010-04-14 13:40 | Ken Moreland | Assigned To | => Ken Moreland |
2010-04-14 13:41 | Ken Moreland | Relationship added | related to 0010261 |
2010-06-11 11:12 | Utkarsh Ayachit | Relationship added | related to 0010672 |
2010-06-11 11:18 | Utkarsh Ayachit | Note Added: 0020991 | |
2010-06-11 12:03 | Utkarsh Ayachit | Note Added: 0020993 | |
2010-09-01 11:27 | Utkarsh Ayachit | Target Version | 4.0 => 3.10.shortlist |
2011-05-11 18:19 | Ken Moreland | Note Added: 0026499 | |
2011-05-11 18:19 | Ken Moreland | Status | tabled => @80@ |
2011-05-11 18:19 | Ken Moreland | Resolution | open => fixed |
2011-05-16 21:50 | Alan Scott | Note Added: 0026513 | |
2011-05-16 21:50 | Alan Scott | Status | @80@ => closed |