Research Directions (current and past)
We are currently working on a number of research fronts, including topology-aware communication, MPI message queues and matching engines, neighborhood collective communication, GPU-aware communication, high-performance communication for deep learning, communication/computation overlap and message progression, datatype processing, one-sided communication, in-network computing, and congestion control-aware communication, among others. We will report on these in the near future.
A summary of some of our past work is provided below, in no particular order:
- Enhancing MPI Remote Memory Access Communication
Remote Memory Access (RMA) communication has been receiving considerable attention due to the advantages of a one-sided communication model, such as direct hardware support for remote memory access, no required involvement of the destination process, the decoupling of communication and synchronization, and its usefulness for certain classes of applications, among other things. MPI 3.1 addressed a number of significant issues with the RMA model of MPI 2.2; yet challenges remain. The research community is currently working on a number of directions to enhance both the standard and the implementation of the MPI RMA model.
Non-blocking Synchronization for MPI RMA
One of the issues with the current MPI RMA standard is its synchronization model, which can lead to serialization and latency propagation. We have proposed entirely non-blocking RMA synchronizations that allow processes to avoid waiting even in epoch-closing routines. Because the entire MPI RMA epoch can be non-blocking, MPI processes can issue their communications and move on immediately. Conditions are thus created for (1) enhanced communication/computation overlapping, (2) enhanced communication/communication overlapping, and (3) avoidance or mitigation of delay propagation via communication/delay overlapping. The proposal provides contention avoidance in communication patterns that require back-to-back RMA epochs. It also addresses all four known inefficiency patterns, plus a fifth one, late unlock, which we introduced and documented in our work.
- J.A. Zounmevo, X. Zhao, P. Balaji, W. Gropp, and A. Afsahi, "Nonblocking Epochs in MPI one-sided Communication", 2014 IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing 2014), New Orleans, LA, USA, Nov. 16-21, 2014, pp. 475-486. Best Paper Award Finalist
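The benefit of a non-blocking epoch close can be seen with a toy timing model (a Python sketch, not MPI code; the cost values are illustrative). With a blocking close, each epoch's communication latency adds to the critical path; with a non-blocking close, the process issues the transfer and immediately computes, so back-to-back epochs pipeline:

```python
def blocking_total(comm, comp):
    # Each epoch's closing synchronization blocks until its
    # communication completes, so latencies simply add up.
    return sum(c + w for c, w in zip(comm, comp))

def nonblocking_total(comm, comp):
    # With a non-blocking epoch close, the process issues the
    # communication and immediately starts computing; each epoch's
    # transfer overlaps the computation that follows it.
    t_cpu = 0.0   # time at which the process is free to compute
    t_net = 0.0   # time at which the network finishes outstanding transfers
    for c, w in zip(comm, comp):
        t_net = max(t_net, t_cpu) + c   # transfer starts once issued
        t_cpu += w                      # CPU moves on immediately
    return max(t_cpu, t_net)
```

For three epochs with equal communication and computation costs, the non-blocking variant hides essentially all of the communication behind computation.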
Message Scheduling to Maximize RMA Communication Overlap
The one-sided communication model of MPI is based on the concept of an epoch: a region enclosed by a pair of matching opening and closing synchronizations. Inside the epoch, the one-sided communications are always non-blocking. The epoch-closing synchronization is blocking and does not return until all the communications hosted by the epoch are completed, at least locally. This split-phase tandem of non-blocking RMA communications and blocking synchronization creates suitable conditions for communication/computation overlapping. In this work, we have proposed a message scheduling scheme that interleaves inter-node and intra-node data transfers in a way that minimizes the overall latency of the RMA epoch. We fully exploit the overlapping potential offered by the concurrent activity of the two transfer engines: RDMA (for network RMA) and the CPU (for intra-node RMA, in the absence of any I/O acceleration technology).
- J.A. Zounmevo and A. Afsahi, "Intra-Epoch Message Scheduling to Exploit Unused or Residual Overlapping Potential", 21st EuroMPI conference, Kyoto, Japan, Sept. 9-12, 2014, pp. 13-19.
- J.A. Zounmevo and A. Afsahi, "Exploiting Unused Middleware-level Parallelism with Large Payload Communication/Communication Overlapping in MPI One-sided Communication", International Journal of High Performance Computing Applications (under revision)
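The intuition behind intra-epoch scheduling can be captured with a toy Python cost model (illustrative only; it ignores issue overheads and contention). The NIC handles inter-node RDMA transfers while the CPU performs intra-node copies; if the two engines are driven concurrently, the epoch ends when the slower engine finishes rather than after the sum of all transfer times:

```python
def epoch_latency(transfers, interleave=True):
    """Toy model of an RMA epoch with two transfer engines:
    the NIC (RDMA, inter-node) and the CPU (memcpy, intra-node).
    `transfers` is a list of (kind, cost) pairs."""
    rdma = [t for kind, t in transfers if kind == "inter"]
    cpu = [t for kind, t in transfers if kind == "intra"]
    if not interleave:
        # Drive each transfer to completion before starting the
        # next, in posting order: costs simply accumulate.
        return sum(t for _, t in transfers)
    # Scheduled: post all RDMA transfers first so the NIC works in
    # the background, then perform the CPU copies concurrently; the
    # epoch closes when the slower engine is done.
    return max(sum(rdma), sum(cpu))
```

In this model, interleaving hides the intra-node copies entirely whenever the network transfers dominate.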
- GPU-aware Communication in MPI
Equipping the compute nodes of high-end systems with GPU accelerators has proven to be a promising approach to higher performance, improved performance-per-watt, and better compute density. In such systems, processors may offload part of their computationally intensive workload to the GPUs. The results of such computations may then need to be communicated among processes on the same or other nodes. Processes with data residing in GPU global memory therefore require efficient support from the MPI library for high-performance communication. Intra-node and inter-node communication among GPUs has been shown to play an important role in the performance of scientific and engineering applications on such platforms. We have proposed two design alternatives for a GPU-aware intra-node MPI_Allreduce operation (and other collectives, for that matter) that perform the reduction operations within the GPU and leverage CUDA IPC for communication among the processes involved in the collective.
- I. Faraji and A. Afsahi, "GPU-Aware Intranode MPI_Allreduce", 21st EuroMPI conference, Kyoto, Japan, Sept. 9-12, 2014, pp. 45-50.
In another work, we evaluated the effect of the NVIDIA MPS service on GPU-to-GPU communication using CUDA IPC and host-staged approaches. We showed that MPS is indeed beneficial when multiple inter-process communications are in flight; however, efficient design decisions are still required to fully harness the potential of this service. To this end, we proposed two design alternatives for intra-node MPI_Allgather and MPI_Allreduce operations: a Static and a Dynamic approach. While the two approaches use different algorithms, both employ a mix of host-staged and CUDA IPC copies in the design of the collectives.
- I. Faraji and A. Afsahi, "Hyper-Q-Aware Intranode MPI Collectives on the GPU", International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), Austin, TX, Nov. 15, 2015.
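The GPU-resident allreduce pattern can be sketched with a toy Python model (the lists stand in for per-process GPU buffers made mutually visible via CUDA IPC; the actual CUDA kernels and IPC handle exchange are omitted). Each process reduces one chunk by reading its peers' buffers directly, then all processes gather the reduced chunks:

```python
def intranode_allreduce(buffers):
    """Conceptual reduce-scatter + allgather allreduce among n
    intra-node processes. `buffers` models per-process GPU buffers
    reachable through CUDA IPC mappings."""
    n = len(buffers)
    length = len(buffers[0])
    chunk = (length + n - 1) // n
    reduced = [None] * n
    for p in range(n):  # process p owns chunk p of the result
        lo, hi = p * chunk, min((p + 1) * chunk, length)
        # Reduce directly out of the peers' (IPC-mapped) buffers.
        reduced[p] = [sum(buf[i] for buf in buffers) for i in range(lo, hi)]
    result = [x for part in reduced for x in part]
    # Allgather step: every process ends up with the full result.
    return [list(result) for _ in range(n)]
```

The Static and Dynamic designs in the paper differ in how such chunks and copy types (host-staged vs. IPC) are chosen; this sketch only shows the data movement pattern.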
- Efficient Message Queues in MPI
A minimum of two message queues, both at the receive side, are required to support MPI communication operations: the unexpected message queue (UMQ) and the posted receive queue (PRQ). Message queues are exercised by point-to-point, collective, and even modern RDMA-based implementations of Remote Memory Access (RMA) operations. MPI message queues have been shown to grow proportionally to the job size for many applications. Given this behavior, and knowing that message queues are used very frequently, ensuring fast queue operations at large scales is of paramount importance as we move toward the exascale era. At the same time, a queue mechanism that is blind to memory consumption poses a scalability issue of its own, even if it solves the speed-of-operation problem.
In this work, we have proposed a scalable multidimensional queue traversal mechanism that provides fast and lean message queue management for MPI jobs at large scales. We resort to multiple decompositions of the search key. The proposal, built around a multidimensional data structure, exploits the characteristics of the contextId and rank components of the key to considerably mitigate the effect of job size on queue search times. We have compared the runtime complexity and memory footprint of the proposed message queue data structure with those of linked-list and array-based designs.
- J.A. Zounmevo and A. Afsahi, "An Efficient MPI Message Queue Mechanism for Large-scale Jobs", 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS), Singapore, Dec. 17-19, 2012, pp. 464-471.
- J.A. Zounmevo and A. Afsahi, "A Fast and Resource-Conscious MPI Message Queue Mechanism for Large-Scale Jobs", Future Generation Computer Systems, 30(1):265-290, Jan. 2014
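The key idea of decomposing the search key can be illustrated with a minimal Python sketch (this simplifies the actual design: wildcards and the multi-level decomposition of the rank dimension are omitted). Instead of scanning one job-wide linked list, a lookup descends through the key's components and touches only one short queue:

```python
from collections import deque

class PostedReceiveQueue:
    """Sketch of a decomposed MPI posted-receive queue (PRQ): the
    search key is split into its communicator contextId and source
    rank, so matching an incoming message touches a single short
    deque instead of traversing a job-wide list."""

    def __init__(self):
        self.q = {}  # contextId -> rank -> deque of posted receives

    def post(self, context_id, rank, request):
        self.q.setdefault(context_id, {}) \
              .setdefault(rank, deque()).append(request)

    def match(self, context_id, rank):
        # Called on message arrival: pop the oldest matching posted
        # receive, or return None (message goes to the UMQ instead).
        bucket = self.q.get(context_id, {}).get(rank)
        return bucket.popleft() if bucket else None
```

With this layout, the search cost depends on the number of receives posted for one (contextId, rank) pair rather than on the job size.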
- High-Performance Distributed Services
HPC distributed services, such as storage systems, are hosted in servers that span several nodes. They interact with clients that connect or disconnect when they need to. Such distributed services require network transports that offer high bandwidth and low latency. However, unlike HPC programs, distributed services are persistent services as they bear no concept of completion. These services are typically written in user space and require user-space networking APIs. In order to reduce porting efforts over modern networks, distributed services benefit from using a portable network API. All HPC platforms include an implementation of MPI as part of their software stack. Since MPI is one of the primary ways of programming these machines, the bundled MPI implementation is typically well optimized and routinely delivers maximum network performance. In this work, we have evaluated the use of MPI as a network portability layer for cross-application services.
- J.A. Zounmevo, D. Kimpe, R. Ross, and A. Afsahi, "Using MPI in High-Performance Computing Services", 20th ACM EuroMPI Conference, Recent Advances in the Message Passing Interface, Madrid, Spain, Sept. 15-18, 2013, pp. 43-48.
- J.A. Zounmevo, D. Kimpe, R. Ross, and A. Afsahi, "Extreme-Scale Computing Services over MPI: Experiences, Observations and Features Proposal for Next Generation Message Passing Interface", International Journal of High Performance Computing Applications. 28(4):435-449, Sept. 2014
In another collaborative effort, an asynchronous remote procedure call (RPC) interface, Mercury, has been designed to serve as a basis for higher-level frameworks such as I/O forwarders, remote storage systems, or analysis frameworks that need to remotely exchange or operate on large data in a distributed environment.
- J. Soumagne, D. Kimpe, J.A. Zounmevo, M. Chaarawi, Q. Koziol, A. Afsahi, and R. Ross, "Mercury: Enabling Remote Procedure Call for High-Performance Computing", 15th IEEE International Conference on Cluster Computing (Cluster), Indianapolis, IN, Sept. 23-27, 2013, pp. 1-8
- Topology-aware Communication
With emerging multi-core architectures and high-performance interconnects offering more parallelism and performance, parallel computing systems are becoming increasingly hierarchical in their node architecture and interconnection networks. Communication at different levels of this hierarchy exhibits different performance. It is therefore critical for communication libraries to efficiently handle the communication demands of HPC applications on such hierarchical systems. We have designed the MPI non-distributed topology functions for efficient process mapping over hierarchical clusters. We integrate the physical node topology with the network architecture and use graph-embedding tools inside the MPI library to override the current trivial implementation of the topology functions and efficiently reorder the initial process mapping.
- M.J. Rashti, J. Green, P. Balaji, A. Afsahi and W. Gropp, "Multi-core and Network Aware MPI Topology Functions", 18th EuroMPI conference, Recent Advances in the Message Passing Interface, Santorini, Greece, Sept. 18-21, 2011, Lecture Notes in Computer Science (LNCS) 6960, pp. 50-60.
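A drastically simplified Python sketch of the reordering idea follows (the real implementation uses graph-embedding tools and the full node/network topology; this toy greedily co-locates the heaviest-communicating rank pairs on the same node so their traffic stays on fast intra-node paths):

```python
def reorder_ranks(comm_volume, ranks_per_node):
    """Greedy topology-aware mapping: comm_volume[i][j] is the data
    exchanged between ranks i and j; returns rank -> node."""
    n = len(comm_volume)
    nodes = (n + ranks_per_node - 1) // ranks_per_node
    fill = [0] * nodes
    node_of = {}

    def place(rank, preferred=None):
        # Try the preferred node first, then any node with room.
        order = ([preferred] if preferred is not None else []) + list(range(nodes))
        for node in order:
            if fill[node] < ranks_per_node:
                node_of[rank] = node
                fill[node] += 1
                return

    # Visit rank pairs in decreasing order of communication volume.
    pairs = sorted(((comm_volume[i][j], i, j)
                    for i in range(n) for j in range(i + 1, n)),
                   reverse=True)
    for _, i, j in pairs:
        if i not in node_of and j not in node_of:
            place(i)
            place(j, preferred=node_of[i])
        elif i in node_of and j not in node_of:
            place(j, preferred=node_of[i])
        elif j in node_of and i not in node_of:
            place(i, preferred=node_of[j])
    for r in range(n):  # ranks with no recorded traffic
        if r not in node_of:
            place(r)
    return node_of
```

For a four-rank job on two dual-rank nodes where ranks 0-2 and 1-3 exchange the most data, the mapping pairs them accordingly.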
- Efficient Message Progression, Overlap, and Rendezvous Protocol
Overlap and Message Progression Ability
We have analyzed how MPI implementations support communication progress and communication/computation overlap on top of modern interconnects. This work contributes a better understanding of the ability of contemporary interconnects and their MPI implementations to support communication progress, overlap, and offload. Our study confirms that offload ability needs to be paired with independent communication progress to increase the level of overlap.
- M.J. Rashti and A. Afsahi, "Assessing the Ability of Computation/Communication Overlap and Communication Progress in Modern Interconnects", 15th Annual IEEE Symposium on High-Performance Interconnects (Hot Interconnects), Palo Alto, CA, Aug. 22-24, 2007, pp. 117-124.
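Overlap ability is typically quantified from micro-benchmark timings. The following Python helper shows one common way to compute an overlap ratio (an illustrative formula, not necessarily the exact metric used in the paper): measure the blocking communication time alone, the computation time alone, and the total time of nonblocking-communication + computation + wait:

```python
def overlap_ratio(t_comm, t_comp, t_total):
    """Fraction of the shorter activity that was hidden behind the
    longer one. With perfect overlap, t_total == max(t_comm, t_comp);
    with no overlap, t_total == t_comm + t_comp."""
    hidden = t_comm + t_comp - t_total   # overlapped portion of the work
    return max(0.0, min(1.0, hidden / min(t_comm, t_comp)))
```

A ratio near 1.0 indicates the MPI library progressed the transfer independently while the application computed; a ratio near 0.0 indicates the transfer only progressed inside the wait call.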
A Speculative and Adaptive Rendezvous Protocol
Our earlier study showed that large message transfers do not progress independently, decreasing the chances of overlap in applications. This confirms that independent progress is required, at least for the data transfer, to achieve high overlap with non-blocking communication. We have proposed a novel speculative Rendezvous protocol that uses RDMA Read and RDMA Write to effectively improve communication progress and, consequently, overlap. In this proposal, an early-arriving receiver predicts the communication protocol based on its own local message size. If the predicted protocol is Rendezvous, a message similar to RTS (we call it Request to Receive, or RTR), containing the receiver's buffer address, is prepared and sent to the sender. At the sender side, if the Rendezvous protocol is chosen, the arrived RTR message is used to transfer the data to the receiver using RDMA Write. Otherwise, if the Eager protocol is chosen, the arrived RTR is simply discarded.
- M.J. Rashti and A. Afsahi, "Improving Communication Progress and Overlap in MPI Rendezvous Protocol over RDMA-enabled Interconnects", 22nd International Symposium on High Performance Computing Systems and Applications (HPCS), Quebec City, QC, June 9-11, 2008, pp. 95-101
- M.J. Rashti and A. Afsahi, "A Speculative and Adaptive MPI Rendezvous Protocol over RDMA-enabled Interconnects", International Journal of Parallel Programming, 37(2):223-246, 2009
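The speculative decision logic can be sketched in Python (a conceptual model only; the threshold value, message format, and return strings are illustrative, and the real protocol lives inside the MPI library over RDMA verbs):

```python
EAGER_THRESHOLD = 64 * 1024  # bytes; illustrative eager/rendezvous cutoff

def receiver_arrives(recv_size, recv_buf_addr):
    """An early-arriving receiver speculates on the protocol from its
    own local message size. If it predicts Rendezvous, it sends a
    Request-to-Receive (RTR) carrying its buffer address."""
    if recv_size > EAGER_THRESHOLD:
        return {"type": "RTR", "addr": recv_buf_addr, "size": recv_size}
    return None  # Eager expected; nothing to send ahead of time

def sender_decides(send_size, rtr):
    """The sender picks the protocol from its own message size. With a
    matching RTR in hand it can RDMA-Write the payload immediately,
    skipping the usual RTS/CTS round trip; a stale RTR is discarded."""
    if send_size <= EAGER_THRESHOLD:
        return "eager"  # any received RTR is simply discarded
    if rtr is not None:
        return "rdma_write_to_%x" % rtr["addr"]
    return "classic_rendezvous_rts_cts"  # receiver has not arrived yet
```

When the speculation is correct, the handshake latency is removed from the critical path; when it is wrong, the cost is limited to one discarded control message.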
An Asynchronous Message Progression Technique
In another work, we looked into potential issues with protocol-enhancement approaches to large-message transfer progression. One such issue is their inability to handle the MPI_ANY_SOURCE scenario when the Rendezvous is receiver-initiated. We have proposed a lightweight asynchronous message progression mechanism for large message transfers in the MPI Rendezvous protocol that is scenario-conscious and, consequently, overhead-free in cases where independent message progression happens naturally. Without requiring a dedicated thread, we take advantage of small bursts of CPU time to poll for message transfer conditions; the existing application thread is briefly borrowed to obtain those bursts.
- J.A. Zounmevo and A. Afsahi, "Investigating Scenario-conscious Asynchronous Rendezvous over RDMA", 13th IEEE International Conference on Cluster Computing (Cluster), Austin, TX, Sept. 26-30, 2011, pp. 542-546.
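The thread-free progression idea can be illustrated with a Python sketch (the `do_work` and `poll_progress` callbacks are hypothetical stand-ins for application computation and the middleware's progress engine; real progression happens inside the MPI library):

```python
def compute_with_progression(work_items, do_work, poll_progress,
                             poll_every=16):
    """Asynchronous progression without a dedicated thread: the
    application's own thread is briefly borrowed every `poll_every`
    work items to poll for message transfer conditions, giving large
    Rendezvous transfers the small CPU bursts they need to advance."""
    results = []
    for k, item in enumerate(work_items):
        results.append(do_work(item))
        if k % poll_every == poll_every - 1:
            poll_progress()  # cheap no-op when no transfer is pending
    poll_progress()          # final poll before returning
    return results
```

Because polling is skipped when no transfer is in flight, the mechanism adds no overhead in scenarios where independent progression happens naturally.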
- MPI Interoperability with Active Messages
Many new large-scale applications have emerged recently and become important in areas such as bioinformatics and social networks. These applications are often data-intensive and involve irregular communication patterns and complex operations on remote processes. In such algorithms, the receiver may not know how many messages to expect, or even from which senders to expect them; active messages are therefore considered effective for parallelizing such nontraditional applications. In this collaborative effort, an active messages framework inside MPI (on top of MPI RMA) has been developed to provide portability and programmability.
- X. Zhao, D. Buntinas, J.A. Zounmevo, J. Dinan, D. Goodell, P. Balaji, R. Thakur, A. Afsahi, and W. Gropp, "Towards Asynchronous, and MPI-Interoperable Active Messages", 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Delft, The Netherlands, May 13-16, 2013, pp. 87-94.
- RDMA over Unreliable Datagrams
The current iWARP standard is defined only over reliable, connection-oriented transports. Such a protocol suffers from scalability issues in large-scale applications due to the memory requirements of maintaining many inter-process connections. In addition, some applications and data services do not require the reliability, implementation complexity, and cost associated with connection-oriented transports such as TCP. Consequently, many datacenter and web-based applications that rely on datagram semantics (mostly through UDP/IP), such as stock-market trading and media-streaming applications, cannot take advantage of iWARP at all. We have proposed extending the iWARP standard on top of the User Datagram Protocol (UDP) in order to exploit the inherent scalability, low implementation cost, and minimal overhead of datagram protocols. We provide guidelines and discuss the required extensions to the different layers of the current iWARP standard to support the connectionless UDP transport. Our proposal is designed to co-exist with, and to be consistent and compatible with, the current connection-oriented iWARP.
- M.J. Rashti, R.E. Grant, P. Balaji, and A. Afsahi, "iWARP Redefined: Scalable Connectionless Communication over High-Speed Ethernet", 17th International Conference on High Performance Computing (HiPC), Goa, India, Dec. 19-22, 2010, pp. 1-10.
In a follow-up work, we proposed RDMA Write-Record, to our knowledge the first RDMA operation designed over an unreliable datagram transport. It can significantly increase iWARP performance and scalability and expand the application space iWARP can serve to include some very network-intensive applications. It is designed to be extremely lightweight and to operate in environments where packet loss occurs.