Research Directions (current and past)

We are currently working on a number of research fronts, including topology-aware communication, MPI message queues and the matching engine, neighborhood collective communication, GPU-aware communication, high-performance communication for deep learning, communication/computation overlap and message progression, datatype processing, one-sided communication, in-network computing, and congestion control-aware communication, among others. We will report on these in the near future.

A summary of some of our past work is provided below, in no particular order:

  • Enhancing MPI Remote Memory Access Communication

Remote Memory Access (RMA) communication has been receiving a lot of attention due to the advantages of a one-sided communication model, such as the availability of direct remote-memory hardware support, no required involvement of the destination process, the decoupling of communication from synchronization, and its usefulness for certain classes of applications. MPI 3.1 has addressed a number of significant issues with the RMA model in MPI 2.2; yet, there are still challenges ahead. The research community is currently working on a number of directions to enhance both the standard and the implementation of the MPI RMA model.

Non-blocking Synchronization for MPI RMA

One of the issues with the current MPI RMA standard is its synchronization model, which can lead to serialization and latency propagation. We have proposed entirely non-blocking RMA synchronizations that allow processes to avoid waiting even in epoch-closing routines. Because the entire MPI RMA epoch can be non-blocking, MPI processes can issue their communications and move on immediately. Conditions are thus created for (1) enhanced communication/computation overlap, (2) enhanced communication/communication overlap, and (3) avoidance or mitigation of delay propagation via communication/delay overlap. The proposal provides contention avoidance in communication patterns that require back-to-back RMA epochs. It also addresses the previously documented inefficiency patterns, plus a fifth one, late unlock, which we introduce and document in our work.
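
For illustration, a minimal sketch of a standard passive-target epoch is shown below, using only standard MPI calls; the buffer and window names are illustrative. The comment marks the blocking epoch close that the proposed non-blocking synchronization would let a process issue and then immediately move past.

    /* Minimal sketch (assumes a window "win" already created over remote memory).
     * Illustrates the blocking epoch-closing call that the proposed non-blocking
     * synchronization is meant to avoid; buffer names are illustrative only. */
    #include <mpi.h>

    void update_remote(MPI_Win win, int target, const double *local_buf, int n)
    {
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);    /* open epoch     */
        MPI_Put(local_buf, n, MPI_DOUBLE, target,
                0 /* target displacement */, n, MPI_DOUBLE, win);
        /* MPI_Win_unlock blocks until the puts complete (at least locally);
         * with a fully non-blocking epoch close, the process could instead
         * issue the close and immediately overlap computation or further
         * communication, testing for completion later. */
        MPI_Win_unlock(target, win);                      /* blocking close */
    }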

Message Scheduling to Maximize RMA Communication Overlap

The one-sided communication model of MPI is based on the concept of an epoch. An epoch is a region enclosed by a pair of matching opening and closing synchronizations. Inside the epoch, the one-sided communications are always non-blocking. The epoch-closing synchronization is blocking and does not return until all the communications hosted by the epoch are complete, at least locally. The split-phase tandem created by the RMA communications and the blocking synchronization creates suitable conditions for communication/computation overlap. In this work, we have proposed a message scheduling scheme that interleaves inter-node and intra-node data transfers in a way that minimizes the overall latency of the RMA epoch. We fully exploit the overlap potential offered by the concurrent activity of the two engines embodied by RDMA (for network RMA) and the CPU (for intra-node RMA, in the absence of any I/O acceleration technology).

  • GPU-aware Communication in MPI

Equipping computing nodes in high-end computing systems with GPU accelerators has proven to be a promising approach to achieving higher performance, improved performance-per-watt, and better compute density. In such systems, processors may offload part of their computationally intensive workload to the GPUs. The results of such computations may then need to be communicated among processes on the same or other computing nodes. Therefore, processes with their data residing in GPU global memory require efficient support from the MPI library for high-performance communication. It has been shown that intra-node and inter-node communication among GPUs in such platforms plays an important role in the performance of scientific and engineering applications. We have proposed two design alternatives for a GPU-aware intra-node MPI_Allreduce operation (and other collectives, for that matter) that perform the reduction operations within the GPU and leverage CUDA IPC for communication among the processes involved in the collective operation.

  • I. Faraji and A. Afsahi, "GPU-Aware Intranode MPI_Allreduce", 21st EuroMPI conference, Kyoto, Japan, Sept. 9-12, 2014, pp. 45-50.
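
For context, the sketch below shows how an application would invoke a GPU-aware MPI_Allreduce directly on device buffers, assuming an MPI library built with CUDA support; the buffer size is illustrative. Internally, a design such as the one above can perform the reduction on the GPU and move data among intra-node processes through CUDA IPC rather than staging through host memory.

    /* Sketch of calling a GPU-aware MPI_Allreduce directly on device memory.
     * Assumes an MPI library built with CUDA support; sizes are illustrative. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        const int n = 1 << 20;
        double *d_in, *d_out;
        cudaMalloc((void **)&d_in,  n * sizeof(double));   /* GPU send buffer    */
        cudaMalloc((void **)&d_out, n * sizeof(double));   /* GPU receive buffer */
        /* ... launch kernels that produce d_in ... */

        /* A GPU-aware MPI accepts device pointers directly; internally the
         * library may reduce on the GPU and move data via CUDA IPC (intra-node)
         * instead of staging through host memory. */
        MPI_Allreduce(d_in, d_out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        cudaFree(d_in);
        cudaFree(d_out);
        MPI_Finalize();
        return 0;
    }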

In another work, we have evaluated the effect of the MPS service on GPU-to-GPU communication using CUDA IPC and host-staged approaches. We have shown that the MPS service is indeed beneficial when multiple inter-process communications are in flight. However, efficient design decisions are still required to further harness the potential of this service. To this aim, we have proposed two design alternatives for intra-node MPI_Allgather and MPI_Allreduce operations: a Static and a Dynamic approach. While the two approaches use different algorithms, both use a mix of host-staged and CUDA IPC copies in the design of the collectives.

  • I. Faraji and A. Afsahi, "Hyper-Q-Aware Intranode MPI Collectives on the GPU", International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), Austin, TX, Nov. 15, 2015.
  • Efficient Message Queues in MPI

A minimum of two message queues are required, both at the receive side, to support MPI communication operations: the unexpected message queue (UMQ) and the posted receive queue (PRQ). Message queues are exercised in point-to-point, collective, and even modern RDMA-based implementations of Remote Memory Access (RMA) operations. MPI message queues have been shown to grow proportionally to the job size for many applications. With such behavior, and knowing that message queues are used very frequently, ensuring fast queue operations at large scales is of paramount importance at current scales and in the upcoming exascale era. At the same time, a queue mechanism that is blind to memory requirements poses another scalability issue even if it solves the speed-of-operation problem.

In this work, we have proposed a scalable multidimensional queue traversal mechanism to provide fast and lean message queue management for MPI jobs at large scales. We resort to multiple decompositions of the search key. The proposal, built around a multidimensional data structure, exploits the characteristics of the context ID and rank components to considerably mitigate the effect of job size on queue search times. We have compared the runtime complexity and memory footprint of the proposed message queue data structure with those of linked-list and array-based designs.

  • J.A. Zounmevo and A. Afsahi, "An Efficient MPI Message Queue Mechanism for Large-scale Jobs", 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS), Singapore, Dec. 17-19, 2012, pp. 464-471. 
  • J.A. Zounmevo and A. Afsahi, "A Fast and Resource-Conscious MPI Message Queue Mechanism for Large-Scale Jobs", Future Generation Computer Systems, 30(1):265-290, Jan. 2014
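
The following simplified sketch illustrates the idea of decomposing the (context ID, rank) search key across the dimensions of a multidimensional structure, so that each dimension stays short regardless of job size. The structure, dimension width, and names are illustrative and far simpler than the published design.

    /* Simplified sketch of a multidimensional queue keyed by a decomposition of
     * (context_id, rank).  The published design is more elaborate; the structure
     * and names here are illustrative only. */
    #include <stdlib.h>

    #define DIM 256                     /* width of each decomposition dimension */

    typedef struct msg_elem {
        int tag;
        void *request;
        struct msg_elem *next;
    } msg_elem_t;

    typedef struct {                    /* leaf: short list of pending messages  */
        msg_elem_t *head;
    } leaf_t;

    typedef struct {
        leaf_t *rank_lo[DIM];           /* indexed by low bits of the rank       */
    } rank_dim_t;

    typedef struct {
        rank_dim_t *ctx[DIM];           /* indexed by (hashed) context id        */
    } msg_queue_t;

    /* Locate (allocating on demand) the short list for (context_id, rank):
     * the search cost depends on the dimension widths, not on the job size. */
    static msg_elem_t **queue_slot(msg_queue_t *q, int context_id, int rank)
    {
        int c = context_id % DIM;
        int r = rank % DIM;             /* higher rank bits would add dimensions */

        if (!q->ctx[c])             q->ctx[c] = calloc(1, sizeof(rank_dim_t));
        if (!q->ctx[c]->rank_lo[r]) q->ctx[c]->rank_lo[r] = calloc(1, sizeof(leaf_t));
        return &q->ctx[c]->rank_lo[r]->head;
    }
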
  • High-Performance Distributed Services

HPC distributed services, such as storage systems, are hosted on servers that span several nodes. They interact with clients that connect and disconnect as needed. Such distributed services require network transports that offer high bandwidth and low latency. However, unlike HPC programs, distributed services are persistent: they have no concept of completion. These services are typically written in user space and require user-space networking APIs. In order to reduce porting effort across modern networks, distributed services benefit from using a portable network API. All HPC platforms include an implementation of MPI as part of their software stack. Since MPI is one of the primary ways of programming these machines, the bundled MPI implementation is typically well optimized and routinely delivers maximum network performance. In this work, we have evaluated the use of MPI as a network portability layer for cross-application services.

  • J.A. Zounmevo, D. Kimpe, R. Ross, and A. Afsahi, "Using MPI in High-Performance Computing Services", 20th ACM EuroMPI Conference, Recent Advances in the Message Passing Interface, Madrid, Spain, Sept. 15-18, 2013, pp. 43-48.
  • J.A. Zounmevo, D. Kimpe, R. Ross, and A. Afsahi, "Extreme-Scale Computing Services over MPI: Experiences, Observations and Features Proposal for Next Generation Message Passing Interface", International Journal of High Performance Computing Applications, 28(4):435-449, Sept. 2014.
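
As an illustration of the usage model, the sketch below shows a minimal persistent service built on MPI's dynamic process interface, which matches the connect/disconnect behavior of service clients; error handling and the request-handling loop body are omitted.

    /* Minimal server-side sketch of a persistent service over MPI's dynamic
     * process interface; error handling and the service loop body are elided. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        char port[MPI_MAX_PORT_NAME];
        MPI_Comm client;

        MPI_Init(&argc, &argv);
        MPI_Open_port(MPI_INFO_NULL, port);          /* advertise a service port  */
        printf("service listening on %s\n", port);

        for (;;) {                                   /* persistent: no completion */
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
            /* ... serve requests arriving on "client" (e.g., MPI_Recv/MPI_Send,
             *     or RMA on a window exposed to the client) ... */
            MPI_Comm_disconnect(&client);            /* client departs cleanly    */
        }

        MPI_Close_port(port);                        /* not reached in this sketch */
        MPI_Finalize();
        return 0;
    }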

In another collaborative effort, an asynchronous remote procedure call (RPC) interface, Mercury, has been designed to serve as a basis for higher-level frameworks such as I/O forwarders, remote storage systems, or analysis frameworks that need to remotely exchange or operate on large data in a distributed environment.

  • J. Soumagne, D. Kimpe, J.A. Zounmevo, M. Chaarawi, Q. Koziol, A. Afsahi, and R. Ross, "Mercury: Enabling Remote Procedure Call for High-Performance Computing", 15th IEEE International Conference on Cluster Computing (Cluster), Indianapolis, IN, Sept. 23-27, 2013, pp. 1-8

  • MPI Topology Functions for Efficient Process Mapping

With emerging multi-core architectures and high-performance interconnects offering more parallelism and performance, parallel computing systems are becoming increasingly hierarchical in their node architecture and interconnection networks. Communication at different levels of the hierarchy exhibits different performance. It is therefore critical for communication libraries to efficiently handle the communication demands of HPC applications on such hierarchical systems. We have designed the MPI non-distributed topology functions for efficient process mapping over hierarchical clusters. We integrate the node's physical topology with the network architecture and use graph-embedding tools inside the MPI library to override the current trivial implementation of the topology functions and efficiently reorder the initial process mapping.

  • M.J. Rashti, J. Green, P. Balaji, A. Afsahi and W. Gropp, "Multi-core and Network Aware MPI Topology Functions", 18th EuroMPI conference, Recent Advances in the Message Passing Interface, Santorini, Greece, Sept. 18-21, 2011, Lecture Notes in Computer Science (LNCS) 6960, pp. 50-60.
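
The sketch below shows the non-distributed Cartesian topology interface that this work targets: when the reorder flag is set, a topology-aware implementation such as the one described above is free to remap ranks onto the node and network hierarchy. The 2-D grid is illustrative.

    /* Sketch of the (non-distributed) Cartesian topology interface whose
     * reorder flag lets an MPI library remap processes onto the node and
     * network hierarchy; the 2-D grid here is illustrative. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int size, dims[2] = {0, 0}, periods[2] = {0, 0};
        MPI_Comm cart;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Dims_create(size, 2, dims);             /* factor size into a 2-D grid */

        /* reorder = 1: the library may assign new ranks so that neighboring
         * grid processes land on the same node or nearby switches, which is
         * what a topology-aware implementation exploits. */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1 /* reorder */, &cart);

        /* ... communicate with MPI_Cart_shift neighbors on "cart" ... */

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }
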
  • Efficient Message Progression, Overlap, and Rendezvous Protocol

Overlap and Message Progression Ability

We have analyzed how MPI implementations support communication progress and communication/computation overlap on top of modern interconnects. This work contributes a better understanding of the ability of contemporary interconnects and their MPI implementations to support communication progress, overlap, and offload. Our study confirms that offload capability needs to be coupled with independent communication progress to increase the level of overlap.

  • M.J. Rashti and A. Afsahi, "Assessing the Ability of Computation/Communication Overlap and Communication Progress in Modern Interconnects", 15th Annual IEEE Symposium on High-Performance Interconnects (Hot Interconnects), Palo Alto, CA, Aug. 22-24, 2007, pp. 117-124.
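
The kind of pattern such a study measures is sketched below: a non-blocking send is overlapped with computation, and the MPI_Test call in the loop stands in for the manual progression that becomes necessary when the interconnect and MPI library do not progress messages independently. Names and loop bounds are illustrative.

    /* Sketch of the overlap pattern such studies measure: computation is placed
     * between a non-blocking send and its completion; the MPI_Test call inside
     * the loop drives progress when the library does not progress independently. */
    #include <mpi.h>

    void overlapped_send(const double *buf, int n, int dest, MPI_Comm comm)
    {
        MPI_Request req;
        int done = 0;

        MPI_Isend(buf, n, MPI_DOUBLE, dest, /* tag */ 0, comm, &req);

        for (int i = 0; i < 1000 && !done; i++) {
            /* ... one slice of application computation ... */
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* manual progression */
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);             /* ensure completion  */
    }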

A Speculative and Adaptive Rendezvous Protocol

Our earlier study showed that large message transfers do not progress independently, decreasing the chances of overlap in applications. This confirms that independent progress is required, at least for the data transfer, to achieve high overlap with non-blocking communication. We have proposed a novel speculative Rendezvous protocol that uses RDMA Read and RDMA Write to effectively improve communication progress and, consequently, overlap. In this proposal, an early-arriving receiver predicts the communication protocol based on its own local message size. If the predicted protocol is Rendezvous, a message similar to the RTS (which we call Request to Receive, or RTR), carrying the receive buffer address, is prepared and sent to the sender. At the sender side, if the Rendezvous protocol is chosen, the arriving RTR message is used to transfer the data to the receiver using RDMA Write. Otherwise, if the Eager protocol is chosen, the arriving RTR is simply discarded.

  • M.J. Rashti and A. Afsahi, "Improving Communication Progress and Overlap in MPI Rendezvous Protocol over RDMA-enabled Interconnects", 22nd International Symposium on High Performance Computing Systems and Applications (HPCS), Quebec City, QC, June 9-11, 2008, pp. 95-101
  • M.J. Rashti and A. Afsahi, "A Speculative and Adaptive MPI Rendezvous Protocol over RDMA-enabled Interconnects", International Journal of Parallel Programming, 37(2):223-246, 2009
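
A much-simplified sketch of the receiver-side speculation is given below: the early-arriving receiver predicts the protocol from its local message size and, if Rendezvous is predicted, builds an RTR carrying the receive buffer address. The struct layout, threshold value, and send_control() helper are illustrative stand-ins, not the actual implementation.

    /* Simplified sketch of the receiver-side speculation in the proposed
     * protocol.  Struct layout, threshold, and send_control() are illustrative. */
    #include <stdint.h>
    #include <stddef.h>

    #define EAGER_THRESHOLD (16 * 1024)   /* illustrative eager/rendezvous cutoff */

    typedef struct {
        uint64_t recv_addr;               /* registered receive buffer address    */
        uint64_t rkey;                    /* RDMA access key for that buffer      */
        uint32_t length;
        uint32_t tag;
    } rtr_msg_t;

    /* Hypothetical stand-in for sending a control message to a peer. */
    static void send_control(int peer, const void *msg, size_t len)
    {
        (void)peer; (void)msg; (void)len; /* transport-specific in a real library */
    }

    void on_early_receive(int peer, void *recv_buf, uint64_t rkey,
                          uint32_t length, uint32_t tag)
    {
        if (length <= EAGER_THRESHOLD)
            return;                       /* predicted Eager: sender will push it */

        /* Predicted Rendezvous: send the RTR so the sender can RDMA-Write the
         * payload directly into recv_buf.  If the sender actually chose Eager,
         * it simply discards this RTR. */
        rtr_msg_t rtr = { (uint64_t)(uintptr_t)recv_buf, rkey, length, tag };
        send_control(peer, &rtr, sizeof rtr);
    }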

An Asynchronous Message Progression Technique

In another work, we looked into potential issues with protocol-enhancement approaches to progressing large message transfers. One of these issues is the inability to handle the MPI_ANY_SOURCE scenario when the Rendezvous protocol is receiver-initiated. We have proposed a lightweight asynchronous message progression mechanism for large message transfers in the MPI Rendezvous protocol that is scenario-conscious and, consequently, overhead-free in cases where independent message progression happens naturally. Without requiring a dedicated thread, we take advantage of small bursts of CPU time to poll for message transfer conditions, borrowing those bursts from the existing application thread.

  • J.A. Zounmevo and A. Afsahi, "Investigating Scenario-conscious Asynchronous Rendezvous over RDMA", 13th IEEE International Conference on Cluster Computing (Cluster), Austin, TX, Sept. 26-30, 2011, pp. 542-546.
  • MPI Interoperability with Active Messages

Many new large-scale applications have emerged recently and become important in areas such as bioinformatics and social networks. These applications are often data-intensive and involve irregular communication patterns and complex operations on remote processes. In such algorithms, the receiver may not know how many messages to expect or even from which senders they will arrive; therefore, active messages are considered effective for parallelizing such nontraditional applications. In this collaborative effort, an active message framework inside MPI (on top of MPI RMA) has been developed to provide portability and programmability.

  • X. Zhao, D. Buntinas, J.A. Zounmevo, J. Dinan, D. Goodell, P. Balaji, R. Thakur, A. Afsahi, and W. Gropp, "Toward Asynchronous and MPI-Interoperable Active Messages", 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Delft, The Netherlands, May 13-16, 2013, pp. 87-94.
  • RDMA over Unreliable Datagrams

The current iWARP standard is defined only over reliable, connection-oriented transports. Such a protocol suffers from scalability issues in large-scale applications due to the memory requirements associated with maintaining multiple inter-process connections. In addition, some applications and data services do not require the reliability overhead, implementation complexity, and cost associated with connection-oriented transports such as TCP. Many datacenter and web-based applications that rely on datagram-based semantics (mostly through UDP/IP), such as stock-market trading and media streaming, therefore cannot take advantage of iWARP. We have proposed extending the iWARP standard on top of the User Datagram Protocol (UDP) in order to utilize the inherent scalability, low implementation cost, and minimal overhead of datagram protocols. We have provided guidelines and discussed the required extensions to different layers of the current iWARP standard in order to support the connectionless UDP transport. Our proposal is designed to co-exist with, and to be consistent and compatible with, the current connection-oriented iWARP.

  • M.J. Rashti, R.E. Grant, P. Balaji, and A. Afsahi, "iWARP Redefined: Scalable Connectionless Communication over High-Speed Ethernet", 17th International Conference on High Performance Computing (HiPC), Goa, India, Dec. 19-22, 2010, pp. 1-10.

In a follow-up work, we proposed RDMA Write-Record, to our knowledge the first design and implementation of an RDMA operation over an unreliable datagram transport. It can significantly increase iWARP performance and scalability and expand the application space that iWARP can serve to include some very network-intensive applications. It is designed to be extremely lightweight and to operate in environments in which packet loss occurs.

  • MPI Collective Communication

Parallel processes in scientific simulations compute on their local data while extensively communicating with each other through the MPI library. Such communication frequently involves MPI collective operations, in which a group of processes participates collectively. Previous studies of application usage show that the performance and scalability of MPI collective communication operations are critical to HPC applications.

Collective Communications on Multi-rail QsNetII Clusters

The Quadrics QsNetII network has native support for message striping over multiple rails, but only for large point-to-point messages through its Elan RDMA put/get, SHMEM put/get, and Tports send/receive functions. Rather than devising single-port (MPI) collectives that utilize the underlying striping facilities, we proposed and evaluated multi-port collective schemes with striping implemented directly at the Elan level using RDMA Write.

  • Y. Qian and A. Afsahi, "Efficient RDMA-based Multi-port Collectives on Multi-rail QsNetII Clusters", 6th Workshop on Communication Architecture for Clusters (CAC), Rhodes Island, Greece, Apr. 25-29, 2006, pp. 1-8.
  • Y. Qian and A. Afsahi, "High Performance RDMA-based Multi-port All-gather on Multi-rail QsNetII", 21st International Symposium on High Performance Computing Systems and Applications (HPCS), Saskatoon, SK, May 13-16, 2007.

Our earlier work in this area was not optimized for modern SMP clusters. Modern computing nodes all use multiple processor cores; intra-node communication is typically done through shared memory, while inter-node communication goes through the network. We have proposed and evaluated multi-port, RDMA-based, and shared-memory-aware all-gather algorithms with message striping. Some of the proposed algorithms overlap intra-node and inter-node communication and use multiple outstanding RDMAs to exploit concurrency. Moreover, data buffers are shared between inter-node and intra-node communication.

  • Y. Qian and A. Afsahi, "RDMA-based and SMP-aware Multi-port All-gather on Multi-rail QsNetII SMP Clusters", 36th International Conference on Parallel Processing (ICPP), Xi'an, China, Sept. 10-14, 2007.
  • Y. Qian and A. Afsahi, "Efficient Shared Memory and RDMA based Collectives on Multi-rail QsNetII SMP Clusters", Cluster Computing, The Journal of Networks, Software Tools and Applications, 11(4):341-354, 2008

Multi-connection and Multi-core Aware Collectives on InfiniBand clusters

This research targets both the parallelism available in multi-core nodes and the multi-connection capabilities of modern InfiniBand interconnects in collective design. In multi-core systems, each core runs at least one process with possible connections to other processes. It is therefore very important for the network interface card and its communication software to provide scalable performance for simultaneous communication over an increasing number of connections. We used multi-core processors for better system and network utilization, along with shared-memory communication and multi-connection network features, and devised a number of collective algorithms.

  • Y. Qian, M.J. Rashti, and A. Afsahi, "Multi-connection and Multi-core Aware All-Gather on InfiniBand Clusters", 20th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS), Orlando, FL, Nov. 16-18, 2008, pp. 1-7. Best Paper Award in the area of Software Systems and Tools.

Process Arrival Pattern Aware Collectives

Most research papers consider collective designs in a controlled environment. However, recent studies have shown that processes in real applications may arrive at collective calls at different times. This imbalanced process arrival pattern can significantly affect the performance of collective operations. We have proposed novel RDMA-based, process-arrival-pattern-aware MPI_Alltoall algorithms for InfiniBand clusters. We have also extended the algorithms to be shared-memory aware for small to medium message sizes.

  • Y. Qian and A. Afsahi, "Process Arrival Pattern and Shared Memory Aware Alltoall on InfiniBand", 16th EuroPVM/MPI Conference, Espoo, Finland, Sept. 7-10, 2009, Lecture Notes in Computer Science (LNCS) 5759, pp. 250-260.
  • Y. Qian and A. Afsahi, "Process Arrival Pattern Aware Alltoall and Allgather on InfiniBand Clusters", International Journal of Parallel Programming, 39(4):473-493, Aug. 2011

Offloading Non-blocking Collectives

Non-blocking collective operations have recently been proposed and included in the MPI-3 standard. Non-blocking collectives support the idea of communication/computation overlap and allow communication latency to be hidden by computation, effectively improving application performance. However, one of the most important factors in achieving a high level of overlap is the ability of the communication subsystem to make progress on the outstanding communication operations in the collective. Offloading is a well-known approach that moves communication processing to an intelligent network processor when possible. In this work, we focused on hiding the collective latency by efficiently offloading it to the networking hardware (the Mellanox CORE-Direct offloading technology), which allows a sequence of communication operations to be progressed by the network hardware without host intervention.

  • G. Inozemtsev and A. Afsahi, "Designing an Offloaded Nonblocking MPI_Allgather Collective using CORE-Direct", 14th IEEE International Conference on Cluster Computing (Cluster), Beijing, China, Sept. 24-28, 2012, pp. 477-485.
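
The usage pattern that offloaded non-blocking collectives target is sketched below with standard MPI-3 calls: the collective is issued, the host computes, and completion is checked later, so an offload-capable NIC can progress the operation without host intervention.

    /* Sketch of the usage pattern that offloaded non-blocking collectives target:
     * the collective is issued, the host computes, and completion is checked
     * later, so a CORE-Direct-style NIC can progress it without host help. */
    #include <mpi.h>

    void allgather_overlapped(const double *sendbuf, double *recvbuf,
                              int count, MPI_Comm comm)
    {
        MPI_Request req;

        MPI_Iallgather(sendbuf, count, MPI_DOUBLE,
                       recvbuf, count, MPI_DOUBLE, comm, &req);

        /* ... computation that does not depend on recvbuf ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* collective completed here */
    }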

Efficient Eager Protocol through Buffer Registration

In MPI implementations, an Eager protocol is used to eagerly transfer small messages to the receiver, avoiding the extra overhead of pre-negotiation. RDMA-based communication requires the source and destination buffers to be registered so that their pages cannot be swapped out before the DMA engine accesses them. Memory registration is an expensive process that involves buffer pin-down and virtual-to-physical address translation. We propose to register the frequently used application buffers so that we can initiate RDMA operations directly from the application buffers rather than from intermediate buffers (infrequently used buffers are treated as before). This way, we decrease the cost of communication by skipping the sender-side data copy.
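
A minimal sketch of the registration-cache idea is shown below, assuming a hypothetical register_with_nic() routine in place of the interconnect-specific registration call; the cache size, hashing, and eviction handling are greatly simplified.

    /* Minimal sketch of a registration cache for frequently used application
     * buffers: the first send from a buffer registers it; later sends reuse the
     * cached registration and can RDMA directly from the buffer, skipping the
     * sender-side copy.  register_with_nic() is a hypothetical stand-in. */
    #include <stddef.h>

    #define CACHE_SLOTS 64

    typedef struct {
        void  *addr;
        size_t len;
        void  *handle;                 /* NIC registration handle */
    } reg_entry_t;

    static reg_entry_t cache[CACHE_SLOTS];

    /* Hypothetical stand-in for the interconnect's registration routine. */
    static void *register_with_nic(void *addr, size_t len)
    {
        (void)len;
        return addr;                   /* placeholder handle */
    }

    void *get_registration(void *addr, size_t len)
    {
        size_t slot = ((size_t)addr >> 6) % CACHE_SLOTS;
        reg_entry_t *e = &cache[slot];

        if (e->addr == addr && e->len >= len)
            return e->handle;          /* hit: reuse previous registration */

        /* miss: register this application buffer and cache it (eviction of the
         * previous entry is elided in this sketch). */
        e->addr = addr;
        e->len = len;
        e->handle = register_with_nic(addr, len);
        return e->handle;
    }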

  • High-Performance Networking for Data Centers

QoS Provisioning for IP-based Protocols over InfiniBand

Current enterprise data centers typically use Ethernet networks. IP-based protocol stacks such as TCP/IP have been widely used by many applications in enterprise data centers. Such stacks have traditionally been known to incur high overheads. Consequently, high-speed networks, such as InfiniBand, have relied on alternative protocol stacks to allow applications to take advantage of the capabilities offered by the network, or offer virtual protocol interconnects to seamlessly integrate front-end Ethernet networking with back-end InfiniBand support. The management of network traffic is of great concern within modern networks. Quality of Service (QoS) provisioning can be used to regulate traffic such that, for example, intra-network traffic has priority over incoming inter-network traffic.

  • R.E. Grant, M.J. Rashti, and A. Afsahi, "An Analysis of QoS Provisioning for Sockets Direct Protocol vs. IPoIB over Modern InfiniBand Networks", International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2), Portland, OR, Sept. 12, 2008, pp. 79-86.

Virtual Protocol Interconnect for Data Centers

Virtual Protocol Interconnect (VPI) is a converged networking concept from Mellanox Technologies that allows an adapter to transparently switch between native InfiniBand (IB) mode and Ethernet mode without requiring manual reconfiguration. VPI facilitates the use of hybrid interconnect architectures in data centers, allowing the compute systems to interact with each other in native IB mode while interacting with remote clients in Ethernet mode. VPI also allows for easier deployment of non-Ethernet network technologies in data centers by providing a seamless socket-based interface over IB.

  • R.E. Grant, A. Afsahi, and P. Balaji, "Evaluation of ConnectX Virtual Protocol Interconnect for Data Centers", 15th IEEE International Conference on Parallel and Distributed Systems (ICPADS), Shenzhen, China, Dec. 8-11, 2009, pp. 57-64.

Hardware Assisted IP over InfiniBand for Data Centers

While hardware offload for the IP stack has existed for other networks, including Ethernet, for many years, the introduction of such capabilities (Large Send Offload and Large Receive Offload) for IB networks removes some of the barriers preventing the adoption of IB in enterprise data centers.

  • R.E. Grant, P. Balaji, and A. Afsahi, "A Study of Hardware Assisted IP over InfiniBand and its Impact on Data Center Performance", 10th IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), White Plains, NY, Mar. 28-30, 2010, pp. 144-153.
  • Power-aware High-Performance Computing

Reduced OS Noise Scheduling for Energy-efficiency

Power consumption has become an important design constraint in servers and high-performance server clusters. We have explored the power-performance efficiency of Hyper-Threaded (HT) Asymmetric Multiprocessor (AMP) servers and proposed a scheduling algorithm that can be used to reduce the overall power consumption of a server while maintaining a high level of performance. Our earlier work masks off a single logical/physical core for operating system tasks only and scales its frequency in order to save power, while running user threads on the remaining cores at maximum frequency. In another work, we proposed having one core in the system run at full clock speed, performing both OS and user tasks, while the remaining cores run user threads at lower operating frequencies.

  • R.E. Grant and A. Afsahi, "Power-Performance Efficiency of Asymmetric Multiprocessors for Multi-threaded Scientific Applications", 2nd Workshop on High-Performance, Power-Aware Computing (HP-PAC), Rhodes Island, Greece, Apr. 25-29, 2006, pp. 1-8
  • R.E. Grant and A. Afsahi, "Improving System Efficiency through Scheduling and Power Management", Invited paper, International Workshop on Green Computing, Austin, TX, September 17, 2007, pp. 478-479.
  • R.E. Grant and A. Afsahi, "Improving Energy Efficiency of Asymmetric Chip Multithreaded Multiprocessors through Reduced OS Noise Scheduling", Concurrency and Computation: Practice and Experience, 21(18):2355-2376, Dec 2009.
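
A minimal Linux sketch of the underlying scheduling idea is shown below: application threads are confined to cores other than the one reserved for OS activity, whose frequency could then be lowered. This is only an approximation of the proposed scheduler, and frequency scaling itself is platform-specific, so it is indicated only in a comment.

    /* Minimal Linux sketch: keep application threads off core 0 so that core can
     * be reserved for OS activity (and, on a suitable platform, run at a reduced
     * frequency).  Frequency scaling is platform-specific and not shown. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <unistd.h>

    int confine_to_non_os_cores(void)
    {
        long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
        cpu_set_t mask;

        if (ncpu < 2)
            return 0;                       /* nothing to confine on one core */

        CPU_ZERO(&mask);
        for (long c = 1; c < ncpu; c++)     /* skip core 0, reserved for the OS */
            CPU_SET((int)c, &mask);

        /* Pin the calling thread (and its children) to cores 1..N-1; core 0's
         * frequency could then be lowered, e.g. through the cpufreq interface. */
        return sched_setaffinity(0, sizeof(mask), &mask);
    }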

High-Performance Interconnects Feasibility Analysis for Power and Energy Efficiency

We have demonstrated the positive impact of modern interconnects on delivering energy efficiency in high-performance clusters. To that end, we presented the power-performance profiles of the Myrinet-2000 and Quadrics QsNetII networks at the user level and MPI level in comparison to a traditional, non-offloaded Gigabit Ethernet. We have also devised a power-aware MPI library that automatically and transparently performs message segmentation and re-assembly for point-to-point communications in order to boost the energy savings.

  • R. Zamani, A. Afsahi, Y. Qian, and C. Hamacher, "A Feasibility Analysis of Power-Awareness and Energy Minimization in Modern Interconnects for High-Performance Computing", 9th IEEE International Conference on Cluster Computing (Cluster), Austin, TX, Sept. 17-20, 2007, pp. 118-128.
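
A minimal sketch of the segmentation idea is shown below: a large point-to-point message is split into fixed-size segments, giving the library points at which its power-management policy can act. In the actual library this is done transparently inside MPI; the function name and segment size here are illustrative.

    /* Minimal sketch of transparent message segmentation for point-to-point
     * communication: a large send is split into fixed-size segments.  In the
     * power-aware library this happens inside MPI and is coupled with the
     * power-management policy; the segment size here is illustrative. */
    #include <mpi.h>

    void segmented_send(const char *buf, int nbytes, int dest, int tag, MPI_Comm comm)
    {
        const int seg = 64 * 1024;                    /* illustrative segment size */

        for (int off = 0; off < nbytes; off += seg) {
            int len = (nbytes - off < seg) ? (nbytes - off) : seg;
            /* Between segments the library has an opportunity to apply its
             * power-management decisions (e.g., adjusting power states). */
            MPI_Send(buf + off, len, MPI_CHAR, dest, tag, comm);
        }
        /* The receive side posts matching MPI_Recv calls and reassembles the
         * segments into the original buffer. */
    }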

Adaptive Estimation and Prediction of Power and Performance

To have an effective power management system in place, it is essential to model and estimate the runtime power of a computing system. Performance monitoring counters (PMCs), along with regression methods, are commonly used to model and estimate runtime power. However, architectural intuitions remain fundamental to the current models that relate a computing system's power to its PMCs. In an orthogonal approach, we examine such relationships from a stochastic perspective.

  • R. Zamani and A. Afsahi, "Adaptive Estimation and Prediction of Power and Performance in High Performance Computing", Journal of Computer Science - Research and Development, 25(3-4):177-186, Sept. 2010. Special Issue, International Conference on Energy-Aware High Performance Computing (ENA-HPC).

Power Modeling Using Hardware Performance Monitoring Counters

The foundation of many power/energy saving methods is power consumption models, which commonly rely on hardware performance monitoring counters. PMCs can monitor various events provided by processor manufacturers. In most processors, the number of PMC events is significantly larger than the number of events that can be measured simultaneously. Previously, architectural intuitions have guided the selection of PMCs for modeling the workload/power consumption of a system. However, it is not clear which PMC event "group" selection fits such power models best when multiple PMCs can be utilized simultaneously in a model. Therefore, a comprehensive study of PMC events with regard to power modeling is needed to understand and enhance such power models.

  • R. Zamani and A. Afsahi, "A Study of Hardware Performance Monitoring Counter Selection in Power Modeling of Computing Systems", 2nd International Workshop on Power Measurement and Profiling (PMP), San Jose, CA, June 5-8, 2012, pp. 1-10.
  • Workload Characterization

Characterizing scientific applications or parallel programming paradigms on state-of-the-art platforms allows a better understanding of where tuning must be done in order to improve performance. The bottleneck may be related to the parallel programming API or its implementation, the underlying architectural features, and/or the characteristics of the application under study.

Characterization of OpenMP Constructs and Applications

Understanding the performance and scalability of OpenMP constructs on specific systems is therefore critical to the development of efficient parallel programs. We have evaluated the performance of OpenMP constructs and application benchmarks on a 72-way Sun Fire 15K multiprocessor system, characterizing the basic OpenMP constructs using the EPCC microbenchmarks as well as the NAS OpenMP and SPEC OMP2001 benchmark suites.

  • N.R. Fredrickson, A. Afsahi, and Y. Qian, "Performance Characteristics of OpenMP Constructs, and Application Benchmarks on a Large Symmetric Multiprocessor", 17th Annual ACM International Conference on Supercomputing (ICS), San Francisco, CA, June 23-26, 2003, pp. 140-149.
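
An EPCC-style measurement is sketched below: an empty OpenMP construct is executed many times and its mean cost reported. The real microbenchmarks also subtract a sequential reference time; the repetition count here is illustrative.

    /* Simplified, EPCC-style sketch of measuring the overhead of an OpenMP
     * construct: time many empty "parallel" regions and report the mean cost. */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        const int reps = 10000;
        double t0 = omp_get_wtime();

        for (int i = 0; i < reps; i++) {
            #pragma omp parallel
            {
                /* empty body: only the construct's fork/join cost is measured */
            }
        }

        double per_region = (omp_get_wtime() - t0) / reps;
        printf("parallel construct overhead: %.3f us with %d threads\n",
               per_region * 1e6, omp_get_max_threads());
        return 0;
    }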

We have extended our earlier work to simultaneous multithreading (SMT) processors. Due to extensive resource sharing, Hyper-Threading (a form of SMT) may not suitably benefit OpenMP applications. On dual and quad HT-based Intel Xeon servers, we found that the overhead of OpenMP constructs with HT enabled is an order of magnitude larger than when HT is off. Our performance results with the NAS and SPEC OMPM2001 suites indicate that the majority of applications benefit from having a second thread in one-processor configurations. However, only a few applications enjoy a performance gain when HT is enabled on both processors. Data from hardware performance counters verifies that trace cache misses and the trace cache delivery rate are sources of the performance bottleneck.

  • R.E. Grant and A. Afsahi, "Characterization of Multithreaded Scientific Workloads on Simultaneous Multithreading Intel Processors", Workshop on Interaction between Operating System and Computer Architecture (IOSCA), Austin, TX, Oct. 6-8, 2005, pp. 13-19.

In another work, we targeted hybrid chip multithreaded SMPs. Such systems present new challenges as well as new opportunities to maximize performance. Our intention was to discover the optimal operating configuration of such systems for scientific applications and to identify the shared resources that might become a bottleneck to performance under the different hardware configurations. This knowledge will be useful to the research community in developing software techniques to improve the performance of shared memory programs on modern multi-core multiprocessors.

  • R.E. Grant and A. Afsahi, "A Comprehensive Analysis of Multithreaded OpenMP Applications on Dual-Core Intel Xeon SMPs", Workshop on Multithreaded Architectures and Applications (MTAAP), Long Beach, CA, Mar. 26-30, 2007, pp. 1-8.

Communication Characteristics of MPI Applications

Communication performance is an important factor affecting the performance of message-passing parallel applications running on clusters. A proper understanding of the communication behavior of parallel applications will help in designing better communication subsystems and MPI libraries in the future. It will also help application developers maximize their application performance on a target architecture. In this work, we examined the message-passing communication characteristics of the applications in the NAS Multi-Zone parallel benchmark suite as well as two applications from the SPEChpc suite.

  • R. Zamani and A. Afsahi, "Communication Characteristics of Message-Passing Scientific and Engineering Applications", 17th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS), Phoenix, AZ, Nov. 14-16, 2005, pp. 644-649.
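
Such characteristics are commonly collected through the MPI profiling (PMPI) interface; the sketch below intercepts MPI_Send to record message counts and volume before calling the underlying routine. Only one routine is wrapped, for brevity.

    /* Sketch of collecting message-passing characteristics through the MPI
     * profiling (PMPI) interface: intercept MPI_Send, record count and volume,
     * then call the real routine. */
    #include <mpi.h>
    #include <stdio.h>

    static long long n_sends = 0, bytes_sent = 0;

    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        int size;
        MPI_Type_size(type, &size);
        n_sends++;
        bytes_sent += (long long)count * size;
        return PMPI_Send(buf, count, type, dest, tag, comm);
    }

    int MPI_Finalize(void)
    {
        printf("MPI_Send calls: %lld, bytes: %lld\n", n_sends, bytes_sent);
        return PMPI_Finalize();
    }
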
  • Evaluation of High-Performance Interconnects and their Messaging Layers

In different phases of distributed computations, hosts may exchange a large number of short and long messages over the interconnection network. To achieve high performance on network-based computing systems, the interconnection network and the communication system software must provide mechanisms to support efficient communication. Meanwhile, in a distributed environment, applications usually run on top of a standard middleware such as the Message Passing Interface, which itself runs on top of a user-level messaging layer. To determine whether applications can benefit from a particular interconnect, it is essential to assess the various features and the performance of the interconnect at both the user and middleware levels. In our first study, we evaluated the Sun Fire Link interconnect from Sun Microsystems. Sun Fire Link is a memory-based interconnect with layered system software components that implement a mechanism for user-level messaging based on direct access to remote memory regions of other nodes, referred to as Remote Shared Memory (RSM). We assessed the performance of the interconnect at both the RSM and MPI levels.

  • A. Afsahi and Y. Qian, "Remote Shared Memory over Sun Fire Link Interconnect", 15th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS), Marina del Rey, CA, Nov. 3-5, 2003, pp. 381-386.
  • Y. Qian, A. Afsahi, N.R. Fredrickson, and R. Zamani, "Performance Evaluation of the Sun Fire Link SMP Clusters", International Journal of High Performance Computing and Networking, 4(5/6):209-221, 2006.
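
The MPI-level assessments in these studies typically build on ping-pong microbenchmarks; a minimal sketch is shown below, with the message size and repetition count chosen for illustration only.

    /* Simplified ping-pong sketch of the MPI-level latency test used in such
     * evaluations: ranks 0 and 1 bounce a message and report the one-way time. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int reps = 1000, size = 8;        /* message size in bytes */
        char *buf = malloc(size);
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)
            printf("%d-byte one-way latency: %.2f us\n",
                   size, (MPI_Wtime() - t0) / (2.0 * reps) * 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }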

In another work, we evaluated the performance of Myrinet two-port networks at the user level (GM) and the MPI level. The microbenchmarks were designed to assess the quality of the MPI implementation on top of GM.

  • Y. Qian, A. Afsahi, and R. Zamani, "Myrinet Networks: A Performance Study", 3rd IEEE International Symposium on Network Computing and Applications (NCA), Cambridge, MA, Aug. 30 - Sept. 1, 2004, pp. 323-328.
  • R. Zamani, Y. Qian, and A. Afsahi, "An Evaluation of the Myrinet/GM2 Two-Port Networks", 3rd IEEE Workshop on High-Speed Local Networks (HSLN), Tampa, FL, Nov. 16-18, 2004, pp. 734-742.

In another study, we assessed the potential of the NetEffect iWARP Ethernet for high-performance computing. The results show a significant improvement in Ethernet performance, as well as high multi-connection performance scalability.

  • M.J. Rashti and A. Afsahi, "10-Gigabit iWARP Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G", 7th Workshop on Communication Architecture for Clusters (CAC), Long Beach, CA, Mar. 26-30, 2007, pp. 1-8