Research Directions (current and past)
We are currently working on a number of research fronts, including topology-aware communication, MPI message queues and matching engines, neighborhood collective communication, GPU-aware communication, high-performance communication for deep learning, communication/computation overlap and message progression, datatype processing, one-sided communication, in-network computing, and congestion-control-aware communication, among others. We will report on these efforts in the near future.
A summary of some of our past work is provided below, in no particular order:
- Enhancing MPI Remote Memory Access Communication
Remote Memory Access (RMA) communication has received considerable attention due to the advantages of a one-sided communication model, such as direct hardware support for remote memory access, no required involvement at the destination, the decoupling of communication from synchronization, and its usefulness for certain classes of applications. MPI 3.1 addressed a number of significant issues with the RMA model of MPI 2.2; yet, challenges remain. The research community is currently working in several directions to enhance both the standard and the implementation of the MPI RMA model.
Non-blocking Synchronization for MPI RMA
One of the issues with the current MPI RMA standard is its synchronization model, which can lead to serialization and latency propagation. We have proposed entirely non-blocking RMA synchronizations that allow processes to avoid waiting even in epoch-closing routines. Because the entire MPI RMA epoch can be non-blocking, MPI processes can issue their communications and move on immediately. Conditions are thus created for (1) enhanced communication/computation overlapping, (2) enhanced communication/communication overlapping, and (3) delay propagation avoidance or mitigation via communication/delay overlapping. The proposal provides contention avoidance in communication patterns that require back-to-back RMA epochs. It also addresses the known inefficiency patterns, plus a fifth one, late unlock, which is introduced and documented in our work.
- J.A. Zounmevo, X. Zhao, P. Balaji, W. Gropp, and A. Afsahi, "Nonblocking
Epochs in MPI one-sided Communication", 2014 IEEE/ACM International
Conference for High Performance Computing, Networking, Storage and Analysis
(Supercomputing 2014), New Orleans, LA, USA, Nov. 16-21, 2014, pp.
475-486. Best Paper Award Finalist
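For reference, the sketch below shows a minimal passive-target epoch as it stands in the current standard, with the blocking epoch-closing call that the non-blocking proposal targets; the window, buffer, and count are illustrative.

    #include <mpi.h>

    /* Minimal MPI-3 passive-target RMA epoch with the standard blocking close. */
    void put_to_target(MPI_Win win, int target, const double *src, int count)
    {
        /* Open an access epoch on the target. */
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);

        /* One-sided communication inside the epoch is non-blocking. */
        MPI_Put(src, count, MPI_DOUBLE, target, 0 /* target displacement */,
                count, MPI_DOUBLE, win);

        /* Epoch-closing synchronization: blocks until the put is complete,
         * at least locally; this is where waiting and delay propagation can
         * occur, and what a non-blocking closing routine would avoid. */
        MPI_Win_unlock(target, win);
    }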
Message Scheduling to Maximize RMA Communication
Overlap
The one-sided communication model of MPI is based on the concept of an epoch. An epoch is a region enclosed by a pair of matching opening and closing synchronizations. Inside the epoch, the one-sided communications are always non-blocking. The epoch-closing synchronization is blocking and does not return until all the communications hosted by the epoch are complete, at least locally. The split-phase tandem formed by the non-blocking RMA communications and the blocking closing synchronization creates adequate conditions for communication/computation overlapping. In this work, we have proposed a message scheduling scheme that interleaves inter-node and intra-node data transfers in a way that minimizes the overall latency of the RMA epoch. We fully exploit the overlapping potential offered by the concurrent activity of the two engines embodied by RDMA (for network RMA) and the CPU (for intra-node RMA, in the absence of any I/O acceleration technology).
- J.A. Zounmevo and A. Afsahi, "Intra-Epoch Message Scheduling to Exploit
Unused or Residual Overlapping Potential", 21st EuroMPI conference,
Kyoto, Japan, Sept. 9-12, 2014, pp. 13-19.
- J.A. Zounmevo and A. Afsahi, "Exploiting Unused
Middleware-level Parallelism with Large Payload Communication/Communication
Overlapping in MPI One-sided Communication", International Journal of
High Performance Computing Applications (under revision)
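As an illustration of that split-phase structure, the sketch below shows a fence-based epoch with one inter-node and one intra-node target and local work issued before the closing fence; the buffer, peers, and the helper do_local_computation are illustrative placeholders.

    #include <mpi.h>

    void do_local_computation(void);   /* application work, declared elsewhere */

    /* Active-target epoch: non-blocking puts, blocking closing fence.
     * Work issued between the puts and the closing fence can overlap with
     * the transfers, driven by the RDMA engine for the inter-node target
     * and by the CPU for the intra-node target. */
    void exchange(MPI_Win win, const double *buf, int count,
                  int internode_peer, int intranode_peer)
    {
        MPI_Win_fence(0, win);                       /* open the epoch */

        MPI_Put(buf, count, MPI_DOUBLE, internode_peer, 0, count,
                MPI_DOUBLE, win);
        MPI_Put(buf, count, MPI_DOUBLE, intranode_peer, 0, count,
                MPI_DOUBLE, win);

        do_local_computation();                      /* overlap window */

        MPI_Win_fence(0, win);                       /* blocking close */
    }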
- GPU-aware Communication in MPI
Equipping the computing nodes of high-end systems with GPU accelerators has been shown to be a promising approach to achieving higher performance, improved performance-per-watt, and better compute density. In such systems, processors may offload part of their computationally intensive workload to the GPUs. The results of such computations may then need to be communicated among processes on the same or other computing nodes. Therefore, processes whose data resides in GPU global memory require efficient support from the MPI library for high-performance communication. It has been shown that intra-node and inter-node communication among GPUs in such platforms plays an important role in the performance of scientific and engineering applications. We have proposed two design alternatives for a GPU-aware intra-node MPI_Allreduce operation (and other collectives, for that matter) that perform the reduction operations within the GPU and leverage CUDA IPC for communication among the processes involved in the collective operation.
- I. Faraji and A. Afsahi, "GPU-Aware Intranode MPI_Allreduce", 21st
EuroMPI conference, Kyoto, Japan, Sept. 9-12, 2014, pp. 45-50.
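CUDA IPC lets one process on a node map a device buffer allocated by another process, so GPU-resident data can be copied GPU-to-GPU without staging through host memory. The sketch below shows the basic handle exchange over MPI between two ranks on the same node; buffer sizes and error handling are omitted for brevity.

    #include <stddef.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Rank 0 exports a device buffer via CUDA IPC; rank 1 (same node) maps it
     * and copies from it device-to-device, with no host-staged copy. */
    void ipc_copy_example(int rank, double *d_local, size_t bytes)
    {
        if (rank == 0) {
            cudaIpcMemHandle_t handle;
            cudaIpcGetMemHandle(&handle, d_local);
            MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            cudaIpcMemHandle_t handle;
            void *d_remote = NULL;
            MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            cudaIpcOpenMemHandle(&d_remote, handle,
                                 cudaIpcMemLazyEnablePeerAccess);
            cudaMemcpy(d_local, d_remote, bytes, cudaMemcpyDeviceToDevice);
            cudaIpcCloseMemHandle(d_remote);
        }
    }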
In another work, we evaluated the effect of the NVIDIA Multi-Process Service (MPS) on GPU-to-GPU communication using CUDA IPC and host-staged approaches. We have shown that the MPS service is indeed beneficial when multiple inter-process communications take place concurrently. However, careful design decisions are still required to further harness the potential of this service. To this aim, we have proposed two design alternatives for intra-node MPI_Allgather and MPI_Allreduce operations: a Static and a Dynamic approach. While the two approaches use different algorithms, they both use a mix of host-staged and CUDA IPC copies in the design of the collectives.
- I. Faraji and A. Afsahi, "Hyper-Q-Aware Intranode MPI Collectives on the GPU", International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), Austin, TX, Nov. 15, 2015.
- Efficient Message Queues in MPI
A minimum of two message queues are required, both at the receive side, to support MPI communication operations: the unexpected message queue (UMQ) and the posted receive queue (PRQ). Message queues are exercised by point-to-point, collective, and even modern RDMA-based implementations of Remote Memory Access (RMA) operations.
MPI message queues have been shown to grow proportionally to the job size for many applications. Given this behavior, and knowing that message queues are used very frequently, ensuring fast queue operations at large scale is of paramount importance on the path to exascale computing. At the same time, a queue mechanism that is oblivious to memory requirements poses another scalability issue even if it solves the speed-of-operation problem.
In this work, we have proposed a scalable multidimensional queue traversal mechanism to provide fast and lean message queue management for MPI jobs at large scales. We resort to multiple decompositions of the search key. The proposal, built around a multidimensional data structure, exploits the characteristics of the contextId and rank components to considerably mitigate the effect of job size on queue search times. We have compared the runtime complexity and memory footprint of the proposed message queue data structure with those of linked-list and array-based designs.
- J.A. Zounmevo and A. Afsahi, "An Efficient MPI Message Queue Mechanism for Large-scale Jobs", 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS), Singapore, Dec. 17-19, 2012, pp. 464-471.
- J.A. Zounmevo and A. Afsahi, "A Fast and Resource-Conscious MPI Message
Queue Mechanism for Large-Scale Jobs", Future Generation Computer
Systems, 30(1):265-290, Jan. 2014
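As a rough illustration of the key-decomposition idea (not the published data structure; all names here are illustrative), a queue can be indexed first by contextId and then by source rank, so that a search only walks a short per-(context, rank) list instead of one global list:

    #include <stddef.h>

    /* Illustrative two-level message-queue index: the first dimension is the
     * communicator contextId, the second is the source rank.  Each (context,
     * rank) cell keeps a short list searched only by tag, so search cost no
     * longer grows with the total number of queued entries. */
    typedef struct queue_item {
        int tag;
        void *request;
        struct queue_item *next;
    } queue_item_t;

    typedef struct rank_bucket {
        int rank;
        queue_item_t *items;
        struct rank_bucket *next;
    } rank_bucket_t;

    typedef struct ctx_bucket {
        int context_id;
        rank_bucket_t *ranks;
        struct ctx_bucket *next;
    } ctx_bucket_t;

    static queue_item_t *queue_find(ctx_bucket_t *q, int ctx, int rank, int tag)
    {
        for (ctx_bucket_t *c = q; c != NULL; c = c->next) {
            if (c->context_id != ctx)
                continue;
            for (rank_bucket_t *r = c->ranks; r != NULL; r = r->next) {
                if (r->rank != rank)
                    continue;
                for (queue_item_t *it = r->items; it != NULL; it = it->next)
                    if (it->tag == tag)          /* wildcard handling omitted */
                        return it;
                return NULL;
            }
            return NULL;
        }
        return NULL;
    }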
- High-Performance Distributed Services
HPC distributed services, such as storage systems, are hosted on servers that span several nodes and interact with clients that connect and disconnect as needed. Such distributed services require network transports that offer high bandwidth and low latency. However, unlike HPC applications, distributed services are persistent: they have no concept of completion. These services are typically written in user space and require user-space networking APIs. To reduce porting effort across modern networks, distributed services benefit from using a portable network API. All HPC platforms include an implementation of MPI as part of their software stack. Since MPI is one of the primary ways of programming these machines, the bundled MPI implementation is typically well optimized and routinely delivers maximum network performance. In this work, we have evaluated the use of MPI as a network portability layer for cross-application services.
- J.A. Zounmevo, D. Kimpe, R. Ross, and A.
Afsahi, "Using MPI in High-Performance Computing Services", 20th
ACM EuroMPI Conference, Recent Advances in the Message Passing
Interface, Madrid, Spain, Sept. 15-18, 2013, pp. 43-48.
- J.A. Zounmevo, D. Kimpe, R. Ross, and A. Afsahi, "Extreme-Scale
Computing Services over MPI: Experiences, Observations and Features Proposal
for Next Generation Message Passing Interface", International Journal of
High Performance Computing Applications. 28(4):435-449, Sept. 2014
In another collaborative effort, an asynchronous remote procedure call (RPC)
interface, Mercury, has been designed to serve as a basis for higher-level
frameworks such as I/O forwarders, remote storage systems, or analysis
frameworks that need to remotely exchange or operate on large data in a
distributed environment.
- J. Soumagne, D. Kimpe, J.A. Zounmevo, M. Chaarawi, Q. Koziol, A. Afsahi,
and R. Ross, "Mercury: Enabling Remote Procedure Call for High-Performance
Computing", 15th IEEE International Conference on Cluster
Computing (Cluster), Indianapolis, IN, Sept. 23-27, 2013, pp. 1-8
- Topology-aware Communication
With emerging multi-core architectures and high-performance interconnects offering more parallelism and performance, parallel computing systems are becoming increasingly hierarchical in their node architecture and interconnection networks. Communication at different levels of the hierarchy exhibits different performance. It is therefore critical for communication libraries to efficiently handle the communication demands of HPC applications on such hierarchical systems. We have designed the MPI non-distributed topology functions for efficient process mapping over hierarchical clusters. We have integrated the physical node topology with the network architecture and used graph-embedding tools inside the MPI library to override the current trivial implementation of the topology functions and efficiently reorder the initial process mapping.
- M.J. Rashti, J. Green, P. Balaji, A. Afsahi and W. Gropp, "Multi-core
and Network Aware MPI Topology Functions", 18th EuroMPI
conference, Recent Advances in the Message Passing Interface, Santorini,
Greece, Sept. 18-21, 2011, Lecture Notes in Computer Science (LNCS)
6960, pp. 50-60.
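The entry point for such reordering is the reorder flag of the MPI topology functions; with a topology-aware implementation behind it, the returned communicator can carry a rank order matched to the node and network hierarchy. A minimal usage sketch (the 2-D shape is illustrative):

    #include <mpi.h>

    /* Create a 2-D periodic Cartesian topology and allow the library to
     * reorder ranks (reorder = 1) to match the physical node/network layout. */
    void make_cart(MPI_Comm *cart_out)
    {
        int nprocs;
        int dims[2] = {0, 0};
        int periods[2] = {1, 1};

        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Dims_create(nprocs, 2, dims);
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1 /* reorder */,
                        cart_out);
        /* Neighbor communication should use cart_out; ranks may differ from
         * MPI_COMM_WORLD when the implementation performed a remapping. */
    }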
- Efficient Message Progression, Overlap, and Rendezvous Protocol
Overlap and Message Progression Ability
We analyze how MPI implementations support communication progress and communication/computation overlap on top of modern interconnects. This work contributes a better understanding of the ability of contemporary interconnects and their MPI implementations to support communication progress, overlap, and offload. Our study confirms that the offload ability needs to be supported with independent communication progress to increase the level of overlap.
- M.J. Rashti and A. Afsahi, "Assessing the Ability of
Computation/Communication Overlap and Communication Progress in Modern
Interconnects", 15th Annual IEEE Symposium on High-Performance
Interconnects (Hot Interconnects), Palo Alto, CA, Aug. 22-24, 2007, pp.
117-124.
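A common way to quantify such overlap (a simplified sketch of the usual micro-benchmark pattern; do_computation and the message size are placeholders) is to post a large non-blocking transfer, compute, and only then wait:

    #include <mpi.h>

    void do_computation(void);   /* host-side work with no MPI calls inside */

    /* Overlap probe: if the library progresses the transfer independently
     * (e.g., through offload), the measured time approaches
     * max(t_comm, t_comp) rather than t_comm + t_comp. */
    double probe_overlap(const char *buf, int count, int peer)
    {
        MPI_Request req;
        double t0 = MPI_Wtime();

        MPI_Isend(buf, count, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req);
        do_computation();
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        return MPI_Wtime() - t0;
    }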
A Speculative and Adaptive Rendezvous Protocol
Our earlier study showed that large-message transfers do not progress independently, decreasing the chances of overlap in applications. This confirms that independent progress is required, at least for the data transfer, to achieve high overlap with non-blocking communication. We have proposed a novel speculative Rendezvous protocol that uses RDMA Read and RDMA Write to effectively improve communication progress and, consequently, the overlap ability. In this proposal, the early-arriving receiver predicts the communication protocol based on its own local message size. If the predicted protocol is Rendezvous, a message similar to an RTS (we call it a Request to Receive, or RTR), including the receiver buffer address, is prepared and sent to the sender. At the sender side, if the Rendezvous protocol is chosen, the arrived RTR message is used to transfer the data to the receiver using RDMA Write; otherwise, if the Eager protocol is chosen, the arrived RTR is simply discarded.
- M.J. Rashti and A. Afsahi, "Improving Communication Progress and Overlap
in MPI Rendezvous Protocol over RDMA-enabled Interconnects", 22nd
International Symposium on High Performance Computing Systems and
Applications (HPCS), Quebec City, QC, June 9-11, 2008, pp. 95-101
- M.J. Rashti and A. Afsahi, "A Speculative and Adaptive MPI Rendezvous
Protocol over RDMA-enabled Interconnects", International Journal of
Parallel Programming, 37(2):223-246, 2009
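A simplified sketch of that receiver-side speculation is shown below; the threshold value and the helper routines send_rtr and post_recv_eager are hypothetical placeholders standing in for middleware internals, not the actual implementation.

    #include <stddef.h>

    /* Hypothetical helpers standing in for the MPI library internals. */
    void send_rtr(int src, int tag, void *buf, size_t len);
    void post_recv_eager(int src, int tag, void *buf, size_t len);

    #define EAGER_THRESHOLD (16 * 1024)   /* illustrative protocol cutoff */

    /* Early-arriving receiver: predict the protocol from the local message
     * size.  If Rendezvous is predicted, advertise the receive buffer with
     * an RTR so the sender can finish the transfer with RDMA Write; if the
     * sender actually chose Eager, it simply discards the RTR. */
    void on_receive_posted(void *recv_buf, size_t local_size, int src, int tag)
    {
        if (local_size > EAGER_THRESHOLD)
            send_rtr(src, tag, recv_buf, local_size);
        else
            post_recv_eager(src, tag, recv_buf, local_size);
    }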
An Asynchronous Message Progression Technique
In another work, we looked into potential issues with protocol-enhancement approaches to large-message transfer progression. One of the issues is their inability to handle the MPI_ANY_SOURCE scenario when the Rendezvous protocol is receiver-initiated. We have proposed a lightweight asynchronous message progression mechanism for large message transfers in the MPI Rendezvous protocol that is scenario-conscious and consequently overhead-free in cases where independent message progression happens naturally. Without requiring a dedicated thread, we take advantage of small bursts of CPU time to poll for message transfer conditions. The existing application thread is parasitized to obtain those small bursts of CPU time.
- J.A. Zounmevo and A. Afsahi, "Investigating Scenario-conscious
Asynchronous Rendezvous over RDMA", 13th IEEE International
Conference on Cluster Computing (Cluster), Austin, TX, Sept. 26-30,
2011, pp. 542-546.
- MPI Interoperability with Active Messages
Many new large-scale applications have emerged recently and become important in areas such as bioinformatics and social networks. These applications are often data-intensive and involve irregular communication patterns and complex operations on remote processes. In such algorithms, the receiver may not know how many messages to expect, or even from which senders to expect them; therefore, active messages are considered effective for parallelizing such nontraditional applications. In this collaborative effort, an active-message framework inside MPI (on top of MPI RMA) has been developed to provide portability and programmability.
- X. Zhao, D. Buntinas, J.A. Zounmevo, J. Dinan, D. Goodell, P.
Balaji, R. Thakur, A. Afsahi, and W. Gropp, "Towards Asynchronous and MPI-Interoperable
Active Messages", 13th IEEE/ACM International Symposium on
Cluster, Cloud and Grid Computing (CCGrid), Delft, The Netherlands, May
13-16, 2013, pp. 87-94.
- RDMA over Unreliable Datagrams
The current iWARP standard is defined only on reliable, connection-oriented transports. Such a protocol suffers from scalability issues in large-scale applications due to the memory requirements associated with maintaining many inter-process connections. In addition, some applications and data services do not require the reliability overhead, implementation complexity, and cost associated with connection-oriented transports such as TCP. Many datacenter and web-based applications, such as stock-market trading and media streaming, rely on datagram-based semantics (mostly through UDP/IP) and therefore cannot take advantage of iWARP at all. We have proposed extending the iWARP standard on top of the User Datagram Protocol (UDP) in order to utilize the inherent scalability, low implementation cost, and minimal overhead of datagram protocols. We have provided guidelines and discussed the required extensions to the different layers of the current iWARP standard in order to support the connectionless UDP transport. Our proposal is designed to co-exist with, and be consistent and compatible with, the current connection-oriented iWARP.
- M.J. Rashti, R.E. Grant, P. Balaji, and A. Afsahi, "iWARP Redefined:
Scalable Connectionless Communication over High-Speed Ethernet", 17th
International Conference on High Performance Computing (HiPC), Goa,
India, Dec. 19-22, 2010, pp. 1-10.
In a follow-up work, we proposed RDMA Write-Record, to our knowledge the first design and implementation of an RDMA operation over an unreliable datagram transport. It can significantly increase iWARP performance and scalability and expand the application space that iWARP can serve to include some very network-intensive applications. It is designed to be extremely lightweight and to operate in environments in which packet loss occurs.
- R.E. Grant, M.J. Rashti, P. Balaji, and A. Afsahi, "RDMA Capable iWARP
over Datagrams", 25th IEEE International Parallel and
Distributed Processing Symposium (IPDPS), Anchorage, AK, May 16-20,
2011, pp. 628-639.
- R.E. Grant, M.J. Rashti, P. Balaji, and A. Afsahi, "Scalable
Connectionless RDMA over Unreliable Datagrams", Parallel Computing,
48:15-39, Oct. 2015.
- Efficient Collective Communication in MPI
Parallel processes in scientific simulations compute on their local data while communicating extensively with each other through the MPI library. Such communication frequently involves MPI collective operations, in which a group of processes collectively participates. Previous studies of application usage show that the performance and scalability of MPI collective communication operations are critical to HPC applications.
Collective Communications on Multi-rail QsNetII Clusters
The Quadrics network has native support for message striping over multi-rail QsNetII networks only for large point-to-point messages, through its Elan RDMA put/get, SHMEM put/get, and Tports send/receive functions. Rather than devising single-port (MPI) collectives that rely on the underlying striping facilities, we proposed and evaluated multi-port collective schemes that stripe directly at the Elan level using RDMA Write.
- Y. Qian and A. Afsahi, "Efficient RDMA-based Multi-port Collectives on
Multi-rail QsNetII Clusters", 6th Workshop on
Communication Architecture for Clusters (CAC), Rhodes Island, Greece,
Apr. 25-29, 2006, pp. 1-8.
- Y. Qian and A. Afsahi, "High Performance RDMA-based Multi-port
All-gather on Multi-rail QsNetII", 21st
International Symposium on High Performance Computing Systems and
Applications (HPCS), Saskatoon, SK, May 13-16, 2007.
Our earlier work in this area was not optimized for modern SMP clusters. Modern computing nodes all use multiple processor cores, and intra-node communication is typically done through shared memory, whereas inter-node communication goes through the network. We have proposed and evaluated multi-port, RDMA-based, and shared-memory-aware all-gather algorithms with message striping. Some of the proposed algorithms overlap intra-node and inter-node communication and use multiple outstanding RDMA operations to exploit concurrency. Moreover, data buffers are shared between inter-node and intra-node communications.
- Y. Qian and A. Afsahi, "RDMA-based and SMP-aware Multi-port All-gather
on Multi-rail QsNetII SMP Clusters", 36th
International Conference on Parallel Processing (ICPP), XiAn, China,
Sept. 10-14, 2007.
- Y. Qian and A. Afsahi, "Efficient Shared Memory and RDMA based
Collectives on Multi-rail QsNetII SMP Clusters", Cluster
Computing, The Journal of Networks, Software Tools and Applications,
11(4):341-354, 2008
Multi-connection and Multi-core Aware Collectives
on InfiniBand clusters
This research targets both the parallelism available in multi-core nodes and the multi-connection capabilities of modern InfiniBand interconnects in collective design. In multi-core systems, each core will run at least one process with possible connections to other processes. It is therefore very important for the network interface card and its communication software to provide scalable performance for simultaneous communication over an increasing number of connections. We used multi-core processors for better system and network utilization, along with shared-memory communication and multi-connection network features, and devised a number of collective algorithms.
- Y. Qian, M.J. Rashti, and A. Afsahi, "Multi-connection and Multi-core
Aware All-Gather on InfiniBand Clusters", 20th IASTED
International Conference on Parallel and Distributed Computing and Systems (PDCS),
Orlando, FL, Nov. 16-18, 2008, pp. 1-7. Best Paper Award in the area of
Software Systems and Tools.
Process Arrival Pattern Aware Collectives
Most research papers consider their collective design in a controlled
environment. However, recent studies have shown that processes in real
applications may arrive at the collective calls at different times. This
imbalanced process arrival pattern can significantly affect the performance of
the collective operations. We have proposed novel RDMA-based process arrival
pattern aware MPI Alltoall algorithms over InfiniBand clusters. We extended
the algorithms to be shared memory aware for small to medium size messages.
- Y. Qian and A. Afsahi, "Process Arrival Pattern and Shared Memory Aware
Alltoall on InfiniBand", 16th EuroPVM/MPI Conference,
Espoo, Finland, Sept. 7-10, 2009, Lecture Notes in Computer Science (LNCS)
5759, pp. 250-260.
- Y. Qian and A. Afsahi, "Process Arrival Pattern Aware Alltoall and
Allgather on InfiniBand Clusters", International Journal of Parallel
Programming, 39(4):473-493, Aug. 2011
Offloading Non-blocking Collectives
Non-blocking collective operations have recently been proposed and included in the MPI-3 standard. Non-blocking collectives support communication/computation overlap and allow communication latency to be hidden by computation, effectively improving application performance. However, one of the most important factors in achieving a high level of overlap is the ability of the communication subsystem to make progress on the outstanding communication operations in the collective. Offloading is a well-known approach that moves communication processing to an intelligent network processor when possible. In this work, we focused on hiding the collective latency by efficiently offloading it to the networking hardware (Mellanox CORE-Direct offloading technology), which allows a sequence of communication operations to be progressed by the network hardware without host intervention.
- G. Inozemtsev and A. Afsahi, "Designing an Offloaded Nonblocking
MPI_Allgather Collective using CORE-Direct", 14th IEEE
International Conference on Cluster Computing (Cluster), Beijing, China,
Sept. 24-28, 2012, pp. 477-485.
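At the application level, the overlap such offloading serves is expressed through the MPI-3 non-blocking collective interface; a minimal sketch (do_computation and the buffer sizes are placeholders) follows.

    #include <mpi.h>

    void do_computation(void);   /* application work, declared elsewhere */

    /* MPI-3 non-blocking collective: start the allgather, compute, then wait.
     * With an offloaded design (e.g., CORE-Direct), the collective can make
     * progress in the network hardware while the computation runs. */
    void overlap_allgather(const double *sendbuf, double *recvbuf,
                           int count, MPI_Comm comm)
    {
        MPI_Request req;

        MPI_Iallgather(sendbuf, count, MPI_DOUBLE,
                       recvbuf, count, MPI_DOUBLE, comm, &req);
        do_computation();
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }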
- Enhancing MPI Eager Protocol for Small Message Transfers
In MPI implementations, an Eager protocol is used to eagerly transfer small messages to the receiver and avoid the extra overhead of pre-negotiation. RDMA-based communication requires the source and destination buffers to be registered so that they cannot be swapped out before the DMA engine accesses them. Memory registration is an expensive operation that involves buffer pin-down and virtual-to-physical address translation. We propose to register frequently used application buffers so that RDMA operations can be initiated directly from the application buffers rather than from intermediate buffers (infrequently used buffers are treated as before). This way, we decrease the cost of communication by skipping the sender-side data copy.
- M.J. Rashti and A. Afsahi, "Improving RDMA-based MPI Eager Protocol for
Frequently-used Buffers", 9th Workshop on
Communication Architecture for Clusters (CAC), Rome, Italy, May 25-29,
2009, pp. 1-8.
- M.J. Rashti and A. Afsahi, "Exploiting Application Buffer Reuse to
improve MPI Communications", Cluster Computing, The Journal of Networks,
Software Tools and Applications, 14(4):345-356, Dec. 2011.
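A rough sketch of the buffer-reuse idea is given below: keep a small cache of previously registered (address, length) regions, send zero-copy on a hit, and fall back to the conventional copy-based eager path otherwise. The structure, the cache policy, and the rdma_send callback are illustrative placeholders, not the published design.

    #include <stddef.h>
    #include <string.h>

    #define REG_CACHE_SLOTS 64

    /* Illustrative registration-cache entry; a real MPI library would also
     * store the RDMA memory-region handle alongside the address. */
    typedef struct {
        const void *addr;
        size_t      len;
        int         valid;
    } reg_entry_t;

    static reg_entry_t reg_cache[REG_CACHE_SLOTS];

    static int reg_cache_hit(const void *addr, size_t len)
    {
        for (int i = 0; i < REG_CACHE_SLOTS; i++)
            if (reg_cache[i].valid && reg_cache[i].addr == addr &&
                reg_cache[i].len >= len)
                return 1;
        return 0;
    }

    /* Hypothetical eager send path: reuse the registration when the
     * application buffer has been seen before; otherwise copy into a
     * pre-registered bounce buffer as in the conventional protocol. */
    void eager_send(const void *app_buf, size_t len, void *bounce_buf,
                    void (*rdma_send)(const void *, size_t))
    {
        if (reg_cache_hit(app_buf, len)) {
            rdma_send(app_buf, len);            /* zero-copy from the app buffer */
        } else {
            memcpy(bounce_buf, app_buf, len);   /* sender-side copy */
            rdma_send(bounce_buf, len);
            /* A real design would decide here whether to register app_buf
             * and cache it for future reuse. */
        }
    }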
- High-Performance Networking for Data Centers
QoS Provisioning for IP-based Protocols over
InfiniBand
Current enterprise data centers typically use Ethernet networks. IP-based
protocol stacks such as TCP/IP have been widely used by many applications in
enterprise data centers. Such stacks have traditionally been known to incur
high overheads. Consequently, high-speed networks, such as InfiniBand, have
relied on alternative protocol stacks in order to allow applications to take
advantage of the capabilities offered by the network, or offer virtual
protocol interconnects to seamlessly integrate front-end Ethernet networking
with back-end InfiniBand support. The management of network traffic is of
great concern within modern networks. Quality of Service (QoS) provisioning can be used to regulate traffic so that, for example, intra-network traffic has priority over incoming inter-network traffic.
- R.E. Grant, M.J. Rashti, and A. Afsahi, "An Analysis of QoS Provisioning
for Sockets Direct Protocol vs. IPoIB over Modern InfiniBand Networks",
International Workshop on Parallel Programming Models and Systems Software
for High-End Computing (P2S2), Portland, OR, Sept. 12, 2008, pp. 79-86.
Virtual Protocol Interconnect for Data Centers
Virtual Protocol Interconnect (VPI)
is a converged networking concept by Mellanox
Technologies that allows an adapter to transparently migrate between native InfiniBand (IB) mode and Ethernet mode without requiring manual reconfiguration. VPI
facilitates the use of hybrid interconnect architectures in data centers,
allowing the compute systems to interact in native IB mode, while allowing
them to interact with the remote clients in Ethernet mode. VPI also allows for
easier deployment of non-Ethernet network technologies in data centers by
providing a seamless socket-based interface over IB.
- R.E. Grant, A. Afsahi, and P. Balaji, "Evaluation of ConnectX Virtual
Protocol Interconnect for Data Centers", 15th IEEE
International Conference on Parallel and Distributed Systems (ICPADS),
Shenzhen, China, Dec. 8-11, 2009, pp. 57-64.
Hardware Assisted IP over InfiniBand for Data
Centers
While hardware offload for the IP stack has existed for other networks, including Ethernet, for many years, the introduction of such capabilities (Large Send Offload and Large Receive Offload) for IB networks removes some of the barriers preventing the adoption of IB in enterprise data centers.
- R.E. Grant, P. Balaji, and A. Afsahi, "A Study of Hardware Assisted IP
over InfiniBand and its Impact on Data Center Performance", 10th
IEEE International Symposium on Performance Analysis of Systems and
Software (ISPASS), White Plains, NY, Mar. 28-30, 2010, pp. 144-153.
- Power-aware High-Performance Computing
Reduced OS Noise Scheduling for Energy-efficiency
Power consumption has become an important design constraint in servers and
high-performance server clusters. We have explored the power-performance
efficiency of Hyper-Threaded (HT) Asymmetric Multiprocessor (AMP) servers, and
proposed a scheduling algorithm that can be used to reduce the overall power
consumption of a server while maintaining a high level of performance.
Our earlier work masks off a single logical/physical core for operating system tasks only and scales down its frequency to save power, while running user threads on the remaining cores at maximum frequency. In another work, we proposed having one core in the system run at its full clock speed, performing both OS and user tasks, while the remaining cores run user threads at lower operating frequencies.
- R.E. Grant and A. Afsahi, "Power-Performance Efficiency of Asymmetric
Multiprocessors for Multi-threaded Scientific Applications", 2nd
Workshop on High-Performance, Power-Aware Computing (HP-PAC), Rhodes
Island, Greece, Apr. 25-29, 2006, pp. 1-8
- R.E. Grant and A. Afsahi, "Improving System Efficiency through
Scheduling and Power Management", Invited paper, International
Workshop on Green Computing, Austin, TX, September 17, 2007, pp.
478-479.
- R.E. Grant and A. Afsahi, "Improving Energy Efficiency of Asymmetric
Chip Multithreaded Multiprocessors through Reduced OS Noise Scheduling",
Concurrency and Computation: Practice and Experience, 21(18):2355-2376,
Dec 2009.
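As a rough illustration of the general approach (not the published implementation, which targeted Hyper-Threaded AMP servers), on Linux the same idea can be sketched with CPU affinity plus the cpufreq sysfs interface; the paths, frequency value, and governor assumption below are examples only.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin a user thread away from core 0, which is left to operating system
     * tasks, and lower core 0's clock via cpufreq (requires root and the
     * "userspace" governor). */
    static void pin_away_from_core0(pthread_t thread, int ncores)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int c = 1; c < ncores; c++)    /* leave core 0 to the OS */
            CPU_SET(c, &set);
        pthread_setaffinity_np(thread, sizeof(set), &set);
    }

    static void scale_down_core0(long khz)
    {
        FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed",
                        "w");
        if (f != NULL) {
            fprintf(f, "%ld\n", khz);
            fclose(f);
        }
    }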
High-Performance Interconnects Feasibility
Analysis for Power and Energy Efficiency
We have demonstrated the positive impact of modern interconnects in delivering energy efficiency in high-performance clusters. To that end, we presented the power-performance profiles of the Myrinet-2000 and Quadrics QsNetII networks at the user level and MPI level in comparison to a traditional, non-offloaded Gigabit Ethernet. Second, we devised a power-aware MPI library that automatically and transparently performs message segmentation and re-assembly for point-to-point communications in order to boost the energy savings.
- R. Zamani, A. Afsahi, Y. Qian, and C. Hamacher, "A Feasibility Analysis
of Power-Awareness and Energy Minimization in Modern Interconnects for
High-Performance Computing", 9th IEEE International Conference
on Cluster Computing (Cluster), Austin, TX, Sept. 17-20, 2007, pp.
118-128.
Adaptive Estimation and Prediction of Power/Performance
To have an effective power management system in place, it is essential to model and estimate the runtime power of a computing system. Performance monitoring counters (PMCs), along with regression methods, are commonly used to model and estimate runtime power. However, architectural intuition remains fundamental to the current models that relate a computing system's power to its PMCs. In an orthogonal approach, we examine such relationships from a stochastic perspective.
- R. Zamani and A. Afsahi, "Adaptive Estimation and Prediction of Power
and Performance in High Performance Computing", Journal of Computer
Science - Research and Development, 25(3-4):177-186, Sept. 2010. Special
Issue, International Conference on Energy-Aware High Performance
Computing (ENA-HPC).
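A typical PMC-based regression model of this kind (a generic form, not the specific model of the paper) estimates power as a linear combination of counter rates,

$$\hat{P}(t) = \beta_0 + \sum_{i=1}^{n} \beta_i \, x_i(t),$$

where $x_i(t)$ is the rate of the $i$-th selected PMC event over the sampling interval and the coefficients $\beta_i$ are fitted by regression against measured power.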
Power Modeling Using Hardware Performance Monitoring Counters
Many power/energy-saving methods are founded on power consumption models, which commonly rely on hardware performance monitoring counters. PMCs can monitor a variety of events provided by processor manufacturers. In most processors, the number of PMC events is significantly larger than the number of events that can be measured simultaneously. Previously, architectural intuition has guided the selection of PMCs for modeling the workload/power consumption of a system. However, it is not clear which PMC event "group" selection fits such power models best when multiple PMCs can be utilized simultaneously in a model. Therefore, a comprehensive study of PMC events with regard to power modeling is needed to understand and enhance such power models.
- R. Zamani and A. Afsahi, "A Study of Hardware Performance Monitoring
Counter Selection in Power Modeling of Computing Systems", 2nd
International Workshop on Power Measurement and Profiling (PMP), San
Jose, CA, June 5-8, 2012, pp. 1-10.
- Workload Characterization
Characterizing scientific applications and parallel programming paradigms on state-of-the-art platforms allows a better understanding of where tuning must be done in order to improve performance. The bottleneck may be related to the parallel programming API or its implementation, the underlying architectural features, and/or the characteristics of the application under study.
Characterization of OpenMP Constructs and
Applications
Understanding the performance and scalability of OpenMP constructs on specific systems is therefore critical to the development of efficient parallel programs. We have evaluated the performance of OpenMP constructs and application benchmarks on a 72-way Sun Fire 15K multiprocessor system. We reported the performance of basic OpenMP constructs using the EPCC microbenchmarks, along with the NAS OpenMP and SPEC OMP2001 application benchmarks.
- N.R. Fredrickson, A. Afsahi, and Y. Qian, "Performance Characteristics
of OpenMP Constructs, and Application Benchmarks on a Large Symmetric
Multiprocessor", 17th Annual ACM International Conference on
Supercomputing (ICS), San Francisco, CA, June 23-26, 2003, pp.
140-149.
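EPCC-style construct overhead is obtained by timing many repetitions of a nearly empty construct; the simplified sketch below measures the cost of entering and leaving a parallel region, omitting the reference-time subtraction that the full benchmark performs.

    #include <omp.h>
    #include <stdio.h>

    #define REPS 10000

    /* Time REPS nearly empty parallel regions and report the average cost
     * of the parallel construct (fork/join) per repetition. */
    int main(void)
    {
        int sink = 0;
        double t0 = omp_get_wtime();

        for (int r = 0; r < REPS; r++) {
            #pragma omp parallel
            {
                #pragma omp atomic
                sink++;                /* keeps the region from being elided */
            }
        }

        double per_construct = (omp_get_wtime() - t0) / REPS;
        printf("parallel construct overhead: %.3f us (sink=%d)\n",
               per_construct * 1e6, sink);
        return 0;
    }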
We have extended our earlier work to simultaneous multithreading (SMT) processors. Hyper-Threading (a form of SMT) may not suitably benefit OpenMP applications because of its extensive resource sharing. On dual and quad HT-based Intel Xeon servers, we found that the overhead of OpenMP constructs with HT enabled is an order of magnitude larger than when HT is off. Our performance results with the NAS and SPEC OMPM2001 suites indicate that the majority of applications benefit from having a second thread in single-processor configurations. However, only a few applications see a performance gain when HT is enabled on both processors. Data from hardware performance counters verifies that trace cache misses and the trace cache delivery rate are sources of the performance bottleneck.
- R.E. Grant and A. Afsahi, "Characterization of Multithreaded Scientific
Workloads on Simultaneous Multithreading Intel Processors", Workshop on
Interaction between Operating System and Computer Architecture (IOSCA),
Austin, TX, Oct. 6-8, 2005, pp. 13-19.
In another work, we targeted hybrid chip
multithreaded SMPs. Such systems present new challenges as well as new
opportunities to maximize performance. Our intention was to discover the
optimal operating configuration of such systems for scientific applications
and to identify the shared resources that might become a bottleneck to
performance under the different hardware configurations. This knowledge will
be useful to the research community in developing software techniques to
improve the performance of shared memory programs on modern multi-core
multiprocessors.
- R.E. Grant and A. Afsahi, "A Comprehensive Analysis of Multithreaded
OpenMP Applications on Dual-Core Intel Xeon SMPs", Workshop on
Multithreaded Architectures and Applications (MTAAP), Long Beach, CA,
Mar. 26-30, 2007, pp. 1-8.
Communication Characteristics of MPI Applications
Communication performance is an important factor affecting the performance of message-passing parallel applications running on clusters. A proper understanding of the communication behavior of parallel applications helps in designing better communication subsystems and MPI libraries, and also helps application developers maximize their application performance on a target architecture. In this work, we examined the message-passing communication characteristics of applications in the NAS Multi-Zone parallel benchmark suite as well as two applications in the SPEChpc suite.
- R. Zamani and A. Afsahi, "Communication Characteristics of
Message-Passing Scientific and Engineering Applications", 17th
IASTED International Conference on Parallel and Distributed Computing and
Systems (PDCS), Phoenix, AZ, Nov. 14-16, 2005, pp. 644-649.
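Communication characteristics of this kind are typically gathered through the MPI profiling (PMPI) interface; a minimal sketch that intercepts only MPI_Send to record message counts and volume is shown below.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal use of the MPI profiling interface: intercept MPI_Send to
     * record the number of messages and bytes sent, then forward the call
     * to the real implementation via PMPI_Send. */
    static long long msg_count = 0;
    static long long byte_count = 0;

    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        int type_size;
        PMPI_Type_size(datatype, &type_size);
        msg_count++;
        byte_count += (long long)count * type_size;
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }

    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        fprintf(stderr, "rank %d: %lld sends, %lld bytes\n",
                rank, msg_count, byte_count);
        return PMPI_Finalize();
    }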
- Evaluation of High-Performance Interconnects and their Messaging Layers
In different phases of a distributed computation, hosts may exchange a large number of short and long messages over the interconnection network. To achieve performance on network-based computing systems, the interconnection network and the communication system software must provide mechanisms that support efficient communication. Meanwhile, in a distributed environment, applications usually run on top of a standard middleware such as the Message Passing Interface, which itself runs on top of a user-level messaging layer. To determine whether applications can benefit from a particular interconnect, it is essential to assess the various features and the performance of the interconnect at both the user and middleware levels. In our first study, we evaluated the Sun Fire Link interconnect from Sun Microsystems. Sun Fire Link is a memory-based interconnect with layered system software components that implement user-level messaging based on direct access to remote memory regions of other nodes, referred to as Remote Shared Memory (RSM). We assessed the performance of the interconnect at the RSM and MPI levels.
- A. Afsahi and Y. Qian, "Remote Shared Memory over Sun Fire Link
Interconnect", 15th IASTED International Conference on
Parallel and Distributed Computing and Systems (PDCS), Marina del
Rey, CA, Nov. 3-5, 2003, pp. 381-386.
- Y. Qian, A. Afsahi, N.R. Fredrickson, and R. Zamani, "Performance
Evaluation of the Sun Fire Link SMP Clusters", International Journal of
High Performance Computing and Networking, 4(5/6):209-221, 2006.
In another work, we evaluated the performance of Myrinet two-port networks at the user level (GM) and the MPI level. The microbenchmarks were designed to assess the quality of the MPI implementation on top of GM.
- Y. Qian, A. Afsahi, and R. Zamani, "Myrinet Networks: A Performance
Study", 3rd IEEE International Symposium on Network Computing
and Applications (NCA), Cambridge, MA, Aug. 30 - Sept. 1, 2004, pp.
323-328.
- R. Zamani, Y. Qian, and A. Afsahi, "An Evaluation of the Myrinet/GM2
Two-Port Networks", 3rd IEEE Workshop on High-Speed Local
Networks (HSLN), Tampa, FL, Nov. 16-18, 2004, pp. 734-742.
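The latency microbenchmarks used in evaluations of this kind follow the standard ping-pong pattern; a minimal MPI-level version is sketched below (warm-up iterations and the message-size sweep are omitted, and the 8-byte message is illustrative).

    #include <mpi.h>
    #include <stdio.h>

    #define ITERS 1000

    /* Standard two-process ping-pong: half of the measured round-trip time
     * is reported as the one-way latency for the given message size. */
    int main(int argc, char **argv)
    {
        int rank;
        char msg[8] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(msg, sizeof(msg), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(msg, sizeof(msg), MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(msg, sizeof(msg), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(msg, sizeof(msg), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)
            printf("one-way latency: %.2f us\n",
                   (MPI_Wtime() - t0) / (2.0 * ITERS) * 1e6);

        MPI_Finalize();
        return 0;
    }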
In another study, we assessed the potential of the NetEffect iWARP Ethernet for high-performance computing. The results show a significant improvement in Ethernet performance, as well as high multi-connection performance scalability.
- M.J. Rashti and A. Afsahi, "10-Gigabit iWARP Ethernet: Comparative
Performance Analysis with InfiniBand and Myrinet-10G", 7th
Workshop on Communication Architecture for Clusters (CAC), Long Beach,
CA, Mar. 26-30, 2007, pp. 1-8