Tutorials

The SC Tutorials program is always one of the highlights of the SC Conference, offering attendees a variety of short courses on key topics and technologies relevant to high performance computing, networking, storage, and analysis. Tutorials also provide the opportunity to interact with recognized leaders in the field and to learn about the latest technology trends, theory, and practical techniques. As in years past, tutorial submissions were subjected to a rigorous peer review process. Of the 74 submissions, the 30 members of the Tutorials Committee selected the following 30 tutorials for presentation.


A "Hands-On" introduction to OpenMP

Date: Sunday, November 17th
Time: 8:30am-5pm
Presenter(s): Tim Mattson, Mark Bull, Mike Pearce

Abstract: OpenMP is the de facto standard for writing parallel applications for shared memory computers. With multi-core processors in everything from tablets to high-end servers, the need for multithreaded applications is growing and OpenMP is one of the most straightforward ways to write such programs. In this tutorial, we will cover the core features of the OpenMP 3.1 standard. This will be a hands-on tutorial. We expect students to use their own laptops (with Windows, Linux, or OS X). We will have access to systems with OpenMP (a remote SMP server), but the best option is for students to load an OpenMP compiler onto their laptops before the tutorial. Information about OpenMP compilers is available at www.openmp.org.
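
For orientation, here is a minimal sketch of the kind of core construct the tutorial covers (a parallel loop with a reduction); the loop body and variable names are illustrative, not taken from the tutorial materials. With most compilers it builds with an -fopenmp or equivalent flag.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        const int N = 1000000;
        double sum = 0.0;

        /* Parallel loop with a reduction -- one of the core OpenMP constructs. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += 1.0 / (i + 1);

        printf("harmonic sum = %f (threads available: %d)\n",
               sum, omp_get_max_threads());
        return 0;
    }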


Structured Parallel Programming with Patterns

Date: Sunday, November 17th
Time: 8:30am-5pm
Presenter(s): Michael Hebenstreit, James R. Reinders, Arch D. Robison, Michael McCool

Abstract: Parallel programming is important for performance, and developers need a comprehensive set of strategies and technologies for tackling it. This tutorial is intended for C++ programmers who want to better grasp how to envision, describe and write efficient parallel algorithms at the single shared-memory node level. This tutorial will present a set of algorithmic patterns for parallel programming. Patterns describe best known methods for solving recurring design problems. Algorithmic patterns in particular are the building blocks of algorithms. Using these patterns to develop parallel algorithms will lead to better structured, more scalable, and more maintainable programs. This course will discuss when and where to use a core set of parallel patterns, how to best implement them, and how to analyze the performance of algorithms built using them. Patterns to be presented include map, reduce, scan, pipeline, fork-join, stencil, tiling, and recurrence. Each pattern will be demonstrated using working code in one or more of Cilk Plus, Threading Building Blocks, OpenMP, or OpenCL. Attendees also will have the opportunity to test the provided examples themselves on an HPC cluster for the duration of the SC13 conference.
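
As a flavor of the material, here is a sketch of the map and reduce patterns expressed in OpenMP (one of the models listed above); the arrays, sizes, and operations are our own illustration rather than the presenters' examples.

    #include <stdio.h>

    /* Map: apply a function to every element independently.
       Reduce: combine the per-element results into one value. */
    int main(void)
    {
        enum { N = 1 << 20 };
        static double x[N], y[N];
        double total = 0.0;

        for (int i = 0; i < N; i++) x[i] = i * 0.001;

        /* Map pattern */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] = x[i] * x[i];

        /* Reduce pattern */
        #pragma omp parallel for reduction(+:total)
        for (int i = 0; i < N; i++)
            total += y[i];

        printf("sum of squares = %g\n", total);
        return 0;
    }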


OpenACC: Productive, Portable Performance on Hybrid Systems Using High-Level Compilers and Tools

Date: Sunday, November 17th
Time: 8:30am-5pm
Presenter(s): Heidi Poxon, James Beyer, Luiz DeRose, Alistair Hart

Abstract: Portability and programming difficulty are two critical hurdles in generating widespread adoption of accelerated computing in high performance computing. The dominant programming models for accelerator-based systems (CUDA and OpenCL) offer the power to extract performance from accelerators, but with extreme costs in usability, maintenance, development, and portability. To be an effective HPC platform, hybrid systems need a high-level programming environment to enable widespread porting and development of applications that run efficiently on either accelerators or CPUs. In this hands-on tutorial we present the high-level OpenACC parallel programming model for accelerator-based systems, demonstrating compilers, libraries, and tools that support this cross-vendor initiative. Using personal experience in porting large-scale HPC applications, we provide development guidance, practical tricks, and tips to enable effective and efficient use of these hybrid systems.
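
For illustration, a minimal sketch of the directive style OpenACC uses; the loop and data clauses here are our own example, not drawn from the presenters' materials.

    #include <stdio.h>

    int main(void)
    {
        const int n = 1 << 20;
        static float x[1 << 20], y[1 << 20];

        for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        /* The compiler generates accelerator code for this loop and manages
           the data movement implied by the copy clauses. */
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = 2.0f * x[i] + y[i];

        printf("y[0] = %f\n", y[0]);
        return 0;
    }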


Hands-On Practical Hybrid Parallel Application Performance Engineering

Date: Sunday, November 17th
Time: 8:30am-5pm
Presenter(s): Markus Geimer, Brian J. N. Wylie, Bert Wesarg, Sameer Shende

Abstract: This tutorial presents state-of-the-art performance tools for leading-edge HPC systems founded on the Score-P community instrumentation and measurement infrastructure, demonstrating how they can be used for performance engineering of effective scientific applications based on standard MPI, OpenMP, hybrid MPI+OpenMP, and increasingly common usage of accelerators. Parallel performance evaluation tools from the VI-HPS (Virtual Institute High Productivity Supercomputing) are introduced and featured in hands-on exercises with Scalasca, Vampir and TAU. We present the complete workflow of performance engineering, including instrumentation, measurement (profiling and tracing, timing and PAPI hardware counters), data storage, analysis, and visualization. Emphasis is placed on how tools are used in combination for identifying performance problems and investigating optimization alternatives. Participants using their own notebook computers with a provided Linux Live-ISO image containing the tools (booted from DVD/USB or run within a virtual machine) will be prepared to locate and diagnose performance bottlenecks in their own parallel programs.


Debugging MPI and Hybrid/Heterogeneous Applications at Scale

Date: Sunday, November 17th
Time: 8:30am-5pm
Presenter(s): David Lecomber, Matthias S. Mueller, Tobias Hilbrich, Ganesh Gopalakrishnan, Bronis R. de Supinski

Abstract: MPI programming is error prone due to the complexity of MPI semantics and the difficulties of parallel programming. Increasing heterogeneity (e.g., MPI plus OpenMP/CUDA), scale, non-determinism, and platform-dependent bugs exacerbate these difficulties. This tutorial covers the detection/correction of errors in MPI programs at small and large scale, as well as for heterogeneous/hybrid programs. We will first introduce our main tools: MUST, which detects MPI usage errors at runtime with a high degree of automation; ISP/DAMPI, which detects interleaving-dependent MPI deadlocks and assertion violations through application replay; and DDT, a highly scalable parallel debugger. Attendees will be encouraged to explore our tools early during the tutorial to better appreciate their strengths/limitations. We will present best practices and a cohesive workflow for comprehensive application debugging with all our tools. We dedicate the afternoon session to advanced use-cases, tool deployment on leadership-scale systems, updates on new tool functionality, and the debugging of hybrid/heterogeneous programming models. The latter includes debugging approaches for MPI, OpenMP, and CUDA and is especially crucial for systems such as Titan (ORNL) and Sequoia (LLNL). DDT's capabilities for CUDA/OpenMP debugging will be presented, in addition to a short introduction to GKLEE, a new symbolic verifier for CUDA applications.


Advanced MPI Programming

Date: Sunday, November 17th
Time: 8:30am-5pm
Presenter(s): Torsten Hoefler, James Dinan, Pavan Balaji, Rajeev Thakur

Abstract: The vast majority of production parallel scientific applications today use MPI and run successfully on the largest systems in the world. For example, several MPI applications are running at full scale on the Sequoia system (on approximately 1.6 million cores) and achieving 12 to 14 petaflops of sustained performance. At the same time, the MPI standard itself is evolving (MPI-3 was released late last year) to address the needs and challenges of future extreme-scale platforms as well as applications. This tutorial will cover several advanced features of MPI, including new MPI-3 features, that can help users program modern systems effectively. Using code examples based on scenarios found in real applications, we will cover several topics including efficient ways of doing 2D and 3D stencil computation, derived data types, one-sided communication, hybrid (MPI + shared memory) programming, topologies and topology mapping, and neighborhood and non-blocking collectives. Attendees will leave the tutorial with an understanding of how to use these advanced features of MPI and guidelines on how they might perform on different platforms and architectures.
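
As a taste of the MPI-3 material, here is a small sketch of one-sided communication using MPI_Win_allocate and MPI_Put; the toy neighbor exchange is illustrative only and not taken from the tutorial's examples.

    #include <mpi.h>
    #include <stdio.h>

    /* Each rank exposes one int in a window and puts its rank into the
       window of its right-hand neighbor (a toy halo exchange). */
    int main(int argc, char **argv)
    {
        int rank, size, *buf;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &buf, &win);
        *buf = -1;

        int right = (rank + 1) % size;
        MPI_Win_fence(0, win);
        MPI_Put(&rank, 1, MPI_INT, right, 0, 1, MPI_INT, win);
        MPI_Win_fence(0, win);

        printf("rank %d received %d from its left neighbor\n", rank, *buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }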


Parallel Computing 101

Date: Sunday, November 17th
Time: 8:30am-5pm
Presenter(s): Quentin F. Stout, Christiane Jablonowski

Abstract: This tutorial provides a comprehensive overview of parallel computing, emphasizing those aspects most relevant to the user. It is suitable for new users, managers, students and anyone seeking an overview of parallel computing. It discusses software and hardware, with an emphasis on standards, portability, and systems that are widely available. The tutorial surveys basic parallel computing concepts, using examples selected from multiple engineering and scientific problems. These examples illustrate using MPI on distributed memory systems, OpenMP on shared memory systems, MPI+OpenMP on hybrid systems, GPU programming, and Hadoop on big data. It discusses numerous parallelization approaches, and software engineering and performance improvement aspects, including the use of state-of-the-art tools. The tutorial helps attendees make intelligent decisions by covering the primary options that are available, explaining how they are used and what they are most suitable for. Extensive pointers to the literature and web-based resources are provided to facilitate follow-up studies.


Programming for the Intel Xeon Phi

Date: Sunday, November 17th
Time: 8:30am-5pm
Presenter(s): Lucas A. Wilson, John D McCalpin, Kent Milfeld

Abstract: The Innovative Technology component of the recently deployed XSEDE Stampede supercomputer at TACC provides access to 8 PetaFlops of computing power in the form of the new Intel Xeon Phi Coprocessor, also known as a MIC. While the MIC is x86 based, hosts its own Linux OS, and is capable of running most user codes with little porting effort, the MIC architecture has significant features that are different from that of present x86 CPUs, and optimal performance requires an understanding of the possible execution models and basic details of the architecture. This tutorial is designed to introduce attendees to the MIC architecture in a practical manner. Multiple lectures and hands-on exercises will be used to acquaint attendees with the MIC platform and explore the different execution modes as well as parallelization and optimization through example testing and reports.
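
As a rough illustration of the offload execution mode (one of several modes the tutorial explores), here is a sketch using the Intel compiler's offload pragma of that era; the clause spellings varied by compiler version, so treat this as an assumption-laden sketch rather than the tutorial's own exercise code.

    #include <stdio.h>

    #define N 1024

    int main(void)
    {
        static float a[N], b[N];
        for (int i = 0; i < N; i++) a[i] = (float)i;

        /* Offload model: the marked region runs on the coprocessor; the
           in/out clauses describe the data transferred over PCIe. */
        #pragma offload target(mic) in(a:length(N)) out(b:length(N))
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                b[i] = 2.0f * a[i];
        }

        printf("b[N-1] = %f\n", b[N - 1]);
        return 0;
    }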


Globus Online and the Science DMZ as Scalable Research Data Management Infrastructure for HPC Facilities

Date: Sunday, November 17th
Time: 8:30am-5pm
Presenter(s): Rajkumar Kettimuthu, Vas Vasiliadis, Steve Tuecke, Eli Dart

Abstract: The rapid growth of data in scientific research endeavors is placing massive demands on campus computing centers and high-performance computing (HPC) facilities. Computing facilities must provide robust data services built on high-performance infrastructure, while continuing to scale as needs increase. Traditional research data management (RDM) solutions are typically difficult to use and error-prone, and the underlying networking and security infrastructure is often complex and inflexible, resulting in user frustration and sub-optimal use of resources. An increasingly common solution in HPC facilities is Globus Online deployed in a network environment built on the Science DMZ model. Globus Online is software-as-a-service for moving, syncing, and sharing large data sets. The Science DMZ model is a set of design patterns for network equipment, configuration, and security policy for high-performance scientific infrastructure. The combination of user-friendly, high-performance data transfer tools, and optimally configured underlying infrastructure results in enhanced RDM services that increase user productivity and lower support overhead. Guided by two case studies from national supercomputing centers (NERSC and NCSA), attendees will explore the challenges such facilities face in delivering scalable RDM solutions. Attendees will be introduced to Globus Online and the Science DMZ, and will learn how to deploy and manage these systems.


Large Scale Visualization with ParaView

Date: Sunday, November 17th
Time: 8:30am-12pm
Presenter(s): W. Alan Scott, David DeMarle, Li-Ta Lo, Kenneth Moreland

Abstract: ParaView is a powerful open-source turnkey application for analyzing and visualizing large data sets in parallel. Designed to be configurable, extendible, and scalable, ParaView is built upon the Visualization Toolkit (VTK) to allow rapid deployment of visualization components. This tutorial presents the architecture of ParaView and the fundamentals of parallel visualization. Attendees will learn the basics of using ParaView for scientific visualization with hands-on lessons. The tutorial features detailed guidance in visualizing the massive simulations run on today's supercomputers and an introduction to scripting and extending ParaView. Attendees should bring laptops to install ParaView and follow along with the demonstrations.


Hybrid MPI and OpenMP Parallel Programming

Date: Sunday, November 17th
Time: 8:30am-12pm
Presenter(s): Gabriele Jost, Rolf Rabenseifner, Georg Hager

Abstract: Most HPC systems are clusters of shared memory nodes. Such systems can be PC clusters with single/multi-socket and multi-core SMP nodes, but also constellation-type systems with large SMP nodes. Parallel programming may combine the distributed memory parallelization on the node interconnect with the shared memory parallelization within each node. This tutorial analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes. Multi-socket-multi-core systems in highly parallel environments are given special consideration. MPI-3.0 introduced a new shared memory programming interface, which can be combined with MPI message passing and remote memory access on the cluster interconnect. It can be used for direct neighbor accesses similar to OpenMP or for direct halo copies, and enables new hybrid programming models. These models are compared with various hybrid MPI+OpenMP approaches and pure MPI. This tutorial also includes a discussion on OpenMP support for accelerators. Benchmark results on different platforms are presented. Numerous case studies demonstrate the performance-related aspects of hybrid programming, and application categories that can take advantage of this model are identified. Tools for hybrid programming such as thread/process placement support and performance analysis are presented in a "how-to" section. Details: https://fs.hlrs.de/projects/rabenseifner/publ/SC2013-hybrid.html
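
For illustration, a minimal sketch of the MPI-3.0 shared memory interface mentioned above (MPI_Comm_split_type plus MPI_Win_allocate_shared); the toy data exchange is our own example, not the tutorial's.

    #include <mpi.h>
    #include <stdio.h>

    /* MPI-3 shared memory: ranks on the same node allocate one shared
       segment and access each other's portion with plain loads/stores. */
    int main(int argc, char **argv)
    {
        MPI_Comm nodecomm;
        MPI_Win win;
        int noderank, nodesize;
        double *mem;

        MPI_Init(&argc, &argv);
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &nodecomm);
        MPI_Comm_rank(nodecomm, &noderank);
        MPI_Comm_size(nodecomm, &nodesize);

        /* Each rank contributes one double to the node-local segment. */
        MPI_Win_allocate_shared(sizeof(double), sizeof(double), MPI_INFO_NULL,
                                nodecomm, &mem, &win);
        mem[0] = (double)noderank;

        MPI_Win_fence(0, win);
        if (noderank == 0) {
            /* Rank 0 reads a neighbor's value directly from shared memory. */
            MPI_Aint sz; int disp; double *neighbor;
            MPI_Win_shared_query(win, (noderank + 1) % nodesize, &sz, &disp,
                                 &neighbor);
            printf("neighbor wrote %f\n", neighbor[0]);
        }
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }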


InfiniBand and High-speed Ethernet for Dummies

Date: Sunday, November 17th
Time: 8:30am-12pm
Presenter(s): Hari Subramoni, Dhabaleswar K. (DK) Panda

Abstract: InfiniBand (IB) and High-Speed Ethernet (HSE) technologies are generating a lot of excitement around building next generation High-End Computing (HEC) systems including clusters, datacenters, file systems, storage, cloud computing and Big Data (Hadoop, HBase and Memcached) environments. RDMA over Converged Enhanced Ethernet (RoCE) technology is also emerging. This tutorial will provide an overview of these emerging technologies, their offered architectural features, their current market standing, and their suitability for designing HEC systems. It will start with a brief overview of IB and HSE. An in-depth overview of the architectural features of IB and HSE (including iWARP and RoCE), their similarities and differences, and the associated protocols will be presented. Next, an overview of the emerging OpenFabrics stack which encapsulates IB, HSE and RoCE in a unified manner will be presented. Hardware/software solutions and the market trends behind IB, HSE and RoCE will be highlighted. Finally, sample performance numbers of these technologies and protocols for different environments will be presented.


Practical Fault Tolerance on Today's HPC Systems

Date: Sunday, November 17th
Time: 1:30pm-5pm
Presenter(s): Nathan Debardeleben, Eric Roman, Laxmikant V. Kale, Kathryn Mohror

Abstract: Failure rates on high performance computing systems are increasing with growing component count. Applications running on these systems currently experience failures on the order of days; however, on future systems, predictions of failure rates range from minutes to hours. Developers need to defend their application runs from losing valuable data by using fault tolerant techniques. These techniques range from changing algorithms, to checkpoint and restart, to programming model-based approaches. In this tutorial, we will present introductory material for developers who wish to learn fault tolerant techniques available on today's systems. We will give background information on the kinds of faults occurring on today's systems and trends we expect going forward. Following this, we will give detailed information on several fault tolerant approaches and how to incorporate them into applications. Our focus will be on scalable coordinated checkpoint and restart mechanisms and programming model-based approaches for MPI applications.
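
To make the checkpoint/restart idea concrete, here is a minimal application-level sketch in plain C; it is not one of the libraries or programming-model approaches covered in the tutorial, and the file name and checkpoint interval are arbitrary choices for illustration.

    #include <stdio.h>
    #include <stdlib.h>

    #define CKPT_FILE "state.ckpt"   /* illustrative file name */

    /* Write the iteration counter and solution array so a failed run can
       be restarted from the last checkpoint instead of from scratch. */
    static void checkpoint(int iter, const double *u, size_t n)
    {
        FILE *f = fopen(CKPT_FILE, "wb");
        if (!f) { perror("checkpoint"); return; }
        fwrite(&iter, sizeof iter, 1, f);
        fwrite(u, sizeof *u, n, f);
        fclose(f);
    }

    int main(void)
    {
        enum { N = 1000, STEPS = 10000, INTERVAL = 500 };
        double *u = calloc(N, sizeof *u);

        for (int iter = 0; iter < STEPS; iter++) {
            for (int i = 0; i < N; i++)      /* stand-in for real computation */
                u[i] += 1e-6 * i;
            if (iter % INTERVAL == 0)
                checkpoint(iter, u, N);
        }
        free(u);
        return 0;
    }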


Scaling I/O Beyond 100,000 Cores using ADIOS

Date: Sunday, November 17th
Time: 1:30pm-5pm
Presenter(s): Qing Liu, Scott Klasky, Norbert Podhorszki

Abstract: As concurrency continues to increase on high-end machines, in both core count and the number of storage devices, we must look for a revolutionary way to treat I/O. In fact, one of the major roadblocks to exascale is how to write and read big datasets quickly and efficiently on high-end machines. At the same time, applications often need to process data in an efficient and flexible manner, in terms of both data formats and the operations performed (e.g., files, data streams). In this tutorial we will show how users can do that and get high performance with ADIOS on 100,000+ cores. Part I of this tutorial will introduce parallel I/O and the ADIOS framework to the audience. Specifically, we will discuss the concept of the ADIOS I/O abstraction, the binary-packed file format, and the I/O methods, along with their benefits to applications. Since version 1.4.1, ADIOS can operate on both files and data streams. Part II will include a session on how to write/read data and how to use different I/O componentizations inside of ADIOS. Part III will show users how to take advantage of the ADIOS framework to do compression/indexing. Finally, we will discuss how to run in-situ visualization using VisIt/ParaView + ADIOS.


Advanced Topics in InfiniBand and High-Speed Ethernet for Designing High-End Computing Systems

Date: Sunday, November 17th
Time: 1:30pm-5pm
Presenter(s): Hari Subramoni, Dhabaleswar K. (DK) Panda

Abstract: As InfiniBand (IB) and High-Speed Ethernet (HSE) technologies mature, they are being used to design and deploy different kinds of High-End Computing (HEC) systems: HPC clusters with accelerators (GPGPUs and MIC) supporting MPI and PGAS (UPC and OpenSHMEM), Storage and Parallel File Systems, Cloud Computing with Virtualization, Big Data systems with Hadoop (HDFS, MapReduce and HBase), Multi-tier Datacenters with Web 2.0 (memcached) and Grid Computing systems. These systems are bringing new challenges in terms of performance, scalability, and portability. Many scientists, engineers, researchers, managers and system administrators are becoming interested in learning about these challenges, approaches being used to solve these challenges, and the associated impact on performance and scalability. This tutorial will start with an overview of these systems and a common set of challenges being faced while designing these systems. Advanced hardware and software features of IB and HSE and their capabilities to address these challenges will be emphasized. Next, case studies focusing on domain-specific challenges in designing these systems (including the associated software stacks), their solutions and sample performance numbers will be presented. The tutorial will conclude with a set of demos focusing on RDMA programming, network management infrastructure and tools to effectively use these systems.


Ensuring Network Performance with perfSONAR

Date: Monday, November 18th
Time: 8:30am-5pm
Presenter(s): Jason Zurawski

Abstract: A key component of supercomputing is super networking. Scientific data sets continue to increase in number, size, and importance for numerous research and education (R&E) communities, and they rely on networks to facilitate sharing. Solicitations such as the NSF's CC-NIE have recognized this; new paradigms in networking, increased capacities, and an emphasis on monitoring were specifically mentioned as areas of potential funding. A user's network experience must be reliable and free of architectural flaws and physical limitations. Operational staffs are limited in the support they can deliver, normally within a domain; innovative tools are required to solve the "end-to-end" performance problems that hamper network use. End users should be familiar with these tools, both to protect their own interests and expectations and to assist operations in debugging exercises. We will present an overview of network performance tools and techniques, focusing on the deployment of the pS-Performance Toolkit. This all-in-one, community-developed monitoring solution allows local control while providing a global view of performance that will directly impact the use of networks. Goals include familiarizing attendees with ways these tools may aid in debugging networks, hosts, and applications, as well as the proper way to install and configure the software for personal use.


Advanced OpenMP: Performance and 4.0 Features

Date: Monday, November 18th
Time: 8:30am-5pm
Presenter(s): Michael Klemm, Bronis R. de Supinski, Christian Terboven, Ruud van der Pas

Abstract: With the increasing prevalence of multicore processors, shared-memory programming models are essential. OpenMP is a popular, portable, widely supported and easy-to-use shared-memory model. Developers usually find OpenMP easy to learn. However, they are often disappointed with the performance and scalability of the resulting code. This disappointment stems not from shortcomings of OpenMP but rather from the lack of depth with which it is employed. Our Advanced OpenMP Programming tutorial addresses this critical need by exploring the implications of possible OpenMP parallelization strategies, both in terms of correctness and performance. While we quickly review the basics of OpenMP programming, we assume attendees understand basic parallelization concepts and will easily grasp those basics. We focus on performance aspects, such as data and thread locality on NUMA architectures, false sharing, and exploitation of vector units. We discuss language features in-depth, with emphasis on features recently added to OpenMP such as tasking and cancellation. We close with an overview of the new OpenMP 4.0 directives for attached compute accelerators.
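
As an example of the tasking features discussed, here is a small OpenMP 4.0 sketch using task dependences; the variables and values are illustrative, not taken from the tutorial.

    #include <stdio.h>

    int main(void)
    {
        int a = 0, b = 0;

        #pragma omp parallel
        #pragma omp single
        {
            /* OpenMP 4.0 task dependences: the second task waits for the first. */
            #pragma omp task depend(out: a)
            a = 42;

            #pragma omp task depend(in: a) depend(out: b)
            b = a + 1;

            #pragma omp taskwait
            printf("a = %d, b = %d\n", a, b);
        }
        return 0;
    }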


OpenCL: A Hands-On Introduction

Date: Monday, November 18th
Time: 8:30am-5pm
Presenter(s): Tim Mattson, Alice Koniges, Simon McIntosh-Smith

Abstract: OpenCL is an open standard for programming heterogeneous parallel computers composed of CPUs, GPUs and other processors. OpenCL consists of a framework to manipulate the host CPU and one or more compute devices (CPUs, GPUs or accelerators), and a C-based programming language for writing programs for the compute devices. Using OpenCL, a programmer can write parallel programs that harness all of the resources of a heterogeneous computer. In this hands-on tutorial, we will introduce OpenCL. For ease of learning, we will focus on the easier-to-use C++ API, but attendees will also gain an understanding of OpenCL's C API. The format will be a 50/50 split between lectures and exercises. Students will use their own laptops (Windows, Linux or OS X) and log into a remote server running an OpenCL platform on a range of different processors. Alternatively, students can load OpenCL onto their own laptops prior to the course (Intel, AMD and NVIDIA provide OpenCL SDKs. Apple laptops with Xcode include OpenCL by default). By the end of the course, attendees will be able to write and optimize OpenCL programs, and will have a collection of example codes to help with future OpenCL program development.
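
For reference, here is a compact vector-add sketch using the OpenCL C API (the tutorial itself emphasizes the C++ API); error checking is omitted for brevity and the kernel and buffer names are our own illustration.

    #include <stdio.h>
    #include <CL/cl.h>

    static const char *src =
        "__kernel void vadd(__global const float *a, __global const float *b,"
        "                   __global float *c) {"
        "    int i = get_global_id(0);"
        "    c[i] = a[i] + b[i];"
        "}";

    int main(void)
    {
        enum { N = 1024 };
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

        cl_platform_id platform; cl_device_id device; cl_int err;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

        cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof a, a, &err);
        cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof b, b, &err);
        cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, &err);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "vadd", &err);

        clSetKernelArg(k, 0, sizeof(cl_mem), &da);
        clSetKernelArg(k, 1, sizeof(cl_mem), &db);
        clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

        size_t global = N;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);

        printf("c[10] = %f\n", c[10]);
        return 0;
    }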


Asynchronous Hybrid and Heterogeneous Parallel Programming with MPI/OmpSs and its Impact in Energy Efficient Architectures for Exascale Systems

Date: Monday, November 18th
Time: 8:30am-5pm
Presenter(s): Eduard Ayguadé, Rosa M. Badia, Jesus Labarta, Alex Ramirez

Abstract: Due to its asynchronous nature and look-ahead capabilities, MPI/OmpSs is a promising programming model approach for future exascale systems, with the potential to exploit unprecedented amounts of parallelism, while coping with memory latency, network latency and load imbalance. Many large-scale applications are already seeing very positive results from their ports to MPI/OmpSs (see EU projects Montblanc, DEEP, TEXT). We will first cover the basic concepts of the programming model. OmpSs can be seen as an extension of the OpenMP model. Unlike OpenMP, however, task dependencies are determined at runtime thanks to the directionality of data arguments. The OmpSs runtime supports asynchronous execution of tasks on heterogeneous systems such as SMPs, GPUs and clusters thereof. The integration of OmpSs with MPI facilitates the migration of current MPI applications and improves the performance of these applications by overlapping computation with communication between tasks. The tutorial will also cover the performance tools available for the programming model, in particular the Paraver performance analysis tool. Examples of benchmarks and applications parallelized with MPI/OmpSs will also be presented. The tutorial will present the impact of the programming model in addressing the limitations of using low-end devices to build power-efficient parallel platforms. The tutorial will also include hands-on exercises.
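
To illustrate the directionality idea, here is a small sketch in OmpSs-style syntax; OmpSs programs need no explicit parallel region, and the exact clause spellings should be checked against the OmpSs documentation, so treat this as an assumption-laden sketch rather than verified tutorial code.

    #include <stdio.h>

    int main(void)
    {
        int a = 0, b = 0;

        /* Directionality clauses (OmpSs style): the runtime builds the
           task dependence graph from in()/out() at execution time. */
        #pragma omp task out(a)
        { a = 42; }

        #pragma omp task in(a) out(b)
        { b = a + 1; }   /* runs only after the first task completes */

        #pragma omp taskwait
        printf("b = %d\n", b);
        return 0;
    }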


The Practitioner's Cookbook for Good Parallel Performance on Multi- and Many-Core Systems

Date: Monday, November 18th
Time: 8:30am-5pm
Presenter(s): Gerhard Wellein, Georg Hager, Jan Treibig

Abstract: The advent of multi- and many-core chips has led to a further opening of the gap between peak and application performance for many scientific codes. This trend is accelerating as we move from petascale to exascale. Paradoxically, bad node-level performance helps to "efficiently" scale to massive parallelism, but at the price of increased overall time to solution. If the user cares about time to solution on any scale, optimal performance on the node level is often the key factor. Also, the potential of node-level improvements is widely underestimated, thus it is vital to understand the performance-limiting factors on modern hardware. We convey the architectural features of current processor chips, multiprocessor nodes, and accelerators, as well as the performance properties of the dominant MPI and OpenMP programming models, as far as they are relevant for the practitioner. Peculiarities like SIMD vectorization, shared vs. separate caches, bandwidth bottlenecks, and ccNUMA characteristics are introduced, and the influence of system topology and affinity on the performance of typical parallel programming constructs is demonstrated. Performance engineering is introduced as a powerful tool that helps the user assess the impact of possible code optimizations by establishing models for the interaction of the software with the hardware.
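
As one concrete example of the ccNUMA issues mentioned, here is a sketch of first-touch page placement with OpenMP; the array names and sizes are illustrative. The key point is that the initialization loop uses the same static schedule as the compute loop, so pages end up in the memory local to the threads that later use them.

    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 24)

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);

        /* First-touch initialization: pages are mapped to the NUMA domain of
           the thread that touches them first. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = (double)i; }

        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 2.0 * b[i];        /* now mostly local memory accesses */

        printf("a[N-1] = %f\n", a[N - 1]);
        free(a); free(b);
        return 0;
    }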


Parallel I/O in Practice

Date: Monday, November 18th
Time: 8:30am-5pm
Presenter(s): Katie Antypas, Brent Welch, Robert Latham, Robert B. Ross

Abstract: I/O on HPC systems is a black art. This tutorial sheds light on the state-of-the-art in parallel I/O and provides the knowledge necessary for attendees to best leverage I/O resources available to them. We cover the entire I/O software stack from parallel file systems at the lowest layer, to intermediate layers (such as MPI-IO), and finally high-level I/O libraries (such as HDF-5). We emphasize ways to use these interfaces that result in high performance. Benchmarks on real systems are used throughout to show real-world results. This tutorial first discusses parallel file systems (PFSs) in detail. We cover general concepts and examine four examples: GPFS, Lustre, PanFS, and PVFS. We examine the upper layers of the I/O stack, covering POSIX I/O, MPI-IO, Parallel netCDF, and HDF5. We discuss interface features, show code examples, and describe how application calls translate into PFS operations. Finally, we discuss I/O best practices.
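
To give a flavor of the MPI-IO layer discussed, here is a minimal collective-write sketch; the file name and data layout are our own example, not drawn from the tutorial.

    #include <mpi.h>
    #include <stdio.h>

    /* Each rank writes its block of data to a shared file with a single
       collective call, letting the MPI-IO layer optimize the access. */
    int main(int argc, char **argv)
    {
        enum { N = 1024 };
        int rank;
        double buf[N];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < N; i++) buf[i] = rank + i * 1e-6;

        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }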


Linear Algebra Libraries for HPC: Scientific Computing with Multicore and Accelerators

Date: Monday, November 18th
Time: 8:30am-5pm
Presenter(s): Jakub Kurzak, Jack Dongarra, James Demmel, Michael A. Heroux

Abstract: Today, desktops with a multicore processor and a GPU accelerator can already provide a TeraFlop/s of performance, while the performance of high-end systems, based on multicores and accelerators, is already measured in PetaFlop/s. This tremendous computational power can only be fully utilized with the appropriate software infrastructure, both at the low end (desktop, server) and at the high end (supercomputer installation). Most often a major part of the computational effort in scientific and engineering computing goes into solving linear algebra sub-problems. After providing a historical overview of legacy software packages, the tutorial surveys the current state-of-the-art numerical libraries for solving problems in linear algebra, both dense and sparse. PLASMA, MAGMA and Trilinos software packages are discussed in detail. The tutorial also highlights recent advances in algorithms that minimize communication, i.e. data motion, which is much more expensive than arithmetic.
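
For context, here is the kind of BLAS kernel these libraries build on and tune, shown through the standard CBLAS interface rather than the PLASMA, MAGMA, or Trilinos APIs themselves; the matrices are tiny and purely illustrative.

    #include <stdio.h>
    #include <cblas.h>

    int main(void)
    {
        enum { N = 2 };
        /* Row-major 2x2 matrices: C = 1.0 * A * B + 0.0 * C */
        double A[N * N] = {1, 2, 3, 4};
        double B[N * N] = {5, 6, 7, 8};
        double C[N * N] = {0, 0, 0, 0};

        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, A, N, B, N, 0.0, C, N);

        printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
        return 0;
    }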


Debugging and Optimizing MPI and OpenMP Applications Running on CUDA, OpenACC®, and Intel® Xeon Phi Coprocessors with TotalView® and ThreadSpotter

Date: Monday, November 18th
Time: 8:30am-5pm
Presenter(s): Chris Gottbrath, Sandra Wienke, Mike Ashworth, Vincent C. Betro, Jessica Fishman, Nikolay Piskun

Abstract: With High-Performance Computing trends heading towards increasingly heterogeneous solutions, scientific developers face challenges adapting software to leverage these new systems. For instance, many systems feature nodes that couple multi-core processors with GPU-based computational accelerators, like the NVIDIA® Kepler, or many-core coprocessors, like the Intel® Xeon Phi coprocessor. In order to utilize these systems, scientific programmers need to leverage as much parallelism in applications as possible. Developers also need to juggle technologies including MPI, OpenMP, CUDA, and OpenACC. While troubleshooting, debugging, and optimizing applications are an expected part of porting, they become even more critical with the introduction of so many technologies. This tutorial provides an introduction to parallel debugging and optimization. Debugging techniques covered include: MPI and subset debugging, process and thread sets, reverse and comparative debugging, and techniques for CUDA, OpenACC, and Intel Xeon Phi coprocessor debugging. Participants will have the opportunity to do hands-on CUDA and Intel Xeon Phi coprocessor debugging using TotalView on a cluster at RWTH Aachen University and on Keeneland and Beacon at NICS. Therefore, it is recommended that participants bring a network-capable laptop to the session. Optimization techniques will include profiling, tracing, and cache memory optimization. Examples will use ThreadSpotter and vendor-supplied tools.


Python in HPC

Date: Monday, November 18th
Time: 8:30am-5pm
Presenter(s): Kurt W. Smith, Travis Oliphant, Aron J. Ahmadia, Andy Terrel

Abstract: The Python ecosystem empowers the HPC community with a stack of tools that are not only powerful but a joy to work with. It is consistently one of the top languages in HPC, with a growing, vibrant community of open source tools. Proven to scale on the world's largest clusters, it is a language that has continued to innovate with a wealth of new data tools. This tutorial will survey the state-of-the-art tools and techniques used by HPC Python experts throughout the world. The first half of the day will include an introduction to the standard toolset used in HPC Python and techniques for speeding up Python and using legacy codes by wrapping Fortran and C. The second half of the day will include discussion of using Python in a distributed workflow via MPI and tools for handling Big Data analytics. Students should be familiar with basic Python syntax; we recommend the Python 2.7 tutorial on python.org. We will include hands-on demonstrations of building simulations, wrapping low-level code, executing on a cluster via MPI, and use of big data tools. Examples for a range of experience levels will be provided.


Effective Procurement of Supercomputers

Date: Monday, November 18th
Time: 8:30am-12pm
Presenter(s): Andrew Jones, Jonathan Follows, Terry Hewitt

Abstract: In this tutorial we will guide you through the process of purchasing a cluster, HPC system or supercomputer. We will take you through the whole process from engaging with major stakeholders in securing the funding, requirements capture, market survey, specification of the tender/request for quote documents, engaging with suppliers, and evaluating proposals. You will learn how to specify what you want, yet enable the suppliers to provide innovative solutions beyond your specification both in technology and in the price, and then how to demonstrate to stakeholders that the solution you select is value for money. The tutorial will be split into three major parts: the procurement process, the role of benchmarks and market surveys, and specification and evaluation. The presenters have been involved in major national and international HPC procurements since 1990 both as bidders and as customers. Whether you are spending $100k or $100M you will benefit from this tutorial.


Advanced PGAS Programming in UPC

Date: Monday, November 18th
Time: 8:30am-12pm
Presenter(s): Yili Zheng, Katherine Yelick

Abstract: Partitioned Global Address Space (PGAS) languages combine the convenience of shared memory programming with the locality control needed for scalability. There are several PGAS languages based on a variety of serial languages (e.g., Fortran, C, and Java) and different parallel control constructs, but all with a similar model for building shared data structures. This tutorial will focus on performance programming in the UPC language, which is a strict superset of ISO C. Following a short introduction to the basic constructs in UPC, the tutorial will cover effective parallel programming idioms used in UPC applications and performance optimization techniques. This will include the design of pointer-based data structures, the proper use of locks, and the memory consistency model. The tutorial will also cover the latest extensions of the language, both those in the current specification and others being considered for the next version of the language specification.
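
For a flavor of the language, here is a small UPC sketch of a shared array and the upc_forall affinity construct (UPC extends C, so the example stays in the same language as the others); the array and its initialization are illustrative only.

    #include <upc.h>
    #include <stdio.h>

    #define N 256

    /* Shared array distributed cyclically across threads; upc_forall runs
       each iteration on the thread with affinity to the referenced element. */
    shared double a[N * THREADS];

    int main(void)
    {
        int i;

        upc_forall (i = 0; i < N * THREADS; i++; &a[i])
            a[i] = MYTHREAD + i * 0.001;

        upc_barrier;

        if (MYTHREAD == 0)
            printf("a[1] = %f (THREADS = %d)\n", a[1], THREADS);
        return 0;
    }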


How to Analyze the Performance of Parallel Codes 101

Date: Monday, November 18th
Time: 8:30am-12pm
Presenter(s): Jennifer Green, James E. Galarowicz, Martin Schulz, Mathew P. Legendre, Don Maghrak

Abstract: Performance analysis is an essential step in the development of HPC codes. It will only gain in importance with the rising complexity of the machines and applications we see today. Many tools exist to help with this analysis, but the user is too often left alone with interpreting the results. In this tutorial we will provide a practical road map for the performance analysis of HPC codes and give users step-by-step advice on how to detect and optimize common performance problems in HPC codes. We will cover both on-node performance and communication optimization and will also touch on threaded and accelerator-based architectures. Throughout this tutorial, we will show live demos using Open|SpeedShop, a comprehensive and easy-to-use performance analysis tool set, to demonstrate the individual analysis steps. All techniques will, however, apply broadly to any tool, and we will point out alternative tools where useful.


An Overview of Fault-Tolerant Techniques for HPC

Date: Monday, November 18th
Time: 1:30pm-5pm
Presenter(s): Yves Robert, Thomas Herault

Abstract: Resilience is a critical issue for large-scale platforms. This tutorial provides a comprehensive survey on fault-tolerant techniques for high-performance computing. It is organized along four main topics: (i) An overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal); (ii) Application-specific techniques, such as ABFT for grid-based algorithms or fixed-point convergence for iterative applications; (iii) General-purpose techniques, which include several checkpoint and rollback recovery protocols, possibly combined with replication; and (iv) An evaluation and comparison of relevant execution scenarios through quantitative models (from Young's approximation to Daly's formulas and recent work). The tutorial is open to all SC13 attendees who are interested in the current status and expected promise of fault-tolerant approaches for scientific applications. There are no audience prerequisites; background will be provided for all protocols and probabilistic models. Only the last part of the tutorial devoted to assessing the future of the methods will involve more advanced analysis tools.
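
For readers unfamiliar with the quantitative models mentioned in (iv), Young's classic first-order approximation of the optimal checkpointing period can be written (in LaTeX) as below; Daly's formulas refine this estimate with higher-order terms and the restart overhead, as covered in the tutorial.

    % Young's first-order approximation of the optimal checkpoint period,
    % with C the time to write a checkpoint and M the platform MTBF:
    T_{\mathrm{opt}} \approx \sqrt{2\,C\,M}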


Effective HPC Visualization and Data Analysis using VisIt

Date: Monday, November 18th
Time: 1:30pm-5pm
Presenter(s): Hank Childs, Jean M. Favre, Cyrus Harrison, Brad Whitlock, Harinarayan Krishnan

Abstract: Visualization is an essential component of the scientific discovery process. Scientists and businesses running HPC simulations leverage visualization tools for data exploration, quantitative analysis, visual debugging, and communication of results. This half-day tutorial will provide attendees with a practical introduction to mesh-based HPC visualization using VisIt, an open source parallel scientific visualization and data analysis platform. We will provide a foundation in basic HPC visualization practices and couple this with hands-on experience creating visualizations. This tutorial includes: 1) An introduction to visualization techniques for mesh-based simulations. 2) A guided tour of VisIt. 3) Hands-on demonstrations of end-to-end visualizations of both a fluid simulation and a climate simulation. This tutorial builds on the past success of VisIt tutorials, updated and anchored with new concrete use cases. Attendees will gain practical knowledge and recipes to help them effectively use VisIt to analyze data from their own simulations.


Introducing R: from Your Laptop to HPC and Big Data

Date: Monday, November 18th
Time: 1:30pm-5pm
Presenter(s): George Ostrouchov, Mark Schmidt

Abstract: The R language has been called the lingua franca of data analysis and statistical computing, and is quickly becoming the de facto standard for analytics. As such, R is the tool of choice for many working in the fields of machine learning, statistics, and data mining. This tutorial will introduce attendees to the basics of the R language with a focus on its recent high performance extensions enabled by the "Programming with Big Data in R" (pbdR) project. Although R has a reputation for lacking scalability, our initial experiments with pbdR have easily scaled to 12 thousand cores. No background in R is assumed but even R veterans will benefit greatly from the session. We will cover only those basics of R that are needed for the HPC portion of the tutorial. The tutorial is very much example-oriented, with many opportunities for the engaged attendee to follow along. Examples will utilize common data analytics techniques, such as principal components analysis and cluster analysis.


Questions: tutorials@info.supercomputing.org