SkePU · Tutorial Abstract · Outline · Prerequisite Knowledge · Special Requirements · Organizers · PPoPP'21

SkePU Tutorial at ACM PPoPP'21 Conference, 27 Feb. 2021

Portable Programming of Heterogeneous Parallel Systems with SkePU

Christoph Kessler, August Ernstsson, Johan Ahlqvist
Linköping University, Linköping, Sweden

Tutorial Abstract

SkePU is a C++ based open-source framework for high-level portable programming of heterogeneous parallel systems using algorithmic skeletons. Skeletons are generic programming constructs based on higher-order functions, such as map, stencil, reduce, scan, that implement algorithmic patterns for which platform-specific parallel implementations may exist. Skeletons provide a high degree of abstraction and portability with a quasi-sequential programming interface, as their implementations encapsulate all low-level and platform-specific details such as parallelization, synchronization, communication, memory management, accelerator usage and other optimizations.

From the same high-level source code using SkePU skeletons and data-containers, the framework generates target code for multicore CPU execution, CUDA and OpenCL code for execution on GPU and multi-GPU systems (and other OpenCL devices such as Xeon-Phi), hybrid CPU-GPU execution, and on HPC clusters. SkePU transparently performs a number of run-time optimizations, including data transfer minimization, data locality optimizations and auto-tuned back-end selection.

The SkePU framework is organized as an include-only template library plus a light-weight pre-compiler based on LLVM clang; the cluster backend uses the MPI support in the StarPU runtime system. The development of SkePU started in 2010, and we recently presented the third generation of the framework. Beyond cooperative research projects such as H2020 EXA2PRO, we use SkePU since several years as a lab environment for teaching high-level parallel programming in our master-level course on multicore and GPU programming at Linköping University, Sweden.

In this tutorial we present SkePU with its skeletons and data-container types, guide participants through some hands-on programming exercises in SkePU, and also point out some of the more recently added new features such as multi-variant user functions (which can be used by platform expert programmers to exploit e.g. custom SIMD instructions without compromising the portability of the SkePU program). The source code of SkePU is available at skepu.github.io; for fast installation we also provide a binary distribution of SkePU for x86-64 Linux.

The objective of this tutorial is to provide a good overview of the capabilities and ease of use of SkePU to potential users in industry and academia, as well as to researchers and teachers in the PPoPP scientific community.

Contents / Schedule

Introduction to parallel programming patterns and skeleton programming in general (ca. 30min, C.K.)

Overview of SkePU (fundamental concepts including code examples): skeletons, data-containers and container proxies, backends, managed and unmanaged scopes, memory coherence management and I/O encapsulation, SkePU framework structure (ca. 45min, A.E./C.K.)

Coffee break and help with installation where required (A.E., J.A.)

Walk-through of a complete SkePU example (ca. 30min, A.E., J.A.)

Advanced issues in SkePU: backend selection tuning, multi-return, multi-variant user functions for custom SIMDization. Optimizations in SkePU. (ca. 45min, C.K., A.E.)

Another hands-on programming exercise, as far as time permits. (A.E., J.A.)

Prerequisite Knowledge

General interest and some background in parallel and heterogeneous computing (this should be no problem for anyone attending PPoPP).

For the programming exercises, some background in modern C++ and programming under Linux.

Special Requirements

SkePU runs under Linux. We encourage participants to bring their laptop for hands-on programming in SkePU, and we try to help with installation issues if there should be any.

We will also demo some examples on screen for all participants who prefer not to try it out immediately.

We encourage participants to install SkePU before the tutorial so we can make better use of the time.

Option 1: Use provided binary tutorial distribution

The source-to-source compilation tool is provided as a binary for x86-64 Linux systems. This is recommended for the tutorial, as it is quicker to get started.

Download the archive (external link)

To build a SkePU program, navigate to the examples subdirectory and run make parallel/map, where map may be replaced by the file name of the target source file (without extension). The resulting program binary is located at ./parallel/map.

Option 2: Build from source

Clone the SkePU GitHub repository and follow the instructions in the User Guide.

This will build the source-to-source compiler tool from the LLVM/Clang sources. This option is recommended for long-term SkePU development, but requires more disk space and compilation time (up to an hour depending on system).

Follow the user guide for instructions how to build the example programs with Cmake.

About the Organizers

Christoph Kessler is a professor for Computer Science at Linköping University, Sweden, where he leads the Programming Environment Laboratory's research group on compiler technology and parallel computing. His research interests include parallel programming models, compiler technology, code generation, combinatorial optimization, and software composition/synthesis. He has published two books, several book chapters and more than 120 scientific papers in international peer-reviewed journals and conferences.

August Ernstsson, Tek.Lic., M.Sc. in computer engineering, is a final-year PhD student at Linköping University and the main developer of SkePU since 2016. He has published 5 articles about his work on SkePU in international peer-reviewed journals and conferences. He recently defended his Licentiate thesis about the SkePU 3 design. His main research interests are in high-level parallel programming interface design targeting multi-core heterogeneous systems and clusters. His recent work concerns how modern C++ can be leveraged and adapted to fit this purpose.

Johan Ahlqvist is a research engineer in the SkePU development team at Linköping University. His research interests are in parallel computing, advanced C++ and compiler technology. He has developed the cluster backend for SkePU atop StarPU-MPI and the SkePU plug-in to the Mercurium compiler framework.

Exercises

1. Modular addition

Recommended skeletons: Map

Implement the modular additions example from the lecture. Code snippets are provided in the slides, but you will have to integrate the snippets in a full program.

2. Linear combination of vectors

Recommended skeletons: Map

a) Write a SkePU program that takes a Vector<float> v and a scalar a, and computes a Vector<float> that for each element index i contains v[i] * a.
Consider the difference between element-wise containers and scalars as arguments to a SkePU skeleton.

b) Extend the program so that it instead takes two Vector<float> and two scalars a and b, and computes a Vector<float> that for each element index i contains v1[i] * a + v2[i] * b. 
What changes need to be made?

c) (Advanced) To make this program handle a dynamic amount of vectors and scalars, they would have to be embedded within a Matrix<float> and a vector, respectively.
 Write a new program that performs this computation. 
There are several ways to approach this. Carefully consider how SkePU performs data parallelism on container elements in Map with matrices before you start.


Hint: Element index retrieval in the user function can be useful.

Hint: Using Map + Reduce in sequence can be worth considering.

3. Complex multiplication

Recommended skeletons: Map

Implement a SkePU program that performs element-wise complex multiplication of two Vector<MyComplex>. You will have to define a struct MyComplex with real and imaginary float members.

4. Data generation

Recommended skeletons: Map, MapReduce (b)

a) For a certain application, a Vector<float> needs to be pre-initialized with, for some float x, x raised to subsequent powers. So for each index i, the returned value shall be pow(x, i).

Hint: Element index retrieval in the user function can be useful.

Hint: Remember the element-wise "arity" of a Map skeleton can be 0.

b) The application in question is Taylor series approximation. Create a new program (base it on the solution in a) that computes a scalar value that is the approximation for log(1+x) for some float x.

Hint: You may want to use the MapReduce instance function setDefaultSize(size_t) for this.

Hint: It is possible to encode this in a single MapReduce call. The only input to the skeleton call will then be x.

5. Minimum, maximum, average

Recommended skeleton: Reduce

Create a SkePU program that, from a Vector<float>, computes the minimum, maximum, and average values. Is it possible to do this in one single Reduce call in SkePU?

Hint: Reduce relies on the associativity property of the user function.

6. Average sum-of-rows

Recommended skeleton: Reduce

Write a SkePU program that, from a Matrix<float>, computes the sum of each row and returns the average of those sums in the matrix.

Hint: Reduce has 2D reduce modes for this purpose.

7. Prefix count

Recommended skeleton: Scan

Write a SkePU program, that, given a Vector<float>, returns a Vector which in each element contains the count of non-zero elements up until that point.

8. Rolling average

Recommended skeleton: MapOverlap

a) Write a SkePU program that, given a Vector<float>, computes a Vector<float> containing the rolling averages (the average value of r elements up to and including the current one).

b) Extend your program so that more recent elements (closer) have a larger impact to the weighted average. The weights shall be (1 - distance/radius).

c) Make the weights dynamic by supplying them in a Vector<float> to the user function, as a random-access argument.

Hint: The size of the weight vector is the overlap radius + 1.

9. Heat Propagation

Recommended skeleton: MapOverlap

Create a SkePU program that performs iterative heat propagation in a rectangular, 2D grid. For each iteration, the new value for each element is assigned the average of the four neighbouring values.

Hint: You will want to place iteration in the main program, not in the user function.

Hint: You will need to double-buffer the grid. How can this be neatly done in the C++ interface of SkePU?

SkePU · Tutorial Abstract · Outline · Prerequisite Knowledge · Special Requirements · Organizers · PPoPP'21