
ALGORITHMS FOR PARALLEL COMPUTING

Academic year and teacher
Academic year
2017/2018
Teacher
ENRICO CALORE
Credits
6
Didactic period
First Semester
SSD
MAT/08

Training objectives

The main objective of this course is to provide students with the basics of parallel computing, for both distributed and shared memory architectures.

Acquired knowledge will be:
- characteristics of modern architectures for parallel computing;
- metrics for evaluating the performance of parallel codes, based both on source code analysis (e.g. CGMA evaluation for GPU code) and on run-time measurements (speedup, efficiency and the Kuck function for MPI implementations; see the definitions after this list);
- introduction to the MPI paradigm for distributed-memory architectures: point-to-point and collective communications, reduction and synchronization directives;
- introduction to OpenMP and OpenACC for shared-memory architectures;
- elements of GPU hardware architecture and their impact on programming methodologies;
- elements of the CUDA framework: memory hierarchy, device-host data transfers, management of thread blocks and shared memory;
- CUDA events for measuring kernel execution and data-transfer times;
- examples of the use of widely used libraries for linear algebra, factorization and eigenvalue problems (BLAS, ATLAS, LAPACK, MKL) and their parallel implementations;
- examples of the use of different profilers for CPU and GPU application performance analysis.
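
For reference, denoting by T(1) the serial run time and by T(p) the run time on p processors, the metrics above are commonly defined as follows (the Kuck function is taken here as the product of speedup and efficiency):

    S(p) = T(1) / T(p)                 (speedup)
    E(p) = S(p) / p                    (efficiency)
    K(p) = S(p) * E(p) = S(p)^2 / p    (Kuck function)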

The main skills are:
- Analyze serial algorithms and identify a correct workload partitioning methodology to obtain a significant speedup;
- Analyze serial algorithms and identify data access patterns that maximize aligned and contiguous memory accesses;
- Analyze code performance when executing on real architectures.

Prerequisites

Students need to master the following prerequisites, covered by the "Analisi Numerica" course:
- QR factorization, LU factorization and its use for solving linear systems;
- floating-point representation, machine precision, finite-precision arithmetic.

Familiarity with C programming (variable and function definitions, flow control, dynamic memory allocation, pointer arithmetic) is advised.
Any basic C language course (e.g. "Programmazione e laboratorio") is suitable.

Course programme

Introduction
- introduction to parallel computing, CPU architecture, computational cost, matrices and arrays in the C language;
- Flynn's classification of architectures, performance measures (speedup, efficiency, Kuck function), C code to measure serial run time;
- Example: block-based implementation of the matrix-by-matrix product (see the sketch below);
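
A minimal sketch of the block-based product in C, assuming square matrices of order N with N a multiple of the block size BS (both values, and the function name, are illustrative):

    #include <stddef.h>

    #define N  512   /* matrix order (illustrative) */
    #define BS 32    /* block size, chosen so blocks fit in cache */

    /* C += A * B on N x N row-major matrices; C is assumed
       zero-initialized by the caller. Iterating over BS x BS blocks
       keeps each block of A and B in cache while it is reused. */
    void matmul_blocked(const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < N; ii += BS)
            for (size_t kk = 0; kk < N; kk += BS)
                for (size_t jj = 0; jj < N; jj += BS)
                    for (size_t i = ii; i < ii + BS; i++)
                        for (size_t k = kk; k < kk + BS; k++) {
                            double a = A[i * N + k];
                            for (size_t j = jj; j < jj + BS; j++)
                                C[i * N + j] += a * B[k * N + j];
                        }
    }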

Using MPI
- introduction to MPI, scripts for compiling and launching parallel codes, blocking and non-blocking point-to-point communications, definition of deadlock, the send-receive deadlock and its solution using the MPI_Sendrecv function (see the sketch after this list), collective communications, reductions, synchronization directives;
- Example: mergesort algorithm implementation;
- Example: matrix-by-matrix product implementation;
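
A minimal sketch of the send-receive deadlock solution mentioned above: a ring exchange in which every rank sends to its right neighbour and receives from its left one. With plain blocking MPI_Send/MPI_Recv all ranks could block in the send at the same time; MPI_Sendrecv performs both sides of the exchange safely:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, sendbuf, recvbuf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int right = (rank + 1) % size;         /* destination */
        int left  = (rank - 1 + size) % size;  /* source */
        sendbuf = rank;

        /* Send and receive in one call: no cyclic wait can arise. */
        MPI_Sendrecv(&sendbuf, 1, MPI_INT, right, 0,
                     &recvbuf, 1, MPI_INT, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d received %d from rank %d\n", rank, recvbuf, left);
        MPI_Finalize();
        return 0;
    }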

Using CUDA
- introduction to CUDA, hardware architecture description, NVIDIA-provided libraries, on-board and on-chip memories, thread management through blocks and grids, kernel definition and launch, warps and divergence in the execution flow, use of CUDA events (see the sketch after this list);
- Example: matrix-by-matrix product parallelization using shared-memory blocks;
- Example: solving memory-coalescing and thread-divergence problems in the reduction operation;
- Example: identifying data-reuse patterns in the matrix-vector product;
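
A minimal sketch of timing a kernel launch with CUDA events; the saxpy kernel and the launch configuration are illustrative choices, not the course's worked examples:

    #include <cuda_runtime.h>
    #include <stdio.h>

    /* One thread per element: y[i] = a * x[i] + y[i]. */
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                 /* guard: the grid may exceed n */
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));  /* left uninitialized:   */
        cudaMalloc(&y, n * sizeof(float));  /* this sketch only times */

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);   /* wait until the kernel is done */

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel time: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(x);
        cudaFree(y);
        return 0;
    }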

Further topics:
- POSIX threads, OpenMP, OpenACC (an OpenMP sketch follows this list); examples of the use of the gdb, gprof and valgrind tools;
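
A minimal OpenMP sketch of a shared-memory parallel loop; the summed series is an arbitrary example (compile with, e.g., gcc -fopenmp):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        const int n = 1000000;
        double sum = 0.0;

        /* Each thread accumulates a private partial sum; the
           reduction clause combines them at the end of the loop. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += 1.0 / (i + 1.0);

        printf("max threads: %d, sum = %f\n", omp_get_max_threads(), sum);
        return 0;
    }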

Examples of parallelization of common algorithms
- parallelization of the LU factorization for distributed-memory architectures;
- Cuppen's algorithm for the eigenvalues of a tridiagonal matrix and its simplified MPI implementation;
- parallelization of Neville's method for the evaluation of the interpolating polynomial in CUDA;
- the modified Gram-Schmidt method in CUDA;
- solution of a block-bordered system in an MPI environment.

Didactic methods

Main topics will be introduced through slide presentations. Practical examples will be carried out on the coka.unife.it cluster.
Students will be able to use the cluster both in class and throughout the academic year.

Learning assessment procedures

The level of attainment of the learning objectives will be assessed through an oral exam.
Students must submit by e-mail the source code of one of the parallelization examples discussed in class, along with a brief report summarizing the performance of their code as a function of the number of processors used, varying the size of the input data.

During the oral exam, students must be able to retrace the main stages of the parallelization process they followed, emphasizing the key elements of the approach.
Further questions will deal with topics discussed during the course but not directly covered in the project.
The final grade will take into account both the skill shown in implementing the project and the answers given to the subsequent questions.

Reference texts

- Lecture notes.
- Textbook: V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Addison-Wesley, 2003.
- Suggested for further study: D. P. Bertsekas, J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice-Hall, 1989.