Welcome to the High-Performance Computing Technologies Course
This is a two-week course introducing the basics of building, configuring, and operating high-performance computing (HPC) clusters. The goal of this course is to give students a solid foundation for understanding the components of a cluster, how they are typically configured, and what challenges operating such a machine entails.
Over these two weeks, students will learn firsthand how to set up their own clusters from scratch. We will explain the network administration basics needed to provision cluster nodes over the network, and students will learn how to manage multiple systems without physical access.
Automation and a uniform configuration play a central role in managing such large systems. We will use modern techniques of configuration management and show how to use Ansible to simplify repetitive tasks.
One central component of a cluster is its batch system and job scheduler. Students will learn how to set up the software environment needed to support batch job workflows, how to install and manage software on multiple compute nodes using software modules, and how to configure essential software components for massively parallel applications.
The course will conclude by showcasing and discussing real-world deployments.
ToDo List
Networking Basics
Hardware
Exercises
- Exercise 1: Configuring hardware and network
- Exercise 2: Setting up a minimal netboot environment
- Exercise 3: Setting up a netboot environment with Cobbler
- Exercise 4: Setting up a cluster environment
- Exercise 5: RPM repositories with Cobbler
- Exercise 6: Software modules
- Exercise 7: Configuration with Ansible
- Exercise 8: Creating a Batch System with Slurm
- Exercise 9: OpenMPI and InfiniBand
- Exercise 10: Monitoring with Ganglia
Appendix