The Fourteenth International Workshop on Accelerators and Hybrid Emerging Systems (AsHES)
To be held in conjunction with the 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS)
San Francisco, California, USA
May 27, 2024
Opening Remarks
10:30 am - 10:40 am
Session 1: High-Performance Computing
10:40 am - 12:00 pm
Session Chair: Shintaro Iwasaki, Meta
10:40 am - 11:00 am
Performance Versus Maintainability: A Case Study of Scream on Frontier
James White
11:00 am - 11:30 am
ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels
Ali Tehranijamsaz, Alok Mishra, Akash Dutta, Abid M. Malik, Barbara Chapman, and Ali Jannesari
11:30 am - 12:00 pm
Alternative Quadrant Representations with Morton Index and AVX2 Vectorization for AMR Algorithms within the p4est Software Library
Mikhail Kirilin and Carsten Burstedde
Lunch Break
12:00 pm - 1:00 pm
- Lunch will not be provided by the conference.
Keynote
1:00 pm - 2:00 pm
Block-based GPU Programming with Triton
Philippe Tillet, OpenAI
Abstract: Traditional single instruction, multiple threads (SIMT) programming with CUDA can be daunting to machine learning researchers in need of fast custom kernels. This can significantly slow down the evaluation of novel research ideas that cannot be neatly decomposed into a set of pre-built, vendor-optimized primitives. In this talk, we will shed light on an alternative programming model which, while relatively high-level, aims to be more expressive than common graph compilers (e.g., XLA, Torch-Inductor) and to enable the use of custom data structures (e.g., linked lists, block-sparse tensors). We will specifically discuss the design and implementation of Triton, a mid-level programming language that uses block-based abstractions to simplify kernel development for researchers without deep GPU programming expertise.
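To make the block-based model concrete, below is a minimal vector-addition sketch in the style of the public Triton tutorials; it is illustrative only, not code from the talk, and the names add_kernel and add are our own. Each program instance operates on a whole BLOCK_SIZE-wide block of elements at once, with a mask guarding out-of-bounds lanes, rather than on a single SIMT thread's element.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance handles one contiguous block of elements.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements  # guard the ragged final block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = out.numel()
        # Launch a 1-D grid with one program per block of elements.
        grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

Note that the block shape is explicit (BLOCK_SIZE is a compile-time constant), while intra-block parallelization and memory coalescing are left to the compiler; this is the division of labor the abstract contrasts with hand-written SIMT CUDA.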
Bio: Philippe Tillet first began working with GPUs in 2011 as a contributor to the ViennaCL library. He received his B.S. from Telecom SudParis (France) in 2012, his M.S. from NCTU (Taiwan) in 2014, and his Ph.D. from Harvard University in 2020. He joined OpenAI full-time in 2020 to continue his work on the Triton compiler, a project he started in 2018 after being frustrated by the difficulty of writing auto-tuners for matrix multiplications in CUDA. Since then, he has grown Triton into a reference block-based programming language and used it to write all of the training kernels used by GPT-4.
Session 2: Accelerating AI/ML Workloads
2:00 pm - 3:10 pm
Session Chair: Carl Pearson, Sandia National Laboratories
2:00 pm - 2:30 pm
Avoiding Training in the Platform-Aware Optimization Process for Faster DNN Latency Reduction
Raúl Marichal, Ernesto Dufrechou, and Pablo Ezzatti
2:30 pm - 2:50 pm
A Comparative Study on Simulation Frameworks for AI Accelerator Evaluation
Christoffer Åleskog, Håkan Grahn, and Anton Borg
2:50 pm - 3:10 pm
Extending the SYCL Joint Matrix for Binarized Neural Networks
Zheming Jin