This one-day workshop, jointly organized by GDR ISIS (theme C) and the Groupe Calcul, aims to survey the challenges and ongoing work on accelerating scientific computing with FPGAs. The growing maturity of the high-level synthesis tools offered by Intel-Altera and AMD-Xilinx makes the computing power of FPGAs more accessible. For hardware designers, the shorter development time these tools enable allows faster prototyping of their designs. For HPC (High Performance Computing) practitioners, the FPGA is becoming an alternative acceleration architecture to GPUs, reachable through a development flow that has become purely software on the user side.
The workshop will be held in hybrid mode, with the option of attending via videoconference. Please contact email@example.com and firstname.lastname@example.org to register for remote attendance and receive the videoconference link. On-site registration is closed.
This day of seminars and experience reports will conclude a 4-day training course organized by the Groupe Calcul on the same topic: https://calcul.math.cnrs.fr/2022-07-atelier-fpga.html (registration closed). Note that the presentations (excluding hands-on sessions) of July 3, 4 and 5 will be accessible remotely on request.
Organizing committee (Groupe Calcul training + GDR ISIS/Groupe Calcul workshop)
- Mickaël Dardaillon (INSA Rennes / IETR)
- Nicolas Gac (L2S - Université Paris Saclay)
- Matthieu Haefele (CNRS/UPPA)
- Shan Mignot (Laboratoire Lagrange)
- Charles Prouveur (CEA)
- Antsa Randriamanantena (CNRS/LAB)
- Bogdan Vulpescu (CNRS/IN2P3/UCA)
9h00-9h30 : Welcome coffee
9h30-9h45 : Welcome
Matthieu Haefele and Nicolas Gac
9h45-10h30 : Kentaro Sano (RIKEN Center for Computational Science, Japan)
ESSPER: FPGA Cluster for Research on Reconfigurable HPC with Supercomputer Fugaku
10h30-11h15 : Stefano Corda (EPFL)
Reduced-Precision Acceleration of Radio-Astronomical Imaging on Xilinx FPGAs
11h15-11h45 : Coffee break
11h45-12h30 : Omar Hammami (ENSTA)
Heterogeneous Embedded Multicore Design Graduate Education at ENSTA Paris: A 5-Year Feedback
12h30-14h00 : Lunch break
14h00-15h00 : Christophe ALIAS (INRIA / LIP, ENS-Lyon)
Compiling circuits with polyhedra
15h00-15h45 : Steven Derrien (ISTIC-IRISA)
Toward Speculative Loop Pipelining for High-Level Synthesis
15h45-16h15 : Break
16h15-16h45 (remote) : Suleyman Demirsoy (Intel)
Using Unified Shared Memory and External Function Interface with oneAPI
16h45-17h45 : Round table
Kentaro Sano (RIKEN Center for Computational Science, Japan)
Title: ESSPER: FPGA Cluster for Research on Reconfigurable HPC with Supercomputer Fugaku
Abstract: At RIKEN Center for Computational Science (R-CCS), we have been developing an experimental FPGA cluster named "ESSPER (Elastic and Scalable System for high-PErformance Reconfigurable computing)," a research platform for reconfigurable HPC. ESSPER is composed of sixteen Intel Stratix 10 SX FPGAs connected to each other by a dedicated 100Gbps inter-FPGA network. We have developed our own shell (SoC) and its software APIs for the FPGAs, supporting inter-FPGA communication. The FPGA host servers are connected to a 100Gbps Infiniband switch, which allows distant servers to access the FPGAs remotely through a software-bridged version of Intel's OPAE FPGA driver, called R-OPAE. Through the 100Gbps Infiniband network and R-OPAE, ESSPER is connected to the world's fastest supercomputer, Fugaku, deployed at RIKEN, so that from Fugaku we can program bitstreams onto the FPGAs remotely and offload tasks to them. In this talk, I introduce ESSPER's concept, its hardware and software stack, the programming environment, and applications under development, as well as our future prospects for reconfigurable HPC.
Stefano Corda (EPFL)
Title: Reduced-Precision Acceleration of Radio-Astronomical Imaging on Xilinx FPGAs
Abstract: Modern radio telescopes such as the Square Kilometre Array (SKA) produce large volumes of data that need to be processed to obtain high-resolution sky images. This is a complex task that requires computing systems that provide both high performance and high energy efficiency. Hardware accelerators such as GPUs (Graphics Processing Units) and FPGAs (Field Programmable Gate Arrays) can provide these two features and are thus an appealing option for this application. Most HPC (High-Performance Computing) systems operate in double precision (64-bit) or in single precision (32-bit), and radio-astronomical imaging is no exception. With reduced precision computing, smaller data types (e.g., 16-bit) aim at improving energy efficiency and throughput performance in noise-tolerant applications. We demonstrate that reduced precision can also be used to produce high-quality sky images. To this end, we analyze the gridding component (Image-Domain Gridding) of the widely-used WSClean imaging application. Gridding is typically one of the most time-consuming steps in the imaging process and, therefore, an excellent candidate for acceleration. We identify the minimum required exponent and mantissa bits for a custom floating-point data type. Then, we propose the first custom floating-point accelerator on a Xilinx Alveo U50 FPGA using High-Level Synthesis. Our reduced-precision implementation improves the throughput and energy efficiency by 1.84x and 2.03x, respectively, compared to the single-precision floating-point baseline on the same FPGA. Our solution is also 2.12x faster and 3.46x more energy-efficient than an Intel i9 9900k CPU (Central Processing Unit) and manages to keep up in throughput with an AMD RX 550 GPU.
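To make the idea of a custom floating-point type concrete, here is a small Python sketch of mantissa-width exploration. This is not the authors' accelerator code; the `quantize` helper and its parameters are illustrative assumptions, modeling only round-to-nearest mantissa truncation (no overflow or subnormal handling).

```python
import math

def quantize(x: float, man_bits: int) -> float:
    """Round x to man_bits explicit mantissa bits (round-to-nearest),
    keeping the exponent exact -- a simple software model of a custom
    reduced-precision float (overflow/subnormals not modeled)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    scale = 1 << (man_bits + 1)     # hidden bit + man_bits fraction bits
    return math.ldexp(round(m * scale) / scale, e)

# The relative rounding error is bounded by 2**-(man_bits + 1), which is
# the kind of bound compared against the application's noise floor when
# picking the minimum mantissa width.
x = math.pi
for bits in (23, 10, 7):            # float32-like, float16-like, smaller
    err = abs(quantize(x, bits) - x) / x
    assert err <= 2.0 ** -(bits + 1)
```

Sweeping `man_bits` over a representative dataset and checking the resulting image quality is one way such a minimum bit-width study can be framed.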
Omar Hammami (ENSTA)
Title: Heterogeneous Embedded Multicore Design Graduate Education at ENSTA Paris: A 5-Year Feedback
Abstract: In this talk, we present five years of feedback on training graduate students at ENSTA Paris, the oldest engineering school in France, in heterogeneous embedded multicore design on the Xilinx Zynq SoC. As part of the ROB 307 MPSOC (Multiprocessor System on Chip) course, students are required to design a heterogeneous embedded multicore combining a dual-core hard IP (ARM9), 4 MicroBlaze soft cores, 2 hardware accelerators (neural network, vision, image processing) and an AXI NoC (Network on Chip) on a single Zynq XC7Z020 chip using a ZedBoard. Students are expected to validate their design through actual execution on the ZedBoard, with all IPs running concurrently. This project has been running for the past 5 years, and we will share our experience from this training.
Christophe ALIAS (INRIA / LIP, ENS-Lyon)
Title: Compiling circuits with polyhedra
Abstract: Hardware accelerators are unavoidable for improving the performance of computers under a bounded energy budget. In particular, FPGAs allow building dedicated circuits from a gate-level description, enabling a very advanced level of optimization. High-level synthesis (HLS) tools let the programmer target FPGAs without hardware-specific constraints by compiling a C specification into a circuit. Code optimizations in these tools remain rudimentary (loop unrolling, pipelining, etc.) and are most often left to the programmer. The polyhedral model, born from research on systolic circuits, offers a powerful tool for optimizing HPC compute kernels. In this seminar, I will show a few interconnections between HLS and the polyhedral model, either as a preprocessing (source-to-source) step or as a synthesis tool (optimizing the circuit using a dataflow intermediate representation). In particular, I will present a dataflow formalism that allows reasoning geometrically about circuit synthesis.
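As a toy illustration of this geometric view (our example, not material from the talk): the iteration domain of a triangular loop nest is a set of integer points satisfying affine constraints, and a schedule is a reordering of those points; loop interchange, for instance, is just the lexicographic order on swapped coordinates.

```python
# Iteration domain of:   for i in 0..N-1:  for j in i..N-1:  S(i, j)
# written as affine constraints {(i, j) : 0 <= i, i <= j, j <= N-1}.
N = 4
constraints = [
    lambda i, j: i >= 0,
    lambda i, j: j - i >= 0,
    lambda i, j: N - 1 - j >= 0,
]
domain = [(i, j) for i in range(N) for j in range(N)
          if all(c(i, j) for c in constraints)]

# The original execution order is the lexicographic order on (i, j)...
original = [(i, j) for i in range(N) for j in range(i, N)]
assert domain == original

# ...and a new schedule is a reordering of the same points: the
# schedule (i, j) -> (j, i) performs loop interchange.
interchanged = sorted(domain, key=lambda p: (p[1], p[0]))
assert interchanged == [(i, j) for j in range(N) for i in range(j + 1)]
```

Polyhedral compilers manipulate such domains and schedules symbolically (for parametric N) rather than by enumeration, but the geometric intuition is the same.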
Steven Derrien (ISTIC-IRISA)
Title: Toward Speculative Loop Pipelining for High-Level Synthesis
Abstract: Loop pipelining (LP) is a key optimization in modern high-level synthesis (HLS) tools for synthesizing efficient hardware datapaths. Existing techniques for automatic LP are limited by static analysis that cannot precisely analyze loops with data-dependent control flow and/or memory accesses. We propose a technique for speculative LP that handles both control-flow and memory speculations in a unified manner. Our approach is entirely expressed at the source level, allowing seamless integration into development flows using HLS. Our evaluation shows significant improvement in throughput over standard loop pipelining techniques.
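As a rough software analogy (our sketch, not the authors' transformation), control speculation for a loop with a data-dependent exit can be modeled at the source level by optimistically executing a fixed number of iterations per "pipeline wave" and squashing the ones issued after the exit condition became true:

```python
def steps_sequential(x: int) -> int:
    """Data-dependent loop: the trip count is unknown statically,
    which defeats standard static loop pipelining."""
    n = 0
    while x > 1:
        x = 3 * x + 1 if x % 2 else x // 2
        n += 1
    return n

def steps_speculative(x: int, ii: int = 4) -> int:
    """Source-level model of control speculation: each wave executes
    `ii` iterations without checking the exit condition, then commits
    only the iterations preceding the exit; the rest are squashed."""
    n = 0
    while x > 1:
        trace, y = [], x
        for _ in range(ii):            # speculative execution
            y = 3 * y + 1 if y % 2 else y // 2
            trace.append(y)
        for y in trace:                # commit / squash
            n += 1
            x = y
            if x <= 1:
                break                  # squash remaining iterations
    return n

# Speculation must never change the result, only the (modeled) schedule.
assert all(steps_speculative(x) == steps_sequential(x) for x in range(1, 50))
```

In hardware the speculative wave corresponds to iterations in flight in the pipeline, and the commit/squash logic to the rollback mechanism the source-level transformation has to generate.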
Suleyman Demirsoy (Intel), remote
Title: Using Unified Shared Memory and External Function Interface with oneAPI
Abstract: The Unified Shared Memory (USM) abstraction offers significant ease of use and, in some cases, performance benefits when critical functions are offloaded to an accelerator such as an FPGA. Some of these critical functions would also benefit from lower-level customization that is possible at the RTL level but not easy to capture in oneAPI code. In this talk, we will look more closely into both topics, as a follow-up to the main oneAPI introduction presented earlier in the conference.