Speaker Details

Think Faster: 1300 t/s/u on Llama3 with Groq's Software-Scheduled LPU Inference Engine

Igor Arsovski

Bio

Igor Arsovski is Chief Architect & Fellow at Groq. Prior to Groq, Igor was at Google, where he was responsible for SoC technology strategy for Google Cloud and Infrastructure and led the technology and custom PD effort for the TPU. Before Google, Igor was CTO of the GF/Marvell ASIC business unit, responsible for setting ASIC strategy for a >900-person team while leading technical solutions for data center and automotive ASICs.

Igor is an IBM Master Inventor with >100 patents and >30 IEEE papers, with presentations at premier system (SuperCompute, AI HW Summit, Linley, ML and AI Dev Com) and circuit (ISSCC, VLSI, CICC, ECTC) conferences. He currently serves on the VLSI Technology Program Committee.

Abstract

The race to larger models continues. At the time of writing, some organizations are training 400B+ parameter Large Language Models (LLMs). The larger these training and inference workloads become, the more demand they create for GPU memory capacity. The result has been accelerators that lean on brute-force scaling and complex technology for incremental gains in performance and latency. Groq takes the opposite approach to this scaling challenge, enabling the lowest-latency LLM inference engine.

We'll explain how we co-designed a compilation-based software stack and a class of accelerators called Language Processing Units (LPUs). The core architectural decision underlying LPUs is determinism: systems built from clock-synchronized accelerators are programmed via detailed instruction scheduling. This includes networking, which is also scheduled statically. As a result, LPU-based systems face fewer networking constraints, which makes their SRAM-based design practical without a memory hierarchy. This redesign yields high utilization and low end-to-end system latency. Moreover, determinism enables static reasoning about program performance - the key to kernel-free compilation.
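To make the idea of static scheduling concrete, here is a minimal, hypothetical sketch - not Groq's compiler - of how a scheduler can assign every operation, including network sends, a fixed start cycle once operation latencies are deterministic. The LATENCY table, the schedule function, and the cycle counts are invented for illustration.

```python
# Illustrative sketch only: with deterministic, fixed operation latencies,
# a scheduler can give every op -- including network sends -- an exact start
# cycle, so end-to-end latency is known statically at compile time.

# Hypothetical fixed latencies (in cycles) for each op type.
LATENCY = {"matmul": 400, "activation": 32, "net_send": 120}

def schedule(ops):
    """Assign a start cycle to each (name, op_type, deps) tuple.

    Because every latency is deterministic, each op starts exactly when
    its last dependency finishes -- no dynamic arbitration or queues.
    """
    finish = {}  # op name -> cycle at which its result is ready
    plan = []
    for name, op_type, deps in ops:
        start = max((finish[d] for d in deps), default=0)
        finish[name] = start + LATENCY[op_type]
        plan.append((start, name, op_type))
    return plan, max(finish.values())

ops = [
    ("w1_matmul", "matmul", []),
    ("act",       "activation", ["w1_matmul"]),
    ("send_next", "net_send", ["act"]),  # networking scheduled like any other op
]
plan, total_cycles = schedule(ops)
for start, name, op_type in plan:
    print(f"cycle {start:4d}: {name} ({op_type})")
print(f"end-to-end latency known at compile time: {total_cycles} cycles")
```

Because every start cycle is fixed at compile time, performance can be reasoned about statically rather than measured after the fact.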

We'll discuss the challenges of breaking models apart across networks of LPUs, and outline how this HW/SW system architecture continues to enable breakthrough LLM inference latency at every model size.
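As one concrete example of what breaking a model apart can mean, the sketch below shows generic column-wise tensor parallelism for a single layer's matmul in NumPy. The shard_columns and tensor_parallel_matmul helpers and the layer dimensions are illustrative assumptions, not Groq's partitioning scheme.

```python
# Illustrative sketch of one common way to split a model layer across
# accelerators (generic tensor parallelism): each device holds a column
# shard of the weight matrix, computes its slice, and the slices are
# concatenated to form the full output.
import numpy as np

def shard_columns(weight, num_devices):
    """Split a weight matrix column-wise, one shard per device."""
    return np.array_split(weight, num_devices, axis=1)

def tensor_parallel_matmul(x, shards):
    """Each 'device' computes x @ shard; results are concatenated.

    On real hardware the per-shard matmuls run in parallel and the
    concatenation is a gather over the interconnect.
    """
    partial_outputs = [x @ w for w in shards]
    return np.concatenate(partial_outputs, axis=-1)

x = np.random.randn(1, 4096)            # one token's activations (hypothetical size)
weight = np.random.randn(4096, 11008)   # e.g. an MLP projection (hypothetical size)
shards = shard_columns(weight, num_devices=8)
y = tensor_parallel_matmul(x, shards)
assert np.allclose(y, x @ weight)       # sharded result matches the full matmul
```

On a deterministic, statically scheduled interconnect, the gather step can be planned at compile time rather than arbitrated dynamically, which is what keeps such partitioning low-latency at scale.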