Speaker Details

LLMs, Fast and Everywhere: Acceleration for the Age of GenAI

Michael Gschwind

Dr. Michael Gschwind is a Director / Principal Engineer for PyTorch at Meta Platforms. At Meta, he led the rollout of GPU inference for production services and led the development of MultiRay and TextRay, the first at-scale deployment of LLMs, which exceeded 800 billion queries per day shortly after its rollout. Most recently, he has led the enablement of Large Language Models for on-device AI on mobile and edge devices.


Large Language Models (LLMs) are a critical milestone on the path to Artificial General Intelligence (AGI). To deliver on this promise, all systems must become supercomputers, providing the processing power behind LLMs and, eventually, AGI. It was only in 2020 that we first demonstrated the use of LLMs in production using ASIC and GPU accelerators. Since then, accelerators have become the workhorse of AI, delivering the sustainability and efficiency improvements necessary to realize the promise of AI at scale. Given how much is at stake, as a community we are exploring exciting new directions, such as low-precision integer arithmetic, single-byte and sub-byte floating-point representations that were unimaginable until very recently, and ever more sophisticated collective communication protocols.

Combining innovations at the software and hardware levels, the AI, architecture, and supercomputing communities have delivered several orders of magnitude of efficiency improvement to enable LLM deployment at scale, building on critical innovations such as MultiRay/TextRay that keep our communities safe and deliver on the promise of AI sustainability. With the recent announcements of torchchat and ExecuTorch, we are delivering the ability to deploy LLMs everywhere, from servers to desktop and mobile/edge devices.