Cerebras Slays GPUs, Breaks Record for Largest AI Models Trained on a Single Device

Cerebras, the company behind the CS-2 Wafer Scale Engine, the world’s largest accelerator chip, has just announced a milestone: training the world’s largest NLP (Natural Language Processing) AI model on a single device. While that in itself could mean many things (it wouldn’t be much of a record if the previous largest model had been trained on a smartwatch, for instance), the model Cerebras trained reached an unprecedented 20 billion parameters – all without the workload having to be scaled across multiple accelerators. That’s enough to fit the internet’s latest sensation, OpenAI’s 12-billion-parameter image-from-text generator, DALL-E.

The most important part of Cerebras’ achievement is the reduction in infrastructure and software complexity it brings. Granted, a single CS-2 system is akin to a supercomputer all on its own. The Wafer Scale Engine-2 – which, as the name implies, is etched into a single wafer that would normally yield hundreds of mainstream chips – features a staggering 2.6 trillion 7 nm transistors, 850,000 cores, and 40 GB of integrated cache in a package consuming around 15 kW.
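That 40 GB of on-chip memory lines up neatly with the 20-billion-parameter figure, assuming the weights are stored in a 16-bit format (a common choice for training, though the article does not specify the precision Cerebras used). A back-of-envelope sketch:

```python
# Back-of-envelope check: memory footprint of a model's weights.
# Assumption (not stated in the article): 2 bytes per parameter, i.e. fp16/bf16.
PARAMS = 20_000_000_000  # 20 billion parameters

def model_size_gb(num_params: int, bytes_per_param: int) -> float:
    """Return the raw weight size in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

print(model_size_gb(PARAMS, 2))  # 40.0 -- fp16 weights fill the WSE-2's 40 GB exactly
print(model_size_gb(PARAMS, 4))  # 80.0 -- fp32 weights would not fit
```

Note this counts only the weights; optimizer state and activations add further memory pressure in practice.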

Cerebras’ Wafer Scale Engine-2 in all its wafer-sized glory. (Image credit: Cerebras)

Keeping an NLP model of up to 20 billion parameters on a single chip significantly reduces training overhead – the cost of thousands of GPUs and their associated hardware and scaling requirements – while doing away with the technical difficulty of partitioning the model across them. Cerebras says this is “one of the most painful aspects of NLP workloads,” sometimes “taking months to complete.”

Partitioning is a bespoke problem, unique not only to each neural network being processed but also to the specs of each GPU and the network that ties them all together – elements that must be worked out before the first training run is ever started. And the result can’t be ported across systems.
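To see why partitioning is bespoke, consider even the simplest scheme: assigning contiguous blocks of layers to GPUs (pipeline parallelism). The sketch below is illustrative only – the helper is hypothetical, not Cerebras’ or any framework’s API – and it ignores what makes the real problem hard: balancing per-layer memory and compute against each GPU’s specs and interconnect.

```python
# Naive pipeline-parallel partitioning: split layer indices into contiguous,
# near-equal blocks, one block per GPU. Hypothetical helper for illustration.

def partition_layers(num_layers: int, num_gpus: int) -> list[range]:
    """Return one contiguous range of layer indices per GPU."""
    base, extra = divmod(num_layers, num_gpus)
    blocks, start = [], 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs take one additional layer each.
        size = base + (1 if gpu < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks

# e.g. a 96-layer transformer across 8 GPUs -> 12 layers per device
print([len(b) for b in partition_layers(96, 8)])  # [12, 12, 12, 12, 12, 12, 12, 12]
```

Real partitioners must also account for unequal layer costs (embedding layers vs. attention blocks), activation-transfer bandwidth between devices, and pipeline-bubble scheduling – which is why the tuning rarely transfers from one cluster to another.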

Cerebras’ CS-2 is a self-contained supercomputing cluster that includes not only the Wafer Scale Engine-2, but also all associated power, memory and storage subsystems. (Image credit: Cerebras)

Pure numbers may make Cerebras’ achievement look underwhelming – OpenAI’s GPT-3, an NLP model that can write entire articles that sometimes fool human readers, features 175 billion parameters. DeepMind’s Gopher, launched late last year, raises that number to 280 billion. The brains at Google Brain have even announced the training of a trillion-parameter-plus model, the Switch Transformer.
