
    Why DeepSeek v3 matters in the world of LLMs

    The biggest AI plot twist at the end of 2024 did not come from Silicon Valley. It unexpectedly came from China. And it is called DeepSeek v3.

    Released during the festive period of 2024, DeepSeek v3 is an open-source (open-weights, to be more precise) Large Language Model (LLM) that matches the performance of leading closed-source models like GPT-4o and Claude 3.5 Sonnet, and even outperforms them on coding challenges like Codeforces.

    What makes this release truly remarkable isn’t just the technical excellence displayed by the DeepSeek team, but the ways in which DeepSeek v3 may have fundamentally challenged everything we thought we knew about the economics of frontier AI model development.

    In this post, we’ll cover three key aspects of this groundbreaking development. First, we’ll dig into the technical details of how DeepSeek v3 was trained. Then we’ll share why we believe DeepSeek’s achievements are so important for the AI industry. Finally, we’ll explore what we believe the second order effects - and beyond - of this release could be for AI and geopolitics.

    What is DeepSeek v3?

    Let’s start with the basics. At its core, DeepSeek v3 is a 671B parameter LLM trained on a dataset of 13.8 trillion tokens. It achieved State-Of-The-Art (SOTA) performance on popular AI benchmarks, rivalling - and in some cases surpassing - top tier models like GPT-4o and Claude 3.5 Sonnet.

    DeepSeek v3 performance against other leading LLMs - from DeepSeek v3 Technical Report

    For reference, AI models don’t “see” words; instead they see tokens, which encode groups of characters. A single word can therefore be made up of multiple tokens. Even so, 13.8 trillion tokens still represents an enormous amount of text - trillions of words, in fact!
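
    To make this concrete, here is a tiny illustration using OpenAI’s open-source tiktoken tokenizer. DeepSeek v3 uses its own tokenizer, so the exact splits and counts below are purely illustrative of the general idea:

```python
# Illustrative only: DeepSeek v3 has its own tokenizer; tiktoken is used here
# simply to show how text is broken into tokens rather than words.
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "DeepSeek v3 was trained on 13.8 trillion tokens."
token_ids = enc.encode(text)

print(f"{len(text.split())} words -> {len(token_ids)} tokens")
print([enc.decode([t]) for t in token_ids])  # each token maps back to a fragment of text
```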

    But the real innovations from DeepSeek v3 are not really about how big the model is. The elements that make this LLM stand out are described below:

    Mixture of Experts

    DeepSeek v3 employs a Mixture of Experts (MoE) architecture. In fact, there are 256 routed experts within the model. However, the DeepSeek team has optimised the model so that only 8 of them process any given token, which means that instead of using all 671B parameters, only around 37B of them are active at once.

    DeepSeek uses a special expert routing algorithm which prevents common problems like “routing collapse”. In simple terms, routing collapse occurs when an MoE model ends up over-using a small subset of its experts during training. This leads to an imbalanced workload and can negatively impact model latency and performance.

    Because only a fraction of the parameters are active for each token, DeepSeek’s expert routing reduces the computational cost of both training and inference, and it also lessens the memory footprint of the model.
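
    To make the idea more concrete, below is a minimal sketch of top-k expert routing in PyTorch, with deliberately tiny dimensions. It is not DeepSeek’s actual router (theirs adds refinements such as a shared expert and auxiliary-loss-free load balancing); it simply shows the core pattern of activating only a few experts per token.

```python
# Minimal top-k Mixture-of-Experts routing sketch (not DeepSeek's actual router).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model),
                           nn.GELU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x):                                      # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)             # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)         # keep only the top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalise their weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e                          # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(10, 64)
print(TinyMoE()(tokens).shape)  # torch.Size([10, 64]) - only 2 of 8 experts ran per token
```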

    Multi-Token Prediction (MTP)

    Traditional LLMs are trained to simply predict the next token. DeepSeek v3, however, predicts multiple tokens simultaneously. This is achieved by incorporating what the team calls “prediction modules” into the model’s architecture. Each prediction module is chained to the previous one in order to predict one further token ahead. This chaining approach improves the model’s ability to understand long sequences, as it can better plan its output.

    Diagram of Multi-Token Prediction - from DeepSeek v3 Technical Report

    It’s important to understand that MTP was only used during the training phase. Moreover, the model only predicted two tokens ahead during this phase: the standard next token plus one additional future token.

    By predicting future tokens during training, the model became better at learning the underlying structure and dependencies within the data. This leads to improved performance on the primary next token prediction task. The MTP modules are discarded at inference time because including them would only add unnecessary computational overhead.
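
    As a rough illustration of the training objective, the sketch below adds an auxiliary loss from a second head that looks one extra token ahead. DeepSeek’s actual MTP modules are small chained transformer blocks that share the embedding and output head, so treat this purely as a simplified picture of the extra loss term; the lambda_mtp weighting is a hypothetical value.

```python
# Simplified multi-token prediction objective: the usual next-token loss plus an
# auxiliary loss from a head that predicts the token two positions ahead.
import torch
import torch.nn.functional as F

def mtp_loss(hidden, main_head, extra_head, targets, lambda_mtp=0.3):
    """
    hidden:  (batch, seq, d_model) hidden states from the transformer trunk
    targets: (batch, seq) ground-truth token ids
    main_head / extra_head: e.g. torch.nn.Linear(d_model, vocab_size)
    """
    # Standard objective: position t predicts token t+1
    logits_next = main_head(hidden[:, :-1])
    loss_next = F.cross_entropy(logits_next.reshape(-1, logits_next.size(-1)),
                                targets[:, 1:].reshape(-1))

    # Auxiliary MTP objective: position t also predicts token t+2
    logits_future = extra_head(hidden[:, :-2])
    loss_future = F.cross_entropy(logits_future.reshape(-1, logits_future.size(-1)),
                                  targets[:, 2:].reshape(-1))

    # The extra head is dropped at inference; only the trunk benefits from it.
    return loss_next + lambda_mtp * loss_future
```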

    FP8 Training

    DeepSeek uses an FP8 mixed precision training framework. "FP" stands for "Floating Point", and 8 refers to the number of bits used to store each value. Traditionally, AI models are trained on FP32 (32-bit) architectures to capture the very large and very small numbers that arise from the many multiplications performed inside a neural network. However, this high precision comes with significant memory and computational costs.

    Diagram of FP8 training framework - from DeepSeek v3 Technical Report

    FP8 significantly reduces the memory and computational resources required for training a model, but this lower precision normally has an adverse effect on model performance.

    To maintain solid performance while using FP8, DeepSeek v3 employs a mixed precision approach. This means that not all operations are performed in FP8; some operations, like the embedding module and the output head, are kept in higher precision to ensure numerical stability. The DeepSeek team also uses techniques like fine-grained quantization, where scaling factors are applied to smaller groups of elements, and increased-precision accumulation, which improves the accuracy of FP8 matrix multiplications.
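
    The snippet below is a toy NumPy simulation of the fine-grained quantization idea: each small block of values gets its own scaling factor, so a single outlier cannot destroy precision for the rest of the tensor. It only mimics an FP8-like value range (the 448 limit of the E4M3 format) and does not use real FP8 kernels.

```python
# Toy simulation of fine-grained quantization: each block of 128 values gets its
# own scale, so one large outlier doesn't wreck precision for the whole tensor.
import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable magnitude in the E4M3 FP8 format

def quantize_blockwise(x, block=128):
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) / FP8_E4M3_MAX  # per-block scale
    q = np.round(x / scales)  # crude stand-in for casting to a low-precision format
    return q, scales

def dequantize(q, scales):
    return q * scales

x = np.random.randn(1024).astype(np.float32)
x[0] = 500.0                              # inject an outlier into the first block
q, s = quantize_blockwise(x)
x_hat = dequantize(q, s).reshape(-1)
print("max abs reconstruction error:", np.abs(x - x_hat).max())
```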

    DualPipe Algorithm

    The DeepSeek team noticed that communication between different nodes during cross-node expert parallelism caused bottlenecks in their training pipeline. For each computation, there was a significant communication overhead, resulting in a computation-to-communication ratio of roughly 1:1. To mitigate this issue, the team invented a scheduling algorithm called DualPipe.

    Example of DualPipe scheduling - from DeepSeek v3 Technical Report

    DualPipe is a novel approach to pipeline parallelism that yields more efficient training. It works by dividing the forward and backward passes of each training step into smaller chunks and rearranging these chunks so that computation and communication phases overlap. This overlapping effectively hides the communication overhead, leading to faster training times.

    Beyond overlapping, DualPipe also reduces pipeline bubbles, which are periods of inactivity in the pipeline. This further improves the efficiency of the training process. Additionally, DualPipe is designed to maintain its efficiency even as the model scales up in size, as it can keep the communication overhead near zero, enabling continued scaling without significant performance penalties.
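
    The back-of-the-envelope sketch below is not DualPipe itself, but it illustrates why overlapping matters: with a roughly 1:1 computation-to-communication ratio, hiding communication behind the next chunk’s computation nearly halves the total time.

```python
# Toy model of overlapping computation and communication (not DualPipe itself).
def sequential_time(n_chunks, compute, comm):
    # Naive schedule: compute a chunk, then wait for its communication to finish.
    return n_chunks * (compute + comm)

def overlapped_time(n_chunks, compute, comm):
    # Communication for chunk i runs while chunk i+1 is computing, so only the
    # slower of the two phases dominates, plus one exposed tail.
    return n_chunks * max(compute, comm) + min(compute, comm)

print(sequential_time(16, compute=1.0, comm=1.0))  # 32.0
print(overlapped_time(16, compute=1.0, comm=1.0))  # 17.0
```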

    Why DeepSeek v3 is a big deal

    Faster training cycle

    Let’s look at why these innovations, taken together, are so important. The DeepSeek team trained the entire model using just 2,048 NVIDIA H800 GPUs over 57 days, consuming only 2.788 million GPU hours. To put this into perspective, Meta’s Llama 3.1 405B needed 30.8 million GPU hours - roughly eleven times more compute - despite having fewer total parameters. This is a stark - and welcome - contrast to most frontier labs, who boast about training runs spanning tens (or hundreds) of thousands of GPUs.

    The implications above are already staggering and have shattered our assumptions about the resources needed to produce leading AI models. But it doesn’t stop here.

    Training costs are over 10x lower than Llama 3’s

    According to the DeepSeek team, it cost only $5.6M to train DeepSeek v3. The team arrives at this figure by assuming a rental cost of $2 per H800 GPU hour. As noted above, total training time was around 2.8 million GPU hours, so the headline number follows from a simple multiplication: 2.8M GPU hours x $2 per GPU hour ≈ $5.6M.
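
    The arithmetic is simple enough to check directly, using the GPU-hour total and the assumed rental price quoted above:

```python
gpu_hours = 2.788e6       # total H800 GPU hours reported for DeepSeek v3's training
usd_per_gpu_hour = 2.0    # rental price assumed by the DeepSeek team
print(f"${gpu_hours * usd_per_gpu_hour / 1e6:.3f}M")  # ~$5.58M, i.e. roughly $5.6M
```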

    For reference, it is reported that Meta spent over $100M to train their Llama 3 405B model, and we presume that OpenAI and Anthropic spent similar sums to train their flagship models. In other words, the DeepSeek team spent less than a tenth as much to train a competing SOTA model - a figure that completely upends our assumptions about the economics of frontier model development.

    Summary of training costs of DeepSeek v3 - from DeepSeek v3 Technical Report

    However, the numbers claimed by the DeepSeek team should be scrutinised and put into context. The $5.6M figure omits costs such as staff salaries, for instance. Additionally, the DeepSeek team admits that the figure pertains only to the final training run of DeepSeek v3 and excludes all prior research and experiments on architectures, algorithms and data. This means that the actual total R&D cost for DeepSeek v3 is much higher than the reported figure.

    Some further details on H800 GPUs

    Let’s revisit the hardware angle and talk a bit more about those NVIDIA H800 GPUs. American frontier labs typically use NVIDIA H100 GPUs for their training runs. The H100 offers higher performance than the H800, most notably much greater chip-to-chip interconnect bandwidth, which matters enormously for large distributed training jobs. The H800 was designed specifically for the Chinese market to comply with US export restrictions on advanced compute.

    Given this handicap, the achievements of the DeepSeek team are even more commendable. Perhaps more importantly, they prove something many have long suspected: talent and out-of-the-box thinking, not hardware and compute, are the real scarce resources in this “AI race”.

    Second Order Effects of DeepSeek

    We believe that the second-order effects (and beyond) of DeepSeek’s release are far more significant than their initial achievements, which remain noteworthy in their own right.

    More organisations will build competitive models on limited budgets

    We’re about to witness a wave of innovation from organisations that previously couldn’t compete because they could not justify the compute costs. These organisations won’t only be located in Silicon Valley: research labs, universities, and tech companies across the globe - including those in regions with restricted access to cutting-edge hardware - now have a chance to join the frontier of AI development.

    All these organisations now have proof via DeepSeek’s achievements that indeed necessity is the mother of invention.

    More algorithmic innovations will emerge to overcome scaling laws

    DeepSeek’s breakthrough is likely to spark a wave of algorithmic innovations aimed at overcoming traditional scaling laws. Their paper and early benchmark results suggest we’ve only scratched the surface of what’s possible through clever engineering.

    OpenAI and co have no permanent moat against open source

    DeepSeek v3 is the perfect counter-argument to those who claim that closed-source labs like OpenAI enjoy a strong moat against open-source models. Recent examples suggest that the performance of closed frontier models can be replicated by open models within 12-18 months. Given these timelines, how can any lab or company maintain a real, long-lasting advantage?

    Accelerated frontier model development

    We’re likely to see an acceleration in the development of frontier models. Since DeepSeek’s innovations are now available to all, we expect leading closed-source frontier labs to incorporate them into their training methodologies. These techniques have the potential to bring down the time it takes to train SOTA models, which means that AGI and ASI timelines could be brought forward.

    Conclusion

    DeepSeek v3 represents much more than another large language model. It is a revelation and a wake-up call for the entire industry. It shows that open-source AI development can move the whole industry forward - exactly what Mark Zuckerberg argued when he wrote that Open Source AI is the Path Forward in the summer of 2024.

    DeepSeek v3’s technical report is a treasure trove for technologists who believe in the power of smart engineering over brute force. It also raises the bar for excellence in frontier model development, from both a machine learning and an engineering standpoint. The DeepSeek team have recently upped the ante by releasing DeepSeek R1, a reasoning model that is on par with OpenAI’s o1. R1 is another groundbreaking model which deserves a separate post from us in the future.

    You can use DeepSeek v3 for free via the web chat on DeepSeek’s website. They also provide an API where the cost per token is a fraction of that of competing AI models like Claude 3.5 Sonnet (which still remains our favourite) or GPT-4o.

    We believe that we’re about to experience an acceleration in frontier model development, driven in part by the stellar work from the DeepSeek team. At Kiseki Labs, we’re excited to help businesses prepare for and leverage this acceleration in intelligence. Reach out to us for a free AI consultation.
