
    Why DeepSeek v3 matters in the world of LLMs

    The biggest AI plot twist at the end of 2024 did not come from Silicon Valley. It unexpectedly came from China. And it is called DeepSeek v3.

    Released during the festive period of 2024, DeepSeek v3 is an open-source (open-weights, to be more precise) Large Language Model (LLM) that matches the performance of leading closed-source models like GPT-4o and Claude 3.5 Sonnet, and even outperforms them on coding challenges like Codeforces.

    What makes this release truly remarkable isn’t just the technical excellence displayed by the DeepSeek team, but the ways in which DeepSeek v3 may have fundamentally challenged everything we thought we knew about the economics of frontier AI model development.

    In this post, we’ll cover three key aspects of this groundbreaking development. First, we’ll dig into the technical details of how DeepSeek v3 was trained. Then we’ll share why we believe DeepSeek’s achievements are so important for the AI industry. Finally, we’ll explore what we believe the second order effects - and beyond - of this release could be for AI and geopolitics.

    What is DeepSeek v3?

    Let’s start with the basics. At its core, DeepSeek v3 is a 671B parameter LLM trained on a dataset of 13.8 trillion tokens. It achieved State-Of-The-Art (SOTA) performance on popular AI benchmarks, rivalling - and in some cases surpassing - top tier models like GPT-4o and Claude 3.5 Sonnet.

    DeepSeek v3 performance against other leading LLMs - from DeepSeek v3 Technical Report

    For reference, AI models don’t “see” words; instead they see tokens, which encode groups of characters. A single word can therefore be made up of multiple tokens. Even so, 13.8 trillion tokens still represents an enormous amount of text - trillions of words, in fact!
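
    To make this concrete, here is a tiny illustration using OpenAI’s open-source tiktoken tokenizer. DeepSeek v3 uses its own tokenizer, so the exact splits and counts below are purely illustrative of the general idea:

```python
# Illustrative only: DeepSeek v3 has its own tokenizer; tiktoken is used here
# simply to show how text is broken into tokens rather than words.
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "DeepSeek v3 was trained on 13.8 trillion tokens."
token_ids = enc.encode(text)

print(f"{len(text.split())} words -> {len(token_ids)} tokens")
print([enc.decode([t]) for t in token_ids])  # each token maps back to a fragment of text
```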

    But the real innovations from DeepSeek v3 are not really about how big the model is. The elements that make this LLM stand out are described below:

    Mixture of Experts

    DeepSeek v3 employs a Mixture of Experts (MoE) architecture. In fact, there are 256 routed experts within the model. However, the DeepSeek team has optimised the model so that only 8 of them process any given token, which means that instead of using all 671B parameters, only around 37B of them are active at once.

    DeepSeek uses a special expert routing algorithm which prevents common problems like “routing collapse”. In simple terms, routing collapse occurs when an MoE model ends up over-using a small subset of its experts during training. This leads to an imbalanced workload and can negatively impact model latency and performance.

    Because only a fraction of the parameters are active for each token, DeepSeek’s expert routing reduces the computational cost of both training and inference, and it also lessens the memory footprint of the model.
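
    To make the idea more concrete, below is a minimal sketch of top-k expert routing in PyTorch, with deliberately tiny dimensions. It is not DeepSeek’s actual router (theirs adds refinements such as a shared expert and auxiliary-loss-free load balancing); it simply shows the core pattern of activating only a few experts per token.

```python
# Minimal top-k Mixture-of-Experts routing sketch (not DeepSeek's actual router).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model),
                           nn.GELU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x):                                      # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)             # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)         # keep only the top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalise their weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e                          # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(10, 64)
print(TinyMoE()(tokens).shape)  # torch.Size([10, 64]) - only 2 of 8 experts ran per token
```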

    Multi-Token Prediction (MTP)

    Traditional LLMs are trained to simply predict the next token. DeepSeek v3, however, predicts multiple tokens simultaneously. This is achieved by incorporating what the team calls “prediction modules” into the model’s architecture. Each prediction module is chained to the previous one in order to predict one further token ahead. This chaining approach improves the model’s ability to understand long sequences, as it can better plan its output.

    Diagram of Multi-Token Prediction - from DeepSeek v3 Technical Report

    It’s important to understand that MTP was only used during the training phase. Moreover, the model only predicted two tokens ahead during this phase: the standard next token plus one additional future token.

    By predicting future tokens during training, the model became better at learning the underlying structure and dependencies within the data. This leads to improved performance on the primary next token prediction task. The MTP modules are discarded at inference time because including them would only add unnecessary computational overhead.
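
    As a rough illustration of the training objective, the sketch below adds an auxiliary loss from a second head that looks one extra token ahead. DeepSeek’s actual MTP modules are small chained transformer blocks that share the embedding and output head, so treat this purely as a simplified picture of the extra loss term; the lambda_mtp weighting is a hypothetical value.

```python
# Simplified multi-token prediction objective: the usual next-token loss plus an
# auxiliary loss from a head that predicts the token two positions ahead.
import torch
import torch.nn.functional as F

def mtp_loss(hidden, main_head, extra_head, targets, lambda_mtp=0.3):
    """
    hidden:  (batch, seq, d_model) hidden states from the transformer trunk
    targets: (batch, seq) ground-truth token ids
    main_head / extra_head: e.g. torch.nn.Linear(d_model, vocab_size)
    """
    # Standard objective: position t predicts token t+1
    logits_next = main_head(hidden[:, :-1])
    loss_next = F.cross_entropy(logits_next.reshape(-1, logits_next.size(-1)),
                                targets[:, 1:].reshape(-1))

    # Auxiliary MTP objective: position t also predicts token t+2
    logits_future = extra_head(hidden[:, :-2])
    loss_future = F.cross_entropy(logits_future.reshape(-1, logits_future.size(-1)),
                                  targets[:, 2:].reshape(-1))

    # The extra head is dropped at inference; only the trunk benefits from it.
    return loss_next + lambda_mtp * loss_future
```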

    FP8 Training

    DeepSeek uses an FP8 mixed precision training framework. "FP" stands for "Floating Point", and 8 refers to the number of bits used to store each value. Traditionally, AI models are trained on FP32 (32-bit) architectures to capture the very large and very small numbers that arise from the many multiplications performed inside a neural network. However, this high precision comes with significant memory and computational costs.

    Diagram of FP8 training framework - from DeepSeek v3 Technical Report

    FP8 significantly reduces the memory and computational resources required for training a model, but this lower precision normally has an adverse effect on model performance.

    To maintain solid performance while using FP8, DeepSeek v3 employs a mixed precision approach. This means that not all operations are performed in FP8; some operations, like the embedding module and the output head, are kept in higher precision to ensure numerical stability. The DeepSeek team also uses techniques like fine-grained quantization, where scaling factors are applied to smaller groups of elements, and increased-precision accumulation, which improves the accuracy of FP8 matrix multiplications.
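
    The snippet below is a toy NumPy simulation of the fine-grained quantization idea: each small block of values gets its own scaling factor, so a single outlier cannot destroy precision for the rest of the tensor. It only mimics an FP8-like value range (the 448 limit of the E4M3 format) and does not use real FP8 kernels.

```python
# Toy simulation of fine-grained quantization: each block of 128 values gets its
# own scale, so one large outlier doesn't wreck precision for the whole tensor.
import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable magnitude in the E4M3 FP8 format

def quantize_blockwise(x, block=128):
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) / FP8_E4M3_MAX  # per-block scale
    q = np.round(x / scales)  # crude stand-in for casting to a low-precision format
    return q, scales

def dequantize(q, scales):
    return q * scales

x = np.random.randn(1024).astype(np.float32)
x[0] = 500.0                              # inject an outlier into the first block
q, s = quantize_blockwise(x)
x_hat = dequantize(q, s).reshape(-1)
print("max abs reconstruction error:", np.abs(x - x_hat).max())
```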

    DualPipe Algorithm

    The DeepSeek team noticed that communication between different nodes during cross-node expert parallelism caused bottlenecks in their training pipeline. For each computation, there was a significant communication overhead, resulting in a computation-to-communication ratio of roughly 1:1. To mitigate this issue, the team invented a scheduling algorithm called DualPipe.

    Example of DualPipe scheduling - from DeepSeek v3 Technical Report

    DualPipe is a novel approach to pipeline parallelism that yields more efficient training. It works by dividing the forward and backward passes of each training step into smaller chunks and rearranging these chunks so that computation and communication phases overlap. This overlapping effectively hides the communication overhead, leading to faster training times.

    Beyond overlapping, DualPipe also reduces pipeline bubbles, which are periods of inactivity in the pipeline. This further improves the efficiency of the training process. Additionally, DualPipe is designed to maintain its efficiency even as the model scales up in size, as it can keep the communication overhead near zero, enabling continued scaling without significant performance penalties.
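
    The back-of-the-envelope sketch below is not DualPipe itself, but it illustrates why overlapping matters: with a roughly 1:1 computation-to-communication ratio, hiding communication behind the next chunk’s computation nearly halves the total time.

```python
# Toy model of overlapping computation and communication (not DualPipe itself).
def sequential_time(n_chunks, compute, comm):
    # Naive schedule: compute a chunk, then wait for its communication to finish.
    return n_chunks * (compute + comm)

def overlapped_time(n_chunks, compute, comm):
    # Communication for chunk i runs while chunk i+1 is computing, so only the
    # slower of the two phases dominates, plus one exposed tail.
    return n_chunks * max(compute, comm) + min(compute, comm)

print(sequential_time(16, compute=1.0, comm=1.0))  # 32.0
print(overlapped_time(16, compute=1.0, comm=1.0))  # 17.0
```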

    Why DeepSeek v3 is a big deal

    Faster training cycle

    Let’s look at why these innovations, taken together, are so important. The DeepSeek team trained the entire model using just 2,048 NVIDIA H800 GPUs over 57 days, consuming only 2.788 million GPU hours. To put this into perspective, Meta’s Llama 3.1 405B needed 30.8 million GPU hours - roughly eleven times more compute - despite having fewer total parameters. This is a stark - and welcome - contrast to most frontier labs, who boast about training runs spanning tens (or hundreds) of thousands of GPUs.

    The implications above are already staggering and have shattered our assumptions about the resources needed to produce leading AI models. But it doesn’t stop here.

    Training costs are over 10x lower than Llama 3’s

    According to the DeepSeek team, it cost only $5.6M to train DeepSeek v3. The team arrives at this figure by assuming a rental cost of $2 per H800 GPU hour. As noted above, total training time was around 2.8 million GPU hours, so the headline number follows from a simple multiplication: 2.8M GPU hours x $2 per GPU hour ≈ $5.6M.
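
    The arithmetic is simple enough to check directly, using the GPU-hour total and the assumed rental price quoted above:

```python
gpu_hours = 2.788e6       # total H800 GPU hours reported for DeepSeek v3's training
usd_per_gpu_hour = 2.0    # rental price assumed by the DeepSeek team
print(f"${gpu_hours * usd_per_gpu_hour / 1e6:.3f}M")  # ~$5.58M, i.e. roughly $5.6M
```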

    For reference, it is reported that Meta spent over $100M to train their Llama 3 405B model, and we presume that OpenAI and Anthropic spent similar sums to train their flagship models. In other words, the DeepSeek team spent less than a tenth as much to train a competing SOTA model - a figure that completely upends our assumptions about the economics of frontier model development.

    Summary of training costs of DeepSeek v3 - from DeepSeek v3 Technical Report

    However, the numbers claimed by the DeepSeek team should be scrutinised and put into context. The $5.6M figure omits costs such as staff salaries, for instance. Additionally, the DeepSeek team admits that the figure pertains only to the final training run of DeepSeek v3 and excludes all prior research and experiments on architectures, algorithms and data. This means that the actual total R&D cost for DeepSeek v3 is much higher than the reported figure.

    Some further details on H800 GPUs

    Let’s revisit the hardware angle and talk a bit more about those NVIDIA H800 GPUs. American frontier labs typically use NVIDIA H100 GPUs for their training runs. The H100 offers higher performance than the H800, most notably much greater chip-to-chip interconnect bandwidth, which matters enormously for large distributed training jobs. The H800 was designed specifically for the Chinese market to comply with US export restrictions on advanced compute.

    Given this handicap, the achievements of the DeepSeek team are even more commendable. Perhaps more importantly, they prove something many have long suspected: talent and out-of-the-box thinking, not hardware and compute, are the real scarce resources in this “AI race”.

    Second Order Effects of DeepSeek

    We believe that the second-order effects (and beyond) of DeepSeek’s release are far more significant than their initial achievements, which remain noteworthy in their own right.

    More organisations will build competitive models on limited budgets

    We’re about to witness a wave of innovation from organisations that previously couldn’t compete because they could not justify the compute costs. These organisations won’t only be located in Silicon Valley: research labs, universities, and tech companies across the globe - including those in regions with restricted access to cutting-edge hardware - now have a chance to join the frontier of AI development.

    All these organisations now have proof via DeepSeek’s achievements that indeed necessity is the mother of invention.

    More algorithmic innovations will emerge to overcome scaling laws

    DeepSeek’s breakthrough is likely to spark a wave of algorithmic innovations aimed at overcoming traditional scaling laws. Their paper and early benchmark results suggest we’ve only scratched the surface of what’s possible through clever engineering.

    OpenAI and co have no permanent moat against open source

    DeepSeek v3 is the perfect counter-argument to those who claim that closed-source labs like OpenAI enjoy a strong moat against open-source models. Recent examples suggest that the performance of closed frontier models can be replicated by open models within 12-18 months. Given these timelines, how can any lab or company maintain a real, long-lasting advantage?

    Accelerated frontier model development

    We’re likely to see an acceleration in the development of frontier models. Since DeepSeek’s innovations are now available to all, we expect leading closed-source frontier labs to incorporate them into their training methodologies. These techniques have the potential to bring down the time it takes to train SOTA models, which means that AGI and ASI timelines could be brought forward.

    Conclusion

    DeepSeek v3 represents much more than another large language model. It is a revelation and a wake-up call for the entire industry. It shows that open-source AI development can move the whole industry forward - exactly what Mark Zuckerberg argued when he wrote that Open Source AI is the Path Forward in the summer of 2024.

    DeepSeek v3’s technical report is a treasure trove for technologists who believe in the power of smart engineering over brute force. It also raises the bar for excellence in frontier model development, from both a machine learning and an engineering standpoint. The DeepSeek team have recently upped the ante by releasing DeepSeek R1, a reasoning model that is on par with OpenAI’s o1. R1 is another groundbreaking model which deserves a separate post from us in the future.

    You can use DeepSeek v3 for free via the web chat on DeepSeek’s website. They also provide an API where the cost per token is a fraction of that of competing AI models like Claude 3.5 Sonnet (which still remains our favourite) or GPT-4o.

    We believe that we’re about to experience an acceleration in frontier model development, driven in part by the stellar work from the DeepSeek team. At Kiseki Labs, we’re excited to help businesses prepare for and leverage this acceleration in intelligence. Reach out to us for a free AI consultation.
