What is DeepSeek? Sorting through the hype

The January 2025 release of DeepSeek-R1 initiated an avalanche of articles about DeepSeek—which, somewhat confusingly, is the name of a company and the models it makes and the chatbot that runs on those models. Given the volume of coverage and the excitement around the economics of a seismic shift in the AI landscape, it can be hard to separate fact from speculation and speculation from fiction.

Key Takeaways

  • DeepSeek is an AI research company based in China, known for its open-weight generative AI models.
  • DeepSeek-R1 is a reasoning model fine-tuned from the DeepSeek-V3 LLM, emphasizing step-by-step thought processes.
  • DeepSeek-V3 is a 671 billion parameter mixture of experts (MoE) model, excelling in math, reasoning, and coding.
  • The reported $5.5 million cost refers specifically to the final pre-training run of DeepSeek-V3 and excludes other significant expenses.
  • DeepSeek-R1-Distill models are smaller models fine-tuned to mimic DeepSeek-R1’s behavior, not the actual DeepSeek-R1.
  • Misleading reporting has exaggerated DeepSeek’s capabilities and cost-efficiency in some instances.
  • DeepSeek’s innovations, particularly in MoE architecture, are pushing the boundaries of AI efficiency and performance.

What follows is a straightforward guide to help you sort through other articles about DeepSeek, separate signal from noise and skip over hype and hyperbole. We’ll start with some brief company history, explain the differences between each new DeepSeek model and break down their most interesting innovations (without getting too technical).

Here’s a quick breakdown of what we’ll cover:

  • What is DeepSeek?
  • What exactly is DeepSeek-R1? We’ll explain the fine-tuning process that created it (“R1”) and the large language model (LLM) it was fine-tuned from: DeepSeek-V3.
  • What’s DeepSeek-V3? We’ll walk through how it’s different from other LLMs.
  • How much does DeepSeek-R1 cost? We’ll clear up some major misconceptions.
  • What is DeepSeek-R1-Distill? Despite their names, the R1-Distill models are fundamentally different from R1.
  • Why do you need to know this? We’ll highlight how headlines can be misleading.
  • What comes next?

What is DeepSeek?

DeepSeek is an AI research lab based in Hangzhou, China. It is also the name of the open-weight generative AI models the lab develops. In late January 2025, its DeepSeek-R1 LLM made mainstream tech and financial news for performance rivaling that of top proprietary models from OpenAI, Anthropic and Google at a significantly lower price point.

The origins of DeepSeek (the company) lie in those of High-Flyer, a Chinese hedge fund founded in 2016 by a trio of computer scientists with a focus on algorithmic trading strategies. In 2019, the firm used proceeds from its trading operations to establish an AI-driven subsidiary, High-Flyer AI, investing a reported USD 28 million in deep learning training infrastructure and quintupling that investment in 2021.

By 2023, High-Flyer’s AI research had grown to the extent that it warranted the establishment of a separate entity focused solely on AI—more specifically, on developing artificial general intelligence (AGI). The resulting research lab was named DeepSeek, with High-Flyer serving as its primary investor. Beginning with DeepSeek-Coder in November 2023, DeepSeek has developed an array of well-regarded open-weight models focusing primarily on math and coding performance.

In December 2024, the lab released DeepSeek-V3, the LLM on which DeepSeek-R1 is based. The breakthrough performances of DeepSeek-V3 and DeepSeek-R1 have positioned the lab as an unexpected leader in generative AI development moving forward.

What Exactly is DeepSeek-R1?

DeepSeek-R1 is a reasoning model created by fine-tuning an LLM (DeepSeek-V3) to generate an extensive step-by-step chain of thought (CoT) process before determining the final “output” it gives the user. Other reasoning models include OpenAI’s o1 (based on GPT-4o) and o3, Google’s Gemini 2.0 Flash Thinking (based on Gemini 2.0 Flash) and Alibaba’s open QwQ (“Qwen with Questions”), based on its Qwen2.5 model.

The intuition behind reasoning models comes from early research demonstrating that simply adding the phrase “think step by step” significantly improves model outputs. Subsequent research from Google DeepMind theorized that scaling up test-time compute (the amount of resources used to generate an output) could enhance model performance as much as scaling up train-time compute (the resources used to train a model).
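
To make that concrete, here is a minimal sketch of the “think step by step” trick against an OpenAI-compatible chat API, the interface many providers (including DeepSeek) expose. The endpoint URL and model name below are placeholders, not specific recommendations.

```python
# A minimal sketch of zero-shot chain-of-thought prompting.
# Assumes an OpenAI-compatible chat endpoint; base_url and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://example-llm-provider/v1", api_key="YOUR_KEY")

question = "A train leaves at 3:40 pm and arrives at 6:05 pm. How long is the trip?"

# Direct prompt: the model answers immediately.
direct = client.chat.completions.create(
    model="some-chat-model",  # placeholder
    messages=[{"role": "user", "content": question}],
)

# Chain-of-thought prompt: nudging the model to reason first
# tends to improve accuracy on multi-step problems.
cot = client.chat.completions.create(
    model="some-chat-model",  # placeholder
    messages=[{"role": "user", "content": question + "\nLet's think step by step."}],
)

print(direct.choices[0].message.content)
print(cot.choices[0].message.content)
```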

Though reasoning models are slower and more expensive—you still must generate (and pay for) all of the tokens used to “think” about the final response, and those tokens eat into your available context window—they have pushed the state of the art forward since OpenAI’s release of o1. Most notably, the emphasis on training models to prioritize planning and forethought has made them adept at complex math and reasoning problems that were previously inaccessible to LLMs.

For more on reasoning models, check out this excellent visual guide from Maarten Grootendorst.

Why is DeepSeek-R1 Important?

DeepSeek-R1’s performance rivals that of leading models, including OpenAI’s o1 and Anthropic’s Claude 3.5 Sonnet, on math, code and reasoning tasks. Regardless of which model is “best”—which is subjective and situation-specific—it’s a remarkable feat for an open model. But the most important aspects of R1 are the training techniques that it introduced to the open source community.

Typically, the process of taking a standard LLM from untrained to ready for end users is as follows:

  1. Pretraining: The model learns linguistic patterns through self-supervised learning.
  2. Supervised fine-tuning (SFT): The model learns how to apply those linguistic patterns from labeled examples.
  3. Reinforcement learning (RL): The model is guided toward more specific, abstract considerations. For standard chat-oriented models, this step usually entails reinforcement learning from human feedback (RLHF) to make responses more helpful and harmless. For reasoning models, RL is used to incentivize a deeper, longer “thought process.”

For proprietary reasoning models such as o1, the specific details of this final step are typically a closely guarded trade secret. But DeepSeek has released a technical paper detailing their process.

How Does DeepSeek-R1 Work?

In their first attempt to turn DeepSeek-V3 into a reasoning model, DeepSeek skipped SFT and went directly from pretraining to a simple reinforcement learning scheme (a minimal sketch of the reward logic follows the list):

  • Model query: Ask the model a question. Prompt it to output its thought process between “<think>” and “</think>,” and output its final answer between “<answer>” and “</answer>.”
  • Accuracy rewards: Reward the model for the quality of its answer (such as how well-generated code runs).
  • Format rewards: Reward the model for correctly using the “<think>” and “<answer>” format in responses.
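
For illustration, here’s a minimal sketch of what such format and accuracy rewards could look like in code. The regular expression, scoring values and function names are hypothetical; DeepSeek hasn’t published its reward implementation in this form.

```python
import re

THINK_ANSWER = re.compile(
    r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL
)

def format_reward(response: str) -> float:
    """Reward the model for wrapping its reasoning and answer in the expected tags."""
    return 1.0 if THINK_ANSWER.search(response) else 0.0

def accuracy_reward(response: str, reference: str) -> float:
    """Reward the model when the extracted answer matches a verifiable reference
    (for example, the known solution to a math problem)."""
    match = THINK_ANSWER.search(response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0

def total_reward(response: str, reference: str) -> float:
    # During RL, this scalar signal is what nudges the policy toward
    # longer, better-structured chains of thought.
    return format_reward(response) + accuracy_reward(response, reference)
```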

The resulting model (which they released as “DeepSeek-R1-Zero”) learned to generate complex chains of thought and employ reasoning strategies that yielded impressive performance on math and reasoning tasks. The process was straightforward and avoided costly labeled data for SFT. Unfortunately, as the technical paper explains, “DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability and language mixing.”

To train R1-Zero’s successor, DeepSeek-R1, DeepSeek amended the process:

  • Started with some conventional SFT to avoid a “cold start”
  • Used R1-Zero-style reinforcement learning, with an additional reward term to avoid language mixing
  • Used the resulting RL-tuned model (and the base DeepSeek-V3 model) to generate 800,000 more SFT examples
  • Added more SFT
  • Added more R1-Zero-style reinforcement learning
  • Used conventional reinforcement learning from human feedback (RLHF)

But that fine-tuning process is only half of the story. The other half is the base model for R1: DeepSeek-V3.

What is DeepSeek-V3?

DeepSeek-V3, the backbone of DeepSeek-R1, is a text-only, 671 billion (671B) parameter mixture of experts (MoE) language model. Particularly for math, reasoning and coding tasks, it’s arguably the most capable open source LLM available as of February 2025. More importantly, it’s significantly faster and cheaper to use than other leading LLMs.

671 billion parameters means it’s a huge model. For context, when Meta released Llama 3.1 405B—which is roughly 40% smaller than DeepSeek-V3—in July 2024, their official announcement described it as “the world’s largest and most capable openly available foundation model.” For further context, OpenAI’s GPT-3—the predecessor of the GPT-3.5 model that ChatGPT originally launched with—had 175 billion parameters. It’s worth noting that most major developers, including OpenAI, Anthropic and Google, don’t disclose the parameter count of their proprietary models.

A larger parameter count typically increases a model’s “capacity” for knowledge and complexity. More parameters mean more ways to adjust the model, which means a greater ability to fit the nooks and crannies of training data. But increasing a model’s parameter count also increases computational requirements, making it slower and more expensive.

So how is DeepSeek-V3 (and therefore DeepSeek-R1) fast and cheap? The answer lies primarily in the mixture of experts architecture and how DeepSeek modified it.

What is a Mixture of Experts (MoE) Model?

A mixture of experts (MoE) architecture divides the layers of a neural network into separate sub-networks (or expert networks) and adds a gating network that routes tokens to select “experts.” During training, each “expert” eventually becomes specialized for a specific type of token—for instance, one expert might learn to specialize in punctuation while another handles prepositions—and the gating network learns to route each token to the most appropriate expert(s).

Rather than activating every model parameter for each token, an MoE model activates only the “experts” best suited to that token. DeepSeek-V3 has a total parameter count of 671 billion, but it has an active parameter count of only 37 billion. In other words, it only uses 37 billion of its 671 billion parameters for each token it reads or outputs.

Done well, this MoE approach balances the capacity of its total parameter count with the efficiency of its active parameter count. Broadly speaking, this explains how DeepSeek-V3 offers both the capabilities of a massive model and the speed of a smaller one.
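
Here’s a toy sketch of the core idea: a gating network scores the experts for each token and only the top-k experts actually run. It illustrates the general technique, not DeepSeek-V3’s actual routing code, which adds shared experts, load balancing and other refinements.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """A minimal mixture-of-experts layer: route each token to its top-k experts."""

    def __init__(self, d_model: int = 64, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                           nn.Linear(4 * d_model, d_model)) for _ in range(n_experts)]
        )
        self.gate = nn.Linear(d_model, n_experts)   # scores each expert per token
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = self.gate(x)                                # (tokens, experts)
        weights, indices = scores.topk(self.top_k, dim=-1)   # pick the top-k experts
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                 # tokens routed to expert e
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out

# Only top_k of n_experts run per token, so compute scales with the
# "active" parameters even though every expert's weights sit in memory.
layer = ToyMoELayer()
tokens = torch.randn(10, 64)
print(layer(tokens).shape)   # torch.Size([10, 64])
```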

MoEs got a lot of attention when Mistral AI released Mixtral 8x7B in late 2023, and GPT-4 was rumored to be an MoE. While some model providers—notably IBM (with its Granite models), Databricks, Mistral and DeepSeek—have continued work on MoE models since then, many continue to focus on traditional “dense” models.

So if they’re so great, why aren’t MoEs more ubiquitous? There are 2 simple explanations:

  1. Because MoEs are more complex, they’re also more challenging to train and fine-tune.
  2. While the MoE architecture reduces computation costs, it does not reduce memory costs: though not every parameter will be activated at once, you still need to store all of those parameters in memory in case they’re activated for a given token. Therefore, MoEs require just as much RAM as dense models of the same size, which remains a major bottleneck.
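
To put point 2 in rough numbers, here’s a back-of-envelope estimate of the weight memory alone (ignoring activations, the KV cache and other overhead):

```python
# Back-of-envelope memory needs for a 671B-parameter MoE:
# every parameter must live in memory even though only ~37B are active per token.
TOTAL_PARAMS = 671e9

for name, bytes_per_param in [("FP16/BF16", 2), ("FP8", 1)]:
    weights_gb = TOTAL_PARAMS * bytes_per_param / 1e9
    print(f"{name}: ~{weights_gb:,.0f} GB just to hold the weights")

# FP16/BF16: ~1,342 GB; FP8: ~671 GB. Compute per token, by contrast,
# scales with the 37B active parameters -- hence "fast but RAM-hungry."
```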

How is DeepSeek’s MoE Unique?

DeepSeek-V3 features a number of clever engineering modifications to the basic MoE architecture that increase its stability while decreasing its memory usage and further reducing its computation requirements. Some of these modifications were introduced in its predecessor, DeepSeek-V2, in May 2024. Here are 3 notable innovations:

Multi-head Latent Attention (MLA)

The attention mechanism that powers LLMs entails a massive number of matrix multiplications (often shortened to “matmul” in diagrams) to compute how each token relates to the others. All of those intermediate calculations must be stored in memory as things move from input to final output.

Multi-head latent attention (MLA), first introduced in DeepSeek-V2, “decomposes” each matrix into 2 smaller matrices. This doubles the number of multiplications, but greatly reduces the amount of intermediate data that must be kept in memory. In other words, it lowers memory costs (while increasing computational costs)—which is great for MoEs, since they already have low computational costs (but high memory costs).
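
The underlying trick is low-rank factorization: replace one big matrix with the product of two thin ones. Here’s a toy illustration of that trade-off with made-up dimensions; it is not DeepSeek’s actual MLA code, which applies the idea to the attention keys and values.

```python
import torch

d, r = 4096, 512            # model width and a much smaller "latent" rank (toy numbers)

W = torch.randn(d, d)       # one big projection matrix: d*d values to store
A = torch.randn(d, r)       # "decomposed" version: two thin matrices,
B = torch.randn(r, d)       # d*r + r*d values in total

full_params = W.numel()
low_rank_params = A.numel() + B.numel()
print(full_params, low_rank_params, low_rank_params / full_params)
# ~16.8M vs ~4.2M values: a 4x reduction in what must be stored,
# at the cost of two matrix multiplications (x @ A @ B) instead of one (x @ W).
```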

Training in FP8 (floating point 8-bit)

In short: the specific values of each parameter in DeepSeek-V3 are represented with fewer bits than usual. This reduces precision, but increases speed and further reduces memory usage. Usually, models are trained at a higher precision—often 16-bit or 32-bit—and only quantized down to 8-bit afterward, if at all; DeepSeek-V3 was instead trained with FP8 mixed precision from the start.
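
As a rough illustration of the trade-off, here’s a sketch assuming a recent PyTorch build (2.1 or later) that exposes an FP8 dtype; actually computing in FP8 during training is considerably more involved than this.

```python
import torch

x = torch.tensor([3.14159265, 0.001234, 123.456])

x32 = x.to(torch.float32)
x8 = x.to(torch.float8_e4m3fn)          # one of the 8-bit floating-point formats

print(x32.element_size(), "bytes/value vs", x8.element_size(), "bytes/value")  # 4 vs 1
print(x8.to(torch.float32))             # values survive, but with visibly less precision
```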

Multi-token prediction (MTP)

Multi-token prediction is what it sounds like: instead of predicting only one token at a time, the model preemptively predicts some of the next tokens too—which is easier said than done.
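
Here’s a toy sketch of the general idea: a shared representation feeds multiple prediction heads, one per future position. This is the simplest “parallel heads” flavor, shown only for illustration; DeepSeek-V3’s MTP module predicts the extra tokens sequentially, and the details matter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMultiTokenPredictor(nn.Module):
    """Predict the next two tokens from a shared hidden state (toy illustration)."""

    def __init__(self, vocab: int = 1000, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.trunk = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for a transformer
        self.head_next = nn.Linear(d_model, vocab)    # predicts token t+1
        self.head_next2 = nn.Linear(d_model, vocab)   # predicts token t+2

    def forward(self, tokens: torch.Tensor):
        h, _ = self.trunk(self.embed(tokens))         # (batch, seq, d_model)
        return self.head_next(h), self.head_next2(h)

model = ToyMultiTokenPredictor()
tokens = torch.randint(0, 1000, (2, 16))              # toy batch of token IDs

logits1, logits2 = model(tokens[:, :-2])               # hidden states for earlier positions
target1 = tokens[:, 1:-1]                              # the token one step ahead
target2 = tokens[:, 2:]                                # the token two steps ahead

loss = (F.cross_entropy(logits1.flatten(0, 1), target1.flatten())
        + F.cross_entropy(logits2.flatten(0, 1), target2.flatten()))
print(loss.item())
```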

Was DeepSeek-R1 Made for Only USD 5.5 Million?

No. Technically, DeepSeek reportedly spent about USD 5.576 million on the final pre-training run for DeepSeek-V3. However, that number has been taken dramatically out of context.

DeepSeek has not announced how much it spent on data and compute to yield DeepSeek-R1. The widely reported “USD 6 million” figure is specifically for DeepSeek-V3.

Furthermore, citing only the final pretraining run cost is misleading. As IBM’s Kate Soule, Director of Technical Product Management for Granite, put it in an episode of the Mixture of Experts Podcast: “That’s like saying if I’m gonna run a marathon, the only distance I’ll run is [that] 26.2 miles. The reality is, you’re gonna train for months, practicing, running hundreds or thousands of miles, leading up to that 1 race.”

Even the DeepSeek-V3 paper makes it clear that USD 5.576 million is only an estimate of how much the final training run would cost in terms of average rental prices for NVIDIA H800 GPUs. It excludes all prior research, experimentation and data costs. It also excludes their actual training infrastructure—one report from SemiAnalysis estimates that DeepSeek has invested over USD 500 million in GPUs since 2023—as well as employee salaries, facilities and other typical business expenses.
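
The arithmetic behind the figure, as laid out in the DeepSeek-V3 technical report, is a simple estimate: total GPU-hours multiplied by an assumed rental rate.

```python
# Figures as reported in the DeepSeek-V3 technical report:
gpu_hours = 2.788e6          # total H800 GPU-hours for the final training run
rate_usd = 2.0               # assumed rental price per H800 GPU-hour

print(f"${gpu_hours * rate_usd / 1e6:.3f}M")   # $5.576M -- compute rental only
# Everything else (prior experiments, data, owned hardware, salaries) is excluded.
```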

To be clear, spending only USD 5.576 million on a pretraining run for a model of that size and ability is still impressive. For comparison, the same SemiAnalysis report posits that Anthropic’s Claude 3.5 Sonnet—another contender for the world’s strongest LLM (as of early 2025)—cost tens of millions of USD to pretrain. That same design efficiency also enables DeepSeek-V3 to be operated at significantly lower costs (and latency) than its competition.

But the notion that we have arrived at a drastic paradigm shift, or that western AI developers spent billions of dollars for no reason and new frontier models can now be developed for low 7-figure all-in costs, is misguided.

DeepSeek-R1-Distill Models

DeepSeek-R1 is impressive, but it’s ultimately a version of DeepSeek-V3, which is a huge model. Despite its efficiency, for many use cases it’s still too large and RAM-intensive.

Rather than developing smaller versions of DeepSeek-V3 and then fine-tuning those models, DeepSeek took a more direct and replicable approach: using knowledge distillation on smaller open source models from the Qwen and Llama model families to make them behave like DeepSeek-R1. They called these models “DeepSeek-R1-Distill.”

Knowledge distillation, in essence, is an abstract form of model compression. Rather than just training a model directly on training data, knowledge distillation trains a “student model” to emulate the way a larger “teacher model” processes that training data. The student model’s parameters are adjusted to produce not only the same final outputs as the teacher model, but also the same thought process—the intermediate calculations, predictions or chain-of-thought steps—as the teacher.
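
For illustration, here’s the classic soft-target distillation objective, in which the student is trained to match the teacher’s full output distribution at a softened “temperature.” This is a generic sketch of knowledge distillation, not DeepSeek’s exact recipe, which (per their paper) centers on fine-tuning the student models on reasoning traces generated by R1.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Toy usage: a batch of 4 "tokens" over a 10-word vocabulary.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```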

Despite their names, the “DeepSeek-R1-Distill” models are not actually DeepSeek-R1. They are versions of Llama and Qwen models fine-tuned to act like DeepSeek-R1. While the R1-distills are impressive for their size, they don’t match the “real” DeepSeek-R1.

So if a given platform claims to offer or use “R1,” it’s wise to confirm which “R1” they’re talking about.

Misleading Reporting About DeepSeek

Between the unparalleled public interest and unfamiliar technical details, the hype around DeepSeek and its models has at times resulted in the significant misrepresentation of some basic facts.

For example, early February featured a swarm of stories about how a team from UC Berkeley apparently “re-created” or “replicated” DeepSeek-R1 for only USD 30. That’s a deeply intriguing headline with incredible implications if true—but it’s fundamentally inaccurate in multiple ways:

  • The Berkeley team didn’t re-create R1’s fine-tuning technique. They replicated R1-Zero’s RL-only fine-tuning technique per the guidelines in DeepSeek’s technical paper.
  • The Berkeley team didn’t fine-tune DeepSeek-V3, the 671B parameter model that serves as the backbone of DeepSeek-R1 (and DeepSeek-R1-Zero). Instead, they fine-tuned small, open source Qwen2.5 models (and found success with the 1.5B, 3B and 7B variants). Naturally, it’s much cheaper to fine-tune a 1.5B parameter model than a 671B parameter model, given that the former is literally hundreds of times smaller.
  • They only tested their miniature R1-Zero-inspired models’ performance on a single math-specific task. As researcher Jiayi Pan clarified, their experiment didn’t touch upon code or general reasoning.

In short, the UC Berkeley team did not re-create DeepSeek-R1 for USD 30. They simply showed that DeepSeek’s experimental, reinforcement learning-only fine-tuning approach, R1-Zero, can be used to teach small models to solve intricate math problems. Their work is interesting, impressive and important. But without a fairly detailed understanding of DeepSeek’s model offerings—which many busy readers (and writers) don’t have time for—it’s easy to get the wrong idea.

What Might Be Next for DeepSeek?

As developers and analysts spend more time with these models, the hype will probably settle down a bit. Much in the same way that an IQ test alone is not an adequate way to hire employees, raw benchmark results are not enough to determine whether any model is the “best” for your specific use case. Models, like people, have intangible strengths and weaknesses that take time to understand.

It will take a while to determine the long-term efficacy and practicality of these new DeepSeek models in a formal setting. As WIRED reported in January, DeepSeek-R1 has performed poorly in security and jailbreaking tests. These concerns will likely need to be addressed to make R1 or V3 safe for most enterprise use.

Meanwhile, new models will arrive and continue to push the state of the art. Consider that GPT-4o and Claude 3.5 Sonnet, the leading closed-source models against which DeepSeek’s models are being compared, were first released last summer: a lifetime ago in generative AI terms. Following the release of R1, Alibaba announced the impending release of their own massive MoE model, Qwen2.5-Max, which they claim beats DeepSeek-V3 across the board. More providers will likely follow suit.

Most importantly, the industry and open source community will experiment with the exciting new ideas that DeepSeek has brought to the table, integrating or adapting them for new models and techniques. The beauty of open source innovation is that a rising tide lifts all boats.

FAQ: Answering Your Questions

1. What is DeepSeek and where is it based?

DeepSeek is an AI research company based in Hangzhou, China, focused on developing open-weight generative AI models.

2. What is DeepSeek-R1 and what makes it special?

DeepSeek-R1 is a reasoning model fine-tuned from the DeepSeek-V3 LLM, designed to generate detailed, step-by-step thought processes before providing an answer. It rivals top proprietary models in math, coding, and reasoning tasks.

3. What is DeepSeek-V3 and why is it so efficient?

DeepSeek-V3 is a 671 billion parameter mixture of experts (MoE) model. Its efficiency comes from activating only 37 billion parameters per token, balancing capacity with speed.

4. Is it true that DeepSeek-R1 only cost $5.5 million to develop?

The $5.5 million figure refers to the estimated cost of the final pre-training run for DeepSeek-V3 and doesn’t include other significant costs like research, experimentation, data acquisition, and infrastructure.

5. What are DeepSeek-R1-Distill models and how do they relate to DeepSeek-R1?

DeepSeek-R1-Distill models are smaller models from the Llama and Qwen families fine-tuned to mimic the behavior of DeepSeek-R1 through knowledge distillation. They are not the same as the original DeepSeek-R1.

6. How does DeepSeek’s Mixture of Experts (MoE) architecture work?

DeepSeek’s MoE architecture divides the neural network into specialized sub-networks (“experts”) and uses a gating network to route tokens to the most appropriate experts, activating only a fraction of the total parameters for each token.

7. What are some potential future developments for DeepSeek?

Future developments may include addressing security concerns, integrating DeepSeek’s innovations into new models, and competing with emerging models from other providers like Alibaba’s Qwen2.5-Max.

Conclusion

DeepSeek’s emergence as a significant player in the AI landscape is undeniable. With its innovative models like DeepSeek-R1 and DeepSeek-V3, the company is pushing the boundaries of what’s possible with open-weight AI. By focusing on efficiency and reasoning capabilities, DeepSeek is offering a compelling alternative to established models, fostering competition and driving progress in the field.

Are you ready to dive deeper into the world of AI and machine learning? Melsoft Academy offers comprehensive courses designed to equip you with the skills and knowledge needed to succeed in this rapidly evolving industry. Apply now and take the first step towards a rewarding career in AI!
