03/20 2025
This is the biggest question for users who struggle to choose: among DeepSeek, Yuanbao, Doubao, and Kimi, which is the king of user experience? Which comes out ahead, and which falls behind?
Author | Sun Tianyu
Editor | Yang Ming
The emergence of DeepSeek has intensified the battle among AI assistants.
For example, Tencent Yuanbao, which previously had almost no presence, spent heavily on marketing and traffic acquisition after integrating DeepSeek and gained a flood of users. It recently overtook ByteDance's Doubao, Kimi, and even DeepSeek, topping the free app download chart on Apple's App Store for a time.
However, spending more on traffic is only the first step. Improving user retention and holding on to market share is an even greater challenge.
What determines all this is user experience: whether AI assistants are genuinely useful in work and daily life and improve efficiency. Moreover, as long as the hallucination problem of large AI models remains unresolved, what users need are accurate answers, not fabrications.
This is also the biggest question for users who struggle to choose. Among DeepSeek, Yuanbao, Doubao, and Kimi, which is the king of user experience? Which comes out ahead, and which falls behind?
Recently, "Jidian Business" ran a side-by-side evaluation of DeepSeek, Tencent Yuanbao, Kimi, and Doubao across dimensions such as accuracy, deep thinking, and complex text processing. Taking a practical perspective, we explored how these tools actually differ in use, hoping to give users a basis for choosing the AI tool that suits them best.
01
Deep Thinking:
Fabrication of Data Remains Prominent
If the traditional search model is "feeding users at the table," the breakthrough of current large models lies in telling users "how this meal is made and why this dish is delicious."
Deep thinking ability not only lets a model analyze user needs and true intentions accurately, giving users answers that are as comprehensive and accurate as possible, but also shows the model's reasoning as it solves a problem, helping users clarify their own thinking.
At 7 p.m. on February 27, Xiaomi held a press conference to launch the SU7 Ultra car. That night, Lei Jun posted on Weibo that the car exceeded 10,000 orders within two hours of going on sale.
In response, "Jidian Business" asked the above four large models, hoping they could help determine whether Xiaomi's stock is worth investing in.
Tencent Yuanbao and DeepSeek provided investment advice, while Kimi concluded that Xiaomi has investment value in the medium to long term. Doubao, in addition to giving reasons to buy, also listed risk factors. From the perspective of protecting investors' rights and interests, such risk warnings are necessary to guard against blind investment.
From top to bottom: Tencent Yuanbao, DeepSeek, Kimi, Doubao
In terms of deep thinking, only Yuanbao showed its thinking process in detail, laying out a complete analysis framework covering event background, analysis dimensions, and financial models, and inferring the user's investment needs.
Kimi and Doubao compiled useful reference suggestions from online information. By contrast, DeepSeek's analysis was driven by the instruction itself; it cited no reference materials but offered diversified short-term and long-term strategies for investors to choose from.
As for whether the investment advice given by the large models is accurate, we do not evaluate that here, because too many factors affect an investment. However, the accuracy of the data cited during the deep thinking process can be verified, and on that count most of the models fabricated figures.
According to Xiaomi Group's financial report, the company's operating revenues for 2020-2022 were 245.8 billion yuan, 328.3 billion yuan, and 271 billion yuan, respectively, and its R&D investments were 10 billion yuan, 13.2 billion yuan, and 16 billion yuan, respectively. Comparing the operating data provided by several models, only DeepSeek was accurate.
Xiaomi Group 2022 Annual Report
Yuanbao automatically generated a table to present the information more intuitively, but apart from the operating revenue, which was correct, its net profit margin and R&D investment ratio differed from the actual figures.
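The R&D investment ratio is simply R&D spending divided by operating revenue, so readers can recompute it themselves when checking a model's table. The following is a minimal sketch that uses only the revenue and R&D figures quoted in this article; the variable and function names are illustrative and do not come from any financial report or AI tool.

```python
# Figures in billions of yuan, exactly as quoted in this article.
revenue = {2020: 245.8, 2021: 328.3, 2022: 271.0}
rd_spend = {2020: 10.0, 2021: 13.2, 2022: 16.0}


def rd_ratio(year):
    """Return R&D spending as a percentage of operating revenue."""
    return rd_spend[year] / revenue[year] * 100


# Round to two decimals for easy comparison with a model's table.
ratios = {year: round(rd_ratio(year), 2) for year in revenue}

for year, pct in ratios.items():
    print(f"{year}: R&D is {pct}% of revenue")
```

Any ratio in an AI-generated table that differs noticeably from a recomputation like this is worth checking against the original financial report.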
According to the international market research firm IDC, Xiaomi's global market share in 2020 was 12%. The figure provided by Tencent Yuanbao differed from this by 1.4 percentage points and was closer to Xiaomi's 13.7% share in the fourth quarter of that year.
Tencent Yuanbao's summary of Xiaomi's operating data in the past five years
These deviations arise partly because large models cannot capture the latest facts and draw on a relatively narrow range of sources, so the results they generate are often limited by outdated data.
Yuanbao's own special note is evidence of this: "The data in this article is as of March 2024, and specific investments need to be based on real-time financial reports and industry dynamics." Clearly, Yuanbao's seemingly comprehensive corporate analysis and investment advice lag current market conditions by a year.
On the other hand, when the web content itself contains errors, since AI cannot independently identify false information and conduct effective verification, it will output the erroneous information as a fact.
Among the four AI assistants, Doubao and Kimi both clearly labeled their information sources, with Kimi collecting the most information and covering the widest range.
Kimi generated data & Xiaomi Group's 2022 financial report
Kimi read 179 webpages before producing its analysis. Its sources included official company channels as well as mainstream and professional media such as The Paper, Eastmoney, and Sina Finance. The most recent item it captured was a report published on March 7, so the information was very current. However, because Kimi could not verify the accuracy of that content, the 2022 R&D expenditure figure it presented was wrong.
02
Long Text and Reading Comprehension:
Yuanbao's Details Cannot Withstand Scrutiny
Looking back at the "competitive history" of large AI models, features have been continuously added, but long-text processing and reading comprehension remain among the core capabilities that users value most.
As early as June 2024, reporters from The Beijing News' Shell Finance used the topic of the college entrance examination Chinese composition to test the text processing ability and knowledge depth of eight models, including ERNIE Bot, Tongyi Qianwen, Kimi, Baixiaoying, and Tencent Yuanbao.
The topic was: "Read the following material and write according to the requirements. (60 points) With the popularization of the Internet and the application of artificial intelligence, more and more problems can be answered quickly. So, will our problems become fewer and fewer? What associations and reflections does the above material trigger in you? Please write an article. Requirements: Choose an angle, determine the theme, clarify the style, and create your own title; do not copy or plagiarize; do not disclose personal information; no less than 800 words."
Nine months later, "Jidian Business" put the same question to the AI assistants (tested on March 8).
Interestingly, Kimi, known for its "diligent and hardworking" persona, produced a title and essay that looked completely different from before. On closer reading, however, the central idea, structure, and even the logic of the essay matched the result from The Beijing News' evaluation. One can't help exclaiming: "AI, you've learned to be lazy too!"
Kimi's evaluation results (left is the latest content obtained, right is the content obtained by The Beijing News)
Users generally assume that AI continuously updates its answers based on information it can collect from the internet, and that even when the same question is asked at different times, a large model will give ever better responses, as if it had a built-in upgrade function.
However, industry insiders have pointed out that whether large models will be updated and upgraded depends on the design architecture and data update mechanism.
Generally speaking, large models learn patterns and rules from data such as texts, books, and news during the training phase to generate answers. After training is completed, the knowledge of large models is fixed and will not be updated in real-time. If the model is to answer the latest information, developers need to retrain the model regularly or supplement data through technical means.
In addition, many netizens on Xiaohongshu have also pointed out that their "AI interns" are getting lazier.
One user said that whether it's ChatGPT, ERNIE Bot, or Kimi, as long as there is no word count requirement, the response content is very brief. Occasionally, when uploading files for the large model to analyze, it would reply that it could not see the file, and only by clearly issuing the instruction "File has been uploaded, can be read" would it get the desired response. This made the user exclaim, "Not only are the replies short, but it seems like it's trying to get away with it."
However, what is reassuring is that the results of DeepSeek and Doubao show a richer knowledge reserve, with clear article structure, relatively rigorous logic, and elegant language with quotations.
In terms of the accuracy of quotations, Doubao stated that "The Mogao Grottoes contain 'The Unity of Form and Emptiness,'" and the historical events mentioned (such as Deep Blue defeating Garry Kasparov and AlphaGo defeating Lee Sedol) were all accurate. Moreover, it also accurately quoted Socrates' questioning on the streets of Athens, "What is justice?"
Tencent Yuanbao's answer seems more profound than it was nine months ago. The earlier essay read like a high school student's composition, quoting famous sayings in the first paragraph and answering the question dutifully. The new essay opens with a more readable, story-like beginning, as if the AI were trying to lead readers into thinking through anecdotes.
Newly generated content based on the topic, Tencent Yuanbao (left) and DeepSeek (right)
Behind these contents, we also discovered problems with Yuanbao and DeepSeek.
The first is the piling up of facts, with long passages that do not reflect the central idea and do not meet the requirements of the topic; second, the logical relevance between paragraphs is insufficient, lacking transitions and progressive levels, and lacking reasoning ability in complex text processing. It's no wonder that netizens previously commented sharply that "Yuanbao's reasoning and correlation ability are very poor."
In addition, there are many errors of detail in the text. For example, the transparent glaze at the corners of the Mona Lisa's smile has only 40 layers, not the hundreds stated in Yuanbao's article; and Bletchley Park, where the Enigma machine was deciphered, was a mansion where the British government conducted codebreaking, not a park.
DeepSeek's article also contains attribution errors: "wave-particle duality" was proposed by the French theoretical physicist Louis de Broglie, and the "photoelectric effect" was discovered by the German physicist Heinrich Hertz, while Albert Einstein was the one who correctly explained the phenomenon.
03
Knowledge Depth:
All Four Assistants Have Inaccurate Literature
These factual errors have an entirely different cause from the inaccuracies in the earlier cases.
When large models cannot find useful information on the internet, or fall into a "knowledge desert" in an unfamiliar field, they fabricate facts and details out of thin air to keep the generated content and logic coherent.
This ability of large models to "talk nonsense" is called "hallucination." As AI becomes a tool that everyone uses, the consequences of such false information will become more severe.
Previously, the media reported that a law graduate with the pseudonym Xiao Zhao frequently used AI tools such as Doubao and DeepSeek while writing her thesis. She found that these tools differed in how they "hallucinate": OpenAI's GPT-4 does not have sufficient knowledge of domestic materials; Doubao's language is plain, and its hallucination is not serious; DeepSeek's language is the most vivid and fluent, with the best text processing ability, but it also fabricates details the most seriously.
"In the absence of one's own discernment, it may be difficult to judge the truthfulness of information," said Chen Tianhao, a tenured associate professor at Tsinghua University, in an interview. He added that for special groups such as students, the risks posed by the hallucination problem of large models may be greater.
A teacher at a university in central China also told "Jidian Business" that while supervising undergraduate theses he finds traces of "AI hallucination," with the biggest flaw appearing in the references section. "Some journal names are real, even leading journals in the discipline, but a search will not find the article at all."
To test this, we also asked the four models to produce academic writing that demands deep knowledge. The prompt was as follows:
Please design a paper title, outline, and write the abstract section around the question of "The Influence of Commercial Advertisements on Consumers' Purchasing Behavior in a Consumer Society." Requirements: The outline should be set to three-level headings; the abstract should be no less than 1000 words; list the referenced literature. (Tested on March 11)
Kimi Abstract
DeepSeek Abstract
Doubao Abstract
Yuanbao Outline
The results of the horizontal comparison evaluation are as follows: In terms of abstract content, Kimi's language is the most straightforward, providing a basic description of the research ideas but lacking depth. DeepSeek and Doubao not only stated the research background but also created research conclusions without basis. Yuanbao, on the other hand, listed various theories and research methods related to the topic, and within the same chapter, it involved three specific research methods: eye tracking, case analysis, and experiments.
From the perspective of knowledge reserve and depth, Yuanbao performed the best among the four AI assistants. However, the abstract lists numerous experimental data without sources, and the piecemeal assembly of research methods and theories does not conform to the general academic research thinking, making it the least feasible.
As for the references, all four AI assistants listed false references.
Kimi References and Search Results
Kimi either passed off scholars' theories as book titles or combined real researchers and journals with fabricated article titles. Some of the literature listed by Doubao, Yuanbao, and DeepSeek is fictional.
Yuanbao References and Search Results
Take reference [2] provided by Tencent Yuanbao as an example: the journal does exist, but the article cannot be found in Chinese or English databases such as CNKI, Baidu Scholar, Google Scholar (mirror), and Springer Nature Link. This is a common problem with current large AI models.
However, for questions about lifestyle services, the accuracy of AI assistants is still high. We asked the four tools: "What are some places for weekend hiking and leisure in Chongqing in March?" (Tested on March 6) All four assistants provided 9-11 specific locations.
In comparison, DeepSeek and Kimi were average, giving only brief reasons for their recommendations. Doubao organized its answer in tiers by distance, grouping destinations as "urban area - suburbs - exurbs" and providing travel routes.
Yuanbao's guide is the most comprehensive. In addition to classifying according to the characteristics of attractions, it also indicates the difficulty level, travel mode, and duration of the visit. Users can make choices based on their own needs and physical conditions.
Conclusion:
Combining the above examples, we evaluated the four AI assistants from multiple dimensions such as speed, accuracy, information recognition, reasoning and association abilities, long text processing, and user experience. Above is a detailed summary. See which one is the most suitable "AI intern" for you.
END
Producer: Huang Qiangqiang