NVIDIA: Empire Rift One by One
Original source: Decode
There is a common illusion that Intel's CPUs sell well simply because Intel is a successful hardware company, when in fact what underpins Intel's dominance of desktop processors is the x86 architecture, born in 1978.
The same illusion surrounds NVIDIA.
One of the unsung heroes behind NVIDIA's near-monopoly of the AI training chip market is the CUDA architecture.
Born in 2006, CUDA has worked its way into virtually every area of computing and has, in effect, molded computing in NVIDIA's image: an estimated 80% of research in aerospace, bioscience, mechanical and fluid simulation, and energy exploration runs on top of CUDA.
In AI, the hottest field of all, almost every major player is preparing a Plan B: Google, Amazon, Huawei, Microsoft, OpenAI, Baidu… No one wants its future to rest in someone else's hands.
The startup consultancy Dealroom.co has released data showing that in this wave of generative AI, the United States has captured 89% of global investment and financing, while in AI chips China ranks first in the world, attracting more than twice as much investment and financing as the United States.
In other words, however much Chinese and American companies differ in how they build large models and how far along they are, they are remarkably aligned on one thing: keeping computing power in their own hands.
Why does CUDA have this magic?
In 2003, to compete with Intel, which was moving toward multi-core CPUs, NVIDIA began developing its Compute Unified Device Architecture, or CUDA.
CUDA's original purpose was to give the GPU an easy-to-use programming interface so that developers no longer had to learn complex shading languages or graphics-processing primitives. NVIDIA's initial idea was to give game developers a tool for graphics computing, which is what Jensen Huang calls "making graphics programmable."
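To see what "easy to use" meant in practice, here is a minimal, hedged sketch of a CUDA vector-addition program (the kernel name and sizes are illustrative, not taken from the article): the developer writes ordinary C-style code and the runtime spreads it across thousands of GPU threads, with no shading language in sight.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each GPU thread adds one pair of elements: plain C-style code,
// no shading language or graphics primitives involved.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes), *hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f (expected 3.0)\n", hc[0]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```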
After its launch, however, CUDA struggled to find a killer application and lacked major customer support, while NVIDIA still had to spend heavily on application development, support services, and marketing. Then came the 2008 financial crisis: NVIDIA's revenue fell sharply as graphics card sales slumped, and its stock at one point dropped to just $1.50, worse than AMD at its lowest.
It wasn't until 2012 that two of Geoffrey Hinton's students entered the ImageNet image-recognition competition using NVIDIA GPUs. Training on GTX 580 cards with CUDA, their entry was dozens of times faster than the runner-up and more than 10 percentage points more accurate.
It wasn't just the ImageNet model itself that shocked the industry. This neural network, which drew on 14 million images and required a total of 262 quadrillion floating-point operations, needed only four GTX 580s and about a week of training. For reference, Google's "cat" experiment used 10 million images, 16,000 CPUs, and 1,000 machines.
That competition was not only a historic turning point for AI; it also opened a breach for NVIDIA. The company began working with the industry to build out the AI ecosystem, backing open-source AI frameworks and partnering with Google, Facebook, and others to advance technologies such as TensorFlow.
This was equivalent to completing what Huang calls the second step: "opening up the GPU for programmability for all kinds of things."
When the computing value of GPUs was finally recognized, the big manufacturers woke up to the fact that CUDA, which NVIDIA had spent years iterating on and paving, had become a high wall that AI could not get around.
To build the CUDA ecosystem, NVIDIA provides developers with a wealth of libraries and tools, such as cuDNN, cuBLAS, and TensorRT, that make it easy to handle deep learning, linear algebra, and inference acceleration. It also offers a complete development toolchain, including the CUDA compiler and optimization tools, which makes GPU programming and performance tuning much easier.
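As a hedged illustration of what these libraries buy a developer (a minimal sketch assuming a CUDA toolkit with cuBLAS installed; the values are made up), a single cuBLAS call performs the vector arithmetic y = αx + y without writing any kernel at all:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 1 << 20;
    const float alpha = 2.0f;
    float *hx = (float*)malloc(n * sizeof(float));
    float *hy = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    // cuBLAS helpers copy data between host and device.
    cublasSetVector(n, sizeof(float), hx, 1, dx, 1);
    cublasSetVector(n, sizeof(float), hy, 1, dy, 1);

    // One library call replaces a hand-written kernel: y = alpha*x + y.
    cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);

    cublasGetVector(n, sizeof(float), dy, 1, hy, 1);
    printf("y[0] = %f (expected 4.0)\n", hy[0]);

    cublasDestroy(handle);
    cudaFree(dx); cudaFree(dy);
    free(hx); free(hy);
    return 0;
}
```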
At the same time, NVIDIA works closely with popular deep learning frameworks such as TensorFlow, PyTorch, and MXNet, giving CUDA a significant advantage in deep learning workloads.
This "help them into the saddle, then walk the first mile with them" dedication allowed NVIDIA to double the number of developers in the CUDA ecosystem in just two and a half years.
Over the past decade, NVIDIA has brought CUDA courses to more than 350 universities, while professional developers and domain experts on its platform provide rich support for CUDA applications by sharing experience and answering hard questions.
More importantly, NVIDIA knows that hardware alone is a weak moat because it creates no user stickiness, so it bundles hardware with software: GPU rendering requires CUDA, AI denoising requires OptiX, autonomous-driving computing requires CUDA…
Although NVIDIA currently controls some 90% of the AI computing power market with GPU + NVLink + CUDA, there is more than one crack in the empire.
Cracks
AI manufacturers have long suffered under CUDA, and that is no exaggeration.
CUDA's magic lies in its position right at the junction of software and hardware. For software, it is the cornerstone of the entire ecosystem, and competitors find it hard to bypass CUDA while staying compatible with NVIDIA's ecosystem; for hardware, CUDA's design is essentially a software abstraction of NVIDIA's hardware, with almost every core concept corresponding to a hardware concept in the GPU.
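A hedged sketch of that correspondence (illustrative only, not code from the article): a thread block is scheduled onto a streaming multiprocessor, __shared__ memory lives in that SM's on-chip SRAM, and the last steps of a reduction run inside a 32-thread warp, the hardware's basic SIMT unit, so code written against CUDA's abstractions is implicitly written against NVIDIA's silicon.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Block-level sum reduction. Each CUDA abstraction below shadows a piece of
// NVIDIA hardware (illustrative sketch; block size fixed at 256 threads).
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float tile[256];          // lives in the SM's on-chip SRAM
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                     // barrier across this block / SM

    // Tree reduction; the final steps execute within one 32-thread warp,
    // the hardware's basic SIMT execution unit.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];  // one partial sum per block
}

int main() {
    const int n = 1024, threads = 256, blocks = n / threads;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *din, *dout;
    cudaMalloc(&din, n * sizeof(float));
    cudaMalloc(&dout, blocks * sizeof(float));
    cudaMemcpy(din, h, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, threads>>>(din, dout, n);

    float partial[blocks];
    cudaMemcpy(partial, dout, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float sum = 0.0f;
    for (int b = 0; b < blocks; ++b) sum += partial[b];
    printf("sum = %f (expected 1024.0)\n", sum);

    cudaFree(din); cudaFree(dout);
    return 0;
}
```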
That leaves competitors with only two options:
1 Bypass CUDA and build a software ecosystem from scratch, which means facing the huge challenge of NVIDIA's user stickiness;
2 Stay compatible with CUDA, which brings two problems of its own: if your hardware takes a different route from NVIDIA's, the result can be inefficient and awkward, and since CUDA evolves along with NVIDIA's hardware features, a compatible player can only follow.
But to escape NVIDIA's grip, both options have been tried.
In 2016, AMD launched ROCm, an open-source GPU computing ecosystem, and provided HIP tools that are source-compatible with CUDA, which is the "follow the route" approach, as the sketch below illustrates.
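To make the "follow" approach concrete, here is a hedged sketch of why such translation is even possible: HIP mirrors CUDA's runtime API almost name for name (the code below is ordinary CUDA and is illustrative only; the HIP equivalents are shown in comments). AMD's hipify tools perform this renaming automatically, which also means every new CUDA feature has to be chased after the fact.

```cuda
#include <cuda_runtime.h>   // HIP: #include <hip/hip_runtime.h>

__global__ void scale(float* x, float s, int n) {   // kernel code is identical in HIP
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1024;
    float* d;
    cudaMalloc(&d, n * sizeof(float));            // HIP: hipMalloc(&d, ...)
    cudaMemset(d, 0, n * sizeof(float));          // HIP: hipMemset(d, 0, ...)
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);  // same triple-chevron launch
    cudaDeviceSynchronize();                      // HIP: hipDeviceSynchronize()
    cudaFree(d);                                  // HIP: hipFree(d)
    return 0;
}
```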
However, with a thin set of toolchain and library resources and the high cost of developing and maintaining compatibility, the ROCm ecosystem has struggled to grow. On GitHub, more than 32,600 developers contribute to CUDA package repositories, while ROCm has fewer than 600.
The difficulty of the NVIDIA-compatible route is that its update cadence can never keep pace with CUDA, and full compatibility is hard to achieve:
1 Iteration is always a step behind: NVIDIA GPUs iterate quickly on microarchitectures and instruction sets, and much of the upper software stack has to make matching feature updates. But AMD cannot know NVIDIA's product roadmap, so its software updates will always lag one step behind; AMD may have only just announced support for CUDA 11 when NVIDIA launches CUDA 12.
2 Full compatibility is difficult and multiplies developers' workload: software as large as CUDA is extremely complex, and AMD would need to invest heavy manpower and resources for years, even more than a decade, to catch up. Functional differences are inevitable, and poorly handled compatibility hurts performance (99% may look the same, but resolving the remaining 1% of differences can consume 99% of a developer's time).
Other companies have chosen to bypass CUDA altogether, such as Modular, founded in January 2022.
Modular's idea is to keep the barrier to entry as low as possible, though its play looks more like a flanking attack: it proposes an AI engine "for improving the performance of artificial intelligence models" that uses a "modular" approach to solve the problem that "current AI application stacks are often coupled to specific hardware and software."
To accompany this AI engine, Modular has also developed the open-source programming language Mojo. Think of it as a language "built for AI": Modular uses it to build the tools that plug into the engine, while keeping seamless integration with Python to hold down the learning cost.
The problem with Modular, however, is that its vision of development tools for every platform may be too idealistic.
Although it carries the billing of "beyond Python" and the endorsement of Chris Lattner's reputation, Mojo, as a new language, still has to prove itself among a large body of developers.
The AI engine faces even more problems: not only agreements with numerous hardware vendors but also compatibility across platforms. All of this needs a long period of polishing, and what NVIDIA will have evolved into by then, no one can say.
Challenger Huawei
On October 17, the United States updated its export control rules for AI chips, barring companies such as NVIDIA from exporting advanced AI chips to China. Under the latest rules, NVIDIA's chip exports to China, including the A800 and H800, will be affected.
Earlier, after the NVIDIA A100 and H100 were restricted from export to China, the cut-down A800 and H800 were designed specifically for the Chinese market to comply with the rules, and Intel likewise launched its Gaudi2 AI chip for China. Now, it seems, these companies will have to adjust their response again under the new round of export bans.
In August this year, the Mate 60 Pro, equipped with Huawei's self-developed Kirin 9000S chip, suddenly went on sale, instantly setting off a wave of public attention and drowning out another piece of news that broke at almost the same time.
Liu Qingfeng, chairman of iFLYTEK, made a rare statement at a public event, saying that Huawei's GPU can now benchmark against the NVIDIA A100, but only if Huawei sends a dedicated working group to do optimization work on site at iFLYTEK.
Sudden statements like this usually carry deeper intent; no one could have predicted what was coming, yet in effect it became a response to the chip ban that arrived two months later.
Huawei's GPU offering, the Ascend AI full-stack software and hardware platform, consists of five layers: from the bottom up, Atlas series hardware, the heterogeneous computing architecture, AI frameworks, application enablement, and industry applications.
In essence, Huawei has built a replacement for each piece of NVIDIA's stack: at the chip layer sit the Ascend 910 and Ascend 310, while the heterogeneous computing architecture CANN benchmarks against NVIDIA's core software layer of CUDA + cuDNN.
Of course, gaps remain, and practitioners familiar with the platform sum them up in two points:
1 Single-card performance lags: there is still a gap between the Ascend 910 and the A100, but the saving grace is that it is cheap and can be stacked in volume, so once deployed at cluster scale the overall gap is not large;
2 The ecosystem disadvantage is real, but Huawei is trying to catch up; for example, through cooperation between the PyTorch community and Ascend, PyTorch 2.1 now supports Ascend NPUs, which means developers can build models for Ascend directly on PyTorch 2.1.
At present, Huawei Ascend mainly runs Huawei's own closed-loop large-model products; any public model must be deeply optimized before it can run on Huawei's platform, and that optimization work relies heavily on Huawei itself.
In the current context, Ascend has special significance.
In May this year, Zhang Dixuan, president of Huawei's Ascend Computing business, revealed that the Ascend AI software and hardware platform had incubated and adapted more than 30 mainstream large models, and that more than half of China's home-grown large models are built on it, including the Pengcheng series, the Zidong series, and Huawei Cloud's Pangu series. In August, Baidu also officially announced the adaptation of its PaddlePaddle framework and ERNIE (Wenxin) models to Ascend AI.
And according to a chart circulating online, China's intelligent supercomputing centers, apart from those that are undisclosed, run almost entirely on Ascend; it is also said that after the new round of chip restrictions, 30-40% of Huawei's chip production capacity will be reserved for Ascend clusters, with the rest going to Kirin.
Epilogue
In 2006, when NVIDIA began unfolding this grand narrative, no one thought CUDA would be a revolutionary product. Huang had to persuade the board of directors to invest $500 million a year on a bet with an unknown payback horizon of more than a decade, at a time when NVIDIA's annual revenue was only about $3 billion.
But in every business story built on the keywords of technology and innovation, there are always those who win big through stubborn commitment to long-term goals, and NVIDIA and Huawei are among the best of them.