Comprehensive Evaluation of Generative 3D Large Models in the AI Era
By OpenTaskAI

Author: '数字生命卡兹克'

In all of my previous articles, I have always classified AI into four modalities:

AI Text (large language models), AI Drawing, AI Sound, AI Video

However, in my recent communications and interviews, there has been a presence that exists outside these four modalities, which has been repeatedly mentioned.

AI 3D.

On the evening of December 20th, this Wednesday, I was happily chatting for an hour during an interview with a friend. At the end, he suddenly asked a question that was not on the agenda: "What do you think about 3D in the AI era?"

Honestly, I was a bit stunned at the time. I had never seriously thought about this question and just brushed it off with a bit of my own understanding.

But this was not the first person to discuss this topic with me. In the past month, AI 3D has been brought up numerous times in my various information channels.

Therefore, I have also decided to write this article, to talk about what I consider the fifth modality: AI 3D, as well as the current state of this field.

Without further ado, let's begin.

Currently, there are about five mainstream players in the AI 3D field: Tripo, Meshy, sudoAI, CSM, LumaAI.

CSM and Luma are quite established companies. Luma previously specialized in real-scene scanning, which I have used for a long time. Recently, they launched a text-to-3D product called Genie, which is currently hosted on Discord and does not support image-to-3D. CSM developed a real-time drawing-to-3D feature, but it does not support text-to-3D.

Meshy has been in the game for a while too; I remember they released their product around July or August. Tripo and sudoAI are relatively newer, especially Tripo, which was just released on December 21st.

Discussing AI 3D products inevitably brings us to the core functionalities and pain points, which naturally involve modeling.

I'll briefly describe the 3D workflow to give everyone an idea. It generally goes like this: conceptual design, 3D modeling, texture mapping, rigging (skeleton binding), animation, lighting, rendering, compositing.

The visual effects you see in movies and the scenes you see in video games all require modeling, texturing, and rendering. The initial outcome of modeling is a basic model.

After obtaining the model, only then can all subsequent tasks be accomplished.

Therefore, modeling is extremely important, but it is also the most time-consuming step, often taking up 30% to 50% of the total time. In the 3D field, nothing is more important, more tedious, or more in need of AI optimization than modeling.

The AI modeling products from these companies are quite similar, built around two functions: text-to-3D and image-to-3D.
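To put that 30% to 50% figure in perspective, here is a rough back-of-the-envelope sketch using Amdahl's law. It only asks how much faster the whole pipeline gets if modeling alone is accelerated. The 10x modeling speedup is a made-up number for illustration, not a measurement of any product in this article.

```python
def pipeline_speedup(modeling_share: float, modeling_speedup: float) -> float:
    """Overall speedup when only the modeling stage gets faster (Amdahl's law)."""
    if not 0.0 <= modeling_share <= 1.0:
        raise ValueError("modeling_share must be between 0 and 1")
    if modeling_speedup <= 0:
        raise ValueError("modeling_speedup must be positive")
    # Time left over: the untouched stages plus the shortened modeling stage.
    remaining = (1.0 - modeling_share) + modeling_share / modeling_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    # Hypothetical assumption: AI makes modeling 10x faster.
    for share in (0.3, 0.5):
        overall = pipeline_speedup(share, 10.0)
        print(f"modeling = {share:.0%} of total time -> {overall:.2f}x faster overall")
```

Under that assumption the whole project finishes roughly 1.4x to 1.8x faster, and the gain is capped by the stages that AI has not touched yet.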

Text-to-3D and image-to-3D are actually very easy to understand. It's similar to the concept in AI Video, where text or images are used to generate a 4-second clip. In AI 3D, however, they are used to generate a model.

So, the criteria for evaluating these products are quite straightforward:
the quality and accuracy of the generated models.

Normally, image-to-3D is what we use the most.

So, I first generated an image with Midjourney V6, using this prompt:

Basketball game assets, Blender 3D model, obj fbx glb 3D model, default pose, PNG image with transparent background.

Then, I submitted this image to Tripo, Meshy, sudo, and CSM. Since Luma currently does not support image-to-3D, it was excluded from this image-to-3D comparison.

To be honest, my expectations for AI 3D weren't very high, which is why I initially chose something as simple as a basketball.

I downloaded all the models and rendered them into animated GIFs in Blender, keeping all camera settings, HDR, and parameters uniform. This allows everyone to directly compare the output from these four products.
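One way to keep the camera identical across models is to compute the camera path once and reuse it for every render. The sketch below is plain Python, not my actual Blender setup. It assumes a turntable-style orbit around the model, and the radius, height, and frame count are made-up values.

```python
import math


def turntable_positions(radius: float, height: float, frames: int):
    """Return one (x, y, z) camera position per frame, evenly spaced on a circle."""
    if frames <= 0:
        raise ValueError("frames must be positive")
    positions = []
    for i in range(frames):
        angle = 2.0 * math.pi * i / frames
        positions.append((radius * math.cos(angle), radius * math.sin(angle), height))
    return positions


# Hypothetical settings, shared by every model so that only the mesh differs.
SHARED_SETTINGS = {"radius": 4.0, "height": 1.5, "frames": 36}

if __name__ == "__main__":
    for model_name in ("tripo", "meshy", "sudo", "csm"):
        path = turntable_positions(**SHARED_SETTINGS)
        print(model_name, len(path), "frames, first camera position:", path[0])
```

Inside Blender, each position could be keyed onto the camera for the matching frame, with the camera aimed at the origin where the model sits.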

As can be seen, only Tripo successfully merged the textures of the basketball, creating a realistic basketball. The textures in Meshy and sudo are obviously distorted, and the distortion is not a minor one that can be overlooked; it's a complete breakdown that renders them unusable. CSM's output also turned into a mess at the back.

Let's take a closer look at the modeling details in Blender.

CSM managed to create a slight shadow in the grooves of the basketball, whereas the models from Tripo and sudo are just not-so-round balls with some flaws, but still usable. Meshy's model, on the other hand, is completely unusable due to its breakdown.

In the case of the basketball, Tripo is far ahead.

The ranking would be: Tripo > CSM > sudo > Meshy.

Let's try a few more examples.

1. Cartoon Little Dragon.

Tripo continues to perform steadily. Meshy's model has a bunch of holes... sudo's texture is okay, but the modeling of the lower half and the structure of the tail at the back are completely distorted. CSM's model showed two faces when rotated, which gave me quite a scare, but its structural integrity was still acceptable...

The ranking for the Cartoon Little Dragon would be: Tripo > CSM > sudo > Meshy

2. Sweater. After all, creating clothing is an inevitable part of the modeling process...

Tripo's performance is nearly perfect, both in modeling and texturing. If you really want to nitpick, the only issue might be the lack of two holes at the cuffs (laughs). Meshy's modeling, as usual, has holes, and I've noticed a major issue with their texturing: it's always refined on the front but somewhat broken at the back. sudo's clothing model still has holes on the sides and unwanted connections. CSM has the same issue as Meshy with texturing; there's a huge difference between the front and the back.

The ranking for the sweater would be: Tripo > CSM > sudo > Meshy

3. A rose. Modeling flowers is one of the most challenging tasks, and it's basically at the highest difficulty level for current AI 3D. Let's use a rose to wrap up the image-to-3D comparison.

With Tripo, the front and back of the flower model are structurally sound, but there's a glitch with the leaves, creating some strange excess parts. Meshy, as usual, is all about appearances; the front looks impressive, but once turned around, it's full of holes again. The details on sudo's flower are distorted, almost obliterating the structure of the flower.

As for CSM... really, don't ask me what that lump is supposed to be. I don't know either, but I'm sure it's not a flower.

From these four examples, at least in the area of image-to-3D, Tripo is significantly ahead.

Overall ranking: Tripo > sudo > CSM = Meshy.
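Several of the failures above were holes in the mesh. I judged them by eye in Blender, but the same thing can be checked automatically: in a closed mesh every edge is shared by exactly two faces, so any edge used by only one face sits on the rim of a hole. The sketch below is a minimal, standard version of that check on a toy tetrahedron. It is not tied to any of the products here, and the mesh data is made up.

```python
from collections import Counter


def boundary_edges(faces):
    """Return edges used by exactly one face; a closed mesh has none."""
    counts = Counter()
    for face in faces:
        n = len(face)
        for i in range(n):
            a, b = face[i], face[(i + 1) % n]
            # Store each edge with sorted endpoints so direction does not matter.
            counts[(min(a, b), max(a, b))] += 1
    return sorted(edge for edge, used in counts.items() if used == 1)


# Toy example: a tetrahedron given as vertex-index triples.
tetrahedron = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]

if __name__ == "__main__":
    print("closed mesh:", boundary_edges(tetrahedron))
    # Dropping one face leaves a triangular hole with three open edges.
    print("with a hole:", boundary_edges(tetrahedron[:-1]))
```

A check like this only counts open edges. It says nothing about distorted textures or wrong shapes, which still need a human look.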

Let's now look at text-to-3D. CSM does not support text-to-3D, but LumaAI's Genie does, so this comparison will be between Tripo, Meshy, sudoAI, and LumaAI.

Text-to-3D really depends on the fundamental capabilities of the model itself. With image-to-3D, the image comes from somewhere else, so the test is more about how well the large model generalizes to arbitrary inputs. If your image-to-3D result is poor, you can argue that the style of the image generated by MJ doesn't suit your 3D large model. But with text-to-3D, everything is in-house and rests on your own foundation, so if the result is poor, the problem really is yours.

The process of text-to-3D is somewhat similar to Runway's text-to-video. Runway provides four initial frames after a prompt, and you choose which one to use to generate the rest of the video.

For text-to-3D, it first takes about ten seconds to generate four rough preview models based on your prompt, and you can decide which one to use for further refinement. It looks something like this.

The preliminary preview models are quite rough, but they allow you to choose the general style you want.

I'll start with the first prompt:

"Spiderman dressed in Christmas style with a Christmas hat, highest quality".

Both Tripo and Luma produced very good results. Tripo tends to be more realistic, while Luma leans towards a cartoonish style. The only flaw in Luma's model is the appearance of two inexplicable white patches on the knees. Meshy turned it into something resembling a gourd... sudo's texture accuracy is not quite up to par, and there's a bug at the junction of the hat.

The ranking for the Spiderman prompt would be: Tripo > Luma > sudo > Meshy.

Now, let's try another one, a catgirl:

"An anime catgirl."

Tripo and Luma are still as reliable as ever. Meshy's result is a bit eerie, as the texture feels completely flat, like paper... sudo ended up creating something resembling a pillow...

The ranking for the anime catgirl prompt would be: Tripo > Luma > Meshy > sudo.

For the final case, let's create a 3D asset for a game, a golden pistol:

"Golden pistol, Unreal Engine, highest quality."

I won't comment on the specific details of the pistol; you can see for yourself. Both Luma and Tripo are strong contenders, but Luma has a bit more finesse in the details of the muzzle.

Luma > Tripo > Meshy > sudo

In text-to-3D, overall, Tripo and Luma are significantly ahead, with Tripo sometimes outperforming Luma in certain details.
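The overall impression above is my own judgment, but the three per-prompt rankings can also be combined mechanically. The sketch below uses a simple Borda count, in which first place out of four earns 3 points and last place earns 0. That scoring rule is one assumption among many possible ones, and the rankings are copied from the three prompts above.

```python
def borda_scores(rankings):
    """Sum Borda points: the top entry of an n-item ranking gets n - 1, the last gets 0."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, name in enumerate(ranking):
            scores[name] = scores.get(name, 0) + (n - 1 - position)
    return scores


# Per-prompt text-to-3D rankings from this article, best first.
TEXT_TO_3D = [
    ["Tripo", "Luma", "sudo", "Meshy"],  # Christmas Spiderman
    ["Tripo", "Luma", "Meshy", "sudo"],  # anime catgirl
    ["Luma", "Tripo", "Meshy", "sudo"],  # golden pistol
]

if __name__ == "__main__":
    scores = borda_scores(TEXT_TO_3D)
    # Highest score first; ties broken alphabetically.
    overall = sorted(scores, key=lambda name: (-scores[name], name))
    print(scores)
    print(" > ".join(overall))
```

With these three prompts the count comes out as Tripo 8, Luma 7, Meshy 2, sudo 1, which matches the picture of two leaders well ahead of the rest. Three prompts is a tiny sample, so the order within each pair should not be read too closely.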

In the combined fields of image-to-3D and text-to-3D, Tripo is currently the undisputed leader.

However, both Tripo and Luma still have quite a few flaws, such as messy wireframes, a high likelihood of facial texture distortions, and less than perfect rendering of metallic materials, among others.

But I believe time will solve everything. For a product like Tripo, whose first generation is only three days old, you can't expect perfection right away, not to mention that the field of AI 3D is just beginning to evolve.

Currently, AI 3D, led by Tripo and Luma, is roughly equivalent to the AI drawing level of Midjourney V2 or V3, while other companies are still at the V1 level.

The major breakthrough for Midjourney came with V4, which started to revolutionize the entire industry, and its recent V6 has been a game-changer.

AI 3D is now on the eve of its own GPT moment.

The day of the explosion might come faster than any of us can imagine.

In Conclusion

In 2019, I created a 3D piece to commemorate the departure of a gaming companion.

I spent a whole month of evenings and weekends creating that image. 90% of the models were handcrafted by me, and the workload was extremely, extremely painful. Modeling alone took up 70% of my time. If I had to do it again, I definitely wouldn't, as I don't want to experience that kind of torment ever again.

Do you know how many things need to be modeled in games and films? 

Take "Elden Ring", for example: it features hundreds of bosses and countless scenes, each with numerous 3D assets ranging from bosses and castles to weapons, armor, candles, and tables.

Even with FromSoftware's top-tier production capabilities and industrial standards, it took a full five years to develop.

"Baldur's Gate 3" was developed by Larian Studios over six years, with a 400-person team at its peak.

"The Wandering Earth 2" had a total production period of three years.

I've also discussed with many professionals in film post-production what they need AI to optimize the most, and the answer, surprisingly unanimous, is: 
3D Modeling.

I'm extremely optimistic about AI 3D, not because it's a new field, but because it can truly liberate the productivity of content creators, allowing them to focus more on creativity, preserving their creative energy. 

Modeling is just one aspect; there's also AI texture mapping, AI rigging, AI motion capture, and so on.

When AI is used to reshape the entire 3D pipeline, enabling the entire process to be interconnected, the efficiency will skyrocket. 

And it's not just professionals in gaming and film who need this.

There's a bigger player where 3D assets are fundamental: the Metaverse. 

I've never seen the Metaverse as a gimmick; I firmly believe in its future, although it's still a bit too far off at the moment, because the infrastructure and production capacity can't keep up. Without an ultra-efficient AI 3D process and AI-assisted construction, it's hardly achievable.

AI 3D is the best creative engine for the Metaverse.

I have always believed that in the future, 3D content will expand infinitely, allowing everyone to become a super creator, like a god creating new worlds, crafting your own Metaverse.

That day is not too far off. 

In 2024, we might just witness the accelerated future of AI 3D.

Reposted from the WeChat Official Account: 数字生命卡兹克.

OpenTaskAI is a global marketplace that connects AI freelancers and business needs. Our mission is to enable more people to achieve self-worth with AI tools.

