How Unity’s ECS expands your optimization space

Think of optimization space as the number of options you have for optimizing your game. Unity’s ECS expands this significantly. I imagine the optimization space of the current programming paradigm (OOP and MonoBehaviour) as my room, and that of Unity’s ECS as a convention center. Let’s call Unity’s ECS uECS from now on. I’m referring to pure ECS here, by the way.

Here are the optimization options in the current paradigm:

  • Algorithms – Some algorithms are just better than others.
  • Multithreading – Modern CPUs have multiple cores. You improve performance by distributing computation across multiple threads that execute in parallel.
  • Unity settings – Some settings lead to better execution.
  • Art assets – Big textures or models with a high polygon count affect performance.
  • Game design – The game design can be changed so that not too many entities are required for the game to work.

uECS has all these options plus more:

  • Memory coherence – Basically more cache hits. The current paradigm also has this option, but it’s harder to pull off because of how memory is allocated in OOP. uECS is all about memory coherence.
  • Jobs System – Makes writing multithreaded code easier.
  • Burst Compiler – This is the one that expands the optimization space the most. Somehow, devs at Unity found a way to significantly speed up the execution of C# code when only a subset of the language is used.
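To make the memory coherence point a bit more concrete, here is a minimal sketch in plain C# with no Unity dependency. The type and method names (SpriteObject, SpriteData, CoherenceSketch) are hypothetical and exist only for illustration. Both loops compute the same result; the difference is that an array of class instances stores references to objects that can live anywhere on the heap, while an array of structs stores the values back to back, which is the kind of layout uECS aims for.

```csharp
// Hypothetical types for illustration only; not taken from Unity or the sprite manager.
public sealed class SpriteObject {   // OOP style: every instance is a separate heap allocation
    public float X, Y;
}

public struct SpriteData {           // Data-oriented style: a plain value type
    public float X, Y;
}

public static class CoherenceSketch {
    // The array holds references, so each element access may jump to a different heap location.
    public static float SumX(SpriteObject[] sprites) {
        float sum = 0f;
        for (int i = 0; i < sprites.Length; ++i) {
            sum += sprites[i].X;
        }
        return sum;
    }

    // The array holds the values themselves, laid out contiguously in one block of memory.
    public static float SumX(SpriteData[] sprites) {
        float sum = 0f;
        for (int i = 0; i < sprites.Length; ++i) {
            sum += sprites[i].X;
        }
        return sum;
    }
}
```

How much the second layout helps depends on the data size and access pattern, so treat this as a picture of the idea rather than a benchmark.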

The path to high performance code

I reserve 30 minutes to an hour every day to study a new technology. For the past few months, I’ve been using this time to study uECS. At first, my goal was not to write fast code. It was just to learn the ECS way, because it’s really new and so different from OOP, and I wanted to practice it. I didn’t even think about learning the Jobs System and Burst Compiler. Let me show you an example of a system that was first written as a ComponentSystem, then as a JobComponentSystem, and finally compiled with the Burst compiler.

My current side project is a sprite manager that can render non-static sprites in a single draw call. It’s kind of like the code that I used here but in pure ECS. The single draw call is possible by maintaining just one mesh. Every frame, we manually transform the vertices of each quad using the entity’s transform matrix. The majority of the work is in this transformation. Each sprite has 4 vertices, so a thousand sprites need 4,000 matrix-vector multiplications per frame.

Here’s the first iteration of the system that handles these transformations:

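// Namespaces assumed from the preview Entities packages this code targets
using Unity.Collections;   // [ReadOnly]
using Unity.Entities;      // ComponentSystem, ComponentDataArray, [Inject]
using Unity.Mathematics;   // float4, float4x4, math
using Unity.Transforms;    // LocalToWorld, Static

public class SpriteManagerTransformSystem : ComponentSystem {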

    private struct Data {
        public readonly int Length;

        public ComponentDataArray<Sprite> Sprite;

        [ReadOnly]
        public ComponentDataArray<LocalToWorld> LocalToWorld;

        // We only collect non static sprites so we reduce computation
        [ReadOnly]
        public SubtractiveComponent<Static> Static;
    }

    [Inject]
    private Data data;

    protected override void OnUpdate() {
        int length = this.data.Length;
        for (int i = 0; i < length; ++i) {
            Process(i);
        }
    }

    private void Process(int index) {
        Sprite sprite = this.data.Sprite[index];
        LocalToWorld transform = this.data.LocalToWorld[index];

        float4x4 matrix = transform.Value;
        sprite.transformedV1 = math.mul(matrix, new float4(sprite.v1, 1)).xyz;
        sprite.transformedV2 = math.mul(matrix, new float4(sprite.v2, 1)).xyz;
        sprite.transformedV3 = math.mul(matrix, new float4(sprite.v3, 1)).xyz;
        sprite.transformedV4 = math.mul(matrix, new float4(sprite.v4, 1)).xyz;

        this.data.Sprite[index] = sprite; // Modify the data
    }

}

This is very straightforward. We collect all sprites that have a matrix and are non-static. We loop through them, computing and then storing the transformed vertices of each one.
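For readers who want to try the per-sprite math outside Unity, here is a minimal sketch using System.Numerics instead of Unity.Mathematics. QuadSketch and QuadTransformSketch are hypothetical names, not the actual Sprite component. One difference to keep in mind: System.Numerics uses a row-vector convention, so Vector3.Transform(v, m) plays the role that math.mul(matrix, new float4(v, 1)).xyz plays in the system above.

```csharp
using System.Numerics;

// Hypothetical stand-in for the Sprite component, written without Unity types.
public struct QuadSketch {
    public Vector3 V1, V2, V3, V4;   // local-space corners of the quad
    public Vector3 TransformedV1, TransformedV2, TransformedV3, TransformedV4;
}

public static class QuadTransformSketch {
    // Applies one matrix to all four corners and returns the updated copy.
    public static QuadSketch Transform(QuadSketch quad, Matrix4x4 localToWorld) {
        // Vector3.Transform treats the input as (x, y, z, 1), so translation is included.
        quad.TransformedV1 = Vector3.Transform(quad.V1, localToWorld);
        quad.TransformedV2 = Vector3.Transform(quad.V2, localToWorld);
        quad.TransformedV3 = Vector3.Transform(quad.V3, localToWorld);
        quad.TransformedV4 = Vector3.Transform(quad.V4, localToWorld);
        return quad;
    }
}
```

Because the quad is a struct, the method works on a copy and returns it, which mirrors the read, modify, write-back pattern in Process.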

[Image: 16KSprites]

I have a demo scene where I can specify how many sprites to create. The scene also moves the sprites back and forth between random positions just to show some movement. This system can only handle around 3,000 sprites on my PC before the frame rate drops below 60fps. This is what the profiler looks like:

[Image: TransformSystemNonJob]

You’ll see that SpriteManagerTransformSystem is running in the main thread and takes 13.36ms in that snapshot. A limit of 3,000 sprites is quite underwhelming, so I decided to try the job system. Here’s the same system but now “jobified”.

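// Namespaces assumed from the preview Entities packages this code targets
using Unity.Collections;   // [ReadOnly]
using Unity.Entities;      // JobComponentSystem, ComponentDataArray, [Inject]
using Unity.Jobs;          // IJobParallelFor, JobHandle
using Unity.Mathematics;   // float4, float4x4, math
using Unity.Transforms;    // LocalToWorld, Static

public class SpriteManagerTransformSystem : JobComponentSystem {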

    private struct Data {
        public readonly int Length;

        public ComponentDataArray<Sprite> Sprite;

        [ReadOnly]
        public ComponentDataArray<LocalToWorld> LocalToWorld;

        // We only collect non static sprites so we reduce computation
        [ReadOnly]
        public SubtractiveComponent<Static> Static;
    }

    [Inject]
    private Data data;

    private struct Job : IJobParallelFor {
        public Data Data;

        public void Execute(int index) {
            Sprite sprite = this.Data.Sprite[index];
            LocalToWorld transform = this.Data.LocalToWorld[index];

            float4x4 matrix = transform.Value;
            sprite.transformedV1 = math.mul(matrix, new float4(sprite.v1, 1)).xyz;
            sprite.transformedV2 = math.mul(matrix, new float4(sprite.v2, 1)).xyz;
            sprite.transformedV3 = math.mul(matrix, new float4(sprite.v3, 1)).xyz;
            sprite.transformedV4 = math.mul(matrix, new float4(sprite.v4, 1)).xyz;

            this.Data.Sprite[index] = sprite; // Modify the data
        }
    }

    protected override JobHandle OnUpdate(JobHandle inputDeps) {
        Job job = new Job() {
            Data = this.data
        };

        return job.Schedule(this.data.Length, 64, inputDeps);
    }

}
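The second argument passed to Schedule above (64) is the batch size: the indices from 0 to Length are handed to worker threads in chunks of that many, and Execute is called once per index within a chunk. The following is only a sketch of that splitting, in plain C# with a hypothetical BatchSketch helper. It does not model how Unity’s scheduler assigns batches to threads.

```csharp
using System;
using System.Collections.Generic;

public static class BatchSketch {
    // Splits the index range [0, length) into consecutive batches of at most batchSize indices.
    // Each tuple is (Start inclusive, End exclusive).
    public static List<(int Start, int End)> Split(int length, int batchSize) {
        if (batchSize <= 0) {
            throw new ArgumentOutOfRangeException(nameof(batchSize));
        }

        var batches = new List<(int Start, int End)>();
        for (int start = 0; start < length; start += batchSize) {
            int end = Math.Min(start + batchSize, length);   // the last batch may be shorter
            batches.Add((start, end));
        }
        return batches;
    }
}
```

For example, BatchSketch.Split(8000, 64) yields 125 batches. A smaller batch size spreads work more evenly but adds scheduling overhead, so the best value is something to measure rather than assume.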

This version can handle up to 8,000 sprites before the frame rate drops below 60fps. That’s 5,000 more sprites than before, which is already a sizable gain. The profiler looks like this:

[Image: TransformSystemAsJob]

SpriteManagerTransformSystem took 11.98ms in the main thread, but at the bottom you can see that it now also uses the worker threads, at 19ms each.

Here’s the part where I say “But wait… There’s more!” You only need to specify a single attribute so that your job will be compiled using the Burst compiler. It’s really just this:

[BurstCompile] // One attribute to rule them all
private struct Job : IJobParallelFor {
    ...
}

With that attribute, my demo scene can support up to 30,000 sprites. I’m using two meshes at this point because a single mesh can hold at most 16,383 quads (65,535 vertices / 4 vertices per quad). This is a really big gain! This is what the profiler looks like:

[Image: TransformSystemAsJobWithBurst]

In this snapshot, I can no longer locate SpriteManagerTransformSystem in the main thread. It has become so small that I would have to zoom in to see it. It only took 0.04ms. You can also see that the time spent in each worker thread dropped from 19ms in the previous snapshot to 1.7ms here, roughly an 11x reduction. The Burst compiler is really effective in making your code run faster. It’s just unbelievable.

This also implies that I could squeeze in more systems if I wanted to. This is why I’m imagining that the optimization space grew from a bedroom into a large convention center.
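The two-mesh arrangement mentioned earlier follows from simple arithmetic on the per-mesh vertex limit. Here is a minimal sketch of that calculation. MeshBudget and MeshesNeeded are hypothetical names, and the 65,535 figure is the one quoted in the text.

```csharp
public static class MeshBudget {
    public const int MaxVertices = 65535;   // per-mesh vertex limit quoted in the text
    public const int VerticesPerQuad = 4;   // each sprite is one quad

    // Integer division rounds down: 65535 / 4 = 16383 whole quads per mesh.
    public const int MaxQuadsPerMesh = MaxVertices / VerticesPerQuad;

    // Number of meshes required to hold spriteCount quads (ceiling division).
    public static int MeshesNeeded(int spriteCount) {
        if (spriteCount <= 0) {
            return 0;
        }
        return (spriteCount + MaxQuadsPerMesh - 1) / MaxQuadsPerMesh;
    }
}
```

MeshBudget.MeshesNeeded(30000) returns 2, which matches the two meshes used for the 30,000-sprite test.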

Avoid Premature Optimization

Avoiding premature optimization still applies here. I would advise against using the Burst compiler by default. While the job system is great, code can get harder to read when written this way. It has some gotchas and may lead to more errors. There are also many job types, and reading code from system to system can get overwhelming if they use different ones.

I tried reading Unity’s ECS demo, the one with red and blue armies. I couldn’t follow it at all. Almost everything is jobified.

While coding in ECS, I always start with the non-job version because I find it easier. I only begin “jobifying” some systems when I start stress testing or scaling up the number of entities. Always look at the profiler. Jobify only those systems that hog the CPU the most. You don’t really need to jobify all systems.

The Burst compiler is so good that you probably only need to jobify one or two systems to meet your performance budget. It’s perfectly fine to not use the jobs system if you’re not yet sure how it works. Treat jobs and Burst as optimization options. Use them only when needed.

4 thoughts on “How Unity’s ECS expands your optimization space”

  1. Can you elaborate on how you got your numbers? I wanted to get into ecs and do some before and after comparisons. I started by making a script to create a bunch of sprite renderers and then move them all every frame to get a baseline similar to your “dynamic position sprite” setup you describe. However I wanted to try it with just plain unity sprite renderers and none of the fancy batching code you talked about. In your post you mention getting about 3000 sprites before your framerate drops below 60fps. When I did this plain game object based test, I was surprised to see I could seemingly add 10k sprite renderers, while having them have their transform position updated every frame and achieved 60fps. This is without any fancy mesh batching stuff as you mention in your post. Unity reports it batched everything into 7 batches for me saving 9993 batches.

    Any thoughts?


    1. The system that I’m building is very different. I’m computing the transformed vertices in CPU time instead of passing it on to the GPU. That’s why it’s slower compared to using the default rendering. I do get the benefit of guaranteed 1 draw call. My problem with dynamic batching is that it becomes unpredictable when you have lots of moving objects within a scene that uses more than one texture. You get more draw calls if you have a scene with interleaved sprites that may be using different textures.


      1. Ah that is good to know. I did a follow up test where I took one of the sample ECS projects and changed it from 3d ships to 2d quads to simulate how fast that would be. I was able to get around 150k sprites currently that way. I did not mess with putting the burst compiler directive on there if it was not already so I might be able to get even more.
