crispigt.

Burst-jobbing the wave sampler

2026-05-14
C#, Unity, Burst, Jobs, Performance, Buoyancy

The previous post ended on a sad number: adaptive clipping with four refinement samples dropped the bunny from comfortable triple-digit fps to 50. The post also confidently asserted that the DLL wasn't the problem, because the C++ side was already fast. Profiler markers confirmed it: BuoyancyController.FixedUpdate was 76% wave sampling, 23% triangle clipping, and the actual P/Invoke into the C++ integrator was below 1%.
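For anyone who wants to reproduce that kind of breakdown, here is a minimal sketch of how the three stages could be wrapped in profiler markers. The marker names and the three placeholder methods are hypothetical, not the project's actual code; only ProfilerMarker and its Auto() scope come from Unity.Profiling.

```csharp
using Unity.Profiling;
using UnityEngine;

public class BuoyancyProfilingSketch : MonoBehaviour
{
    // Hypothetical marker names; each shows up as its own row in the Profiler's CPU view.
    static readonly ProfilerMarker WaveSampling = new ProfilerMarker("Buoyancy.WaveSampling");
    static readonly ProfilerMarker Clipping     = new ProfilerMarker("Buoyancy.Clipping");
    static readonly ProfilerMarker NativeStep   = new ProfilerMarker("Buoyancy.NativeStep");

    void FixedUpdate()
    {
        // Auto() returns a disposable scope: Begin on entry, End when the using block exits.
        using (WaveSampling.Auto()) { SampleWaves(); }
        using (Clipping.Auto())     { ClipTriangles(); }
        using (NativeStep.Auto())   { StepIntegrator(); }
    }

    // Placeholders standing in for the real pipeline stages.
    void SampleWaves() { }
    void ClipTriangles() { }
    void StepIntegrator() { }
}
```

The per-marker times in the Profiler then give the relative share of each stage within FixedUpdate.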

So most of the time was spent in one place: WaveManager.SampleHeight. For each mesh vertex, every FixedUpdate, that function does a two-iteration Newton solve over the inverse Gerstner displacement. Each Newton step evaluates three Gerstner waves, and each wave costs a sin, a cos, a sqrt and a normalize. Per vertex per tick, that's roughly 18 transcendentals (3 waves × 3 calls × 2 iterations). On a 250-triangle bunny that's a few thousand sin/cos/sqrt calls per FixedUpdate, in plain managed C# with Mathf.Sin etc.
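To make the cost concrete, here is a rough sketch of what such a solve can look like. It is not the actual WaveManager code. The Wave struct, its (direction, steepness, wavelength) layout, the gravity constant and the explicit 2×2 Jacobian are all assumptions, and MathF stands in for Mathf so the block runs outside Unity.

```csharp
using System;

// Hypothetical wave layout: horizontal direction, steepness, wavelength.
public readonly struct Wave
{
    public readonly float DirX, DirZ, Steepness, Wavelength;
    public Wave(float dirX, float dirZ, float steepness, float wavelength)
    { DirX = dirX; DirZ = dirZ; Steepness = steepness; Wavelength = wavelength; }
}

public static class GerstnerSketch
{
    const float Gravity = 9.81f; // assumed deep-water dispersion

    // One wave at rest position (x0, z0): unit direction, amplitude, sin/cos of the phase.
    static void Evaluate(in Wave w, float x0, float z0, float t,
                         out float dx, out float dz, out float amp,
                         out float sin, out float cos)
    {
        float k = 2f * MathF.PI / w.Wavelength;           // wave number
        float c = MathF.Sqrt(Gravity / k);                // phase speed (the sqrt)
        float len = MathF.Sqrt(w.DirX * w.DirX + w.DirZ * w.DirZ);
        dx = w.DirX / len; dz = w.DirZ / len;             // the normalize
        float phase = k * (dx * x0 + dz * z0 - c * t);
        amp = w.Steepness / k;
        sin = MathF.Sin(phase);
        cos = MathF.Cos(phase);
    }

    // Height of the surface above world position (x, z) at time t.
    public static float SampleHeight(Wave[] waves, float x, float z, float t, int iterations)
    {
        float x0 = x, z0 = z; // initial guess: the undisplaced point
        for (int it = 0; it < iterations; it++)
        {
            float fx = x0 - x, fz = z0 - z;      // residual F = p0 + D(p0) - target
            float jxx = 1f, jxz = 0f, jzz = 1f;  // symmetric Jacobian J = I + dD/dp0
            foreach (var w in waves)
            {
                Evaluate(w, x0, z0, t, out float dx, out float dz,
                         out float amp, out float s, out float c);
                fx += dx * amp * c;
                fz += dz * amp * c;
                float g = -w.Steepness * s;      // amp * k * d(cos)/d(phase)
                jxx += g * dx * dx; jxz += g * dx * dz; jzz += g * dz * dz;
            }
            float det = jxx * jzz - jxz * jxz;
            if (MathF.Abs(det) < 1e-6f) break;   // degenerate Jacobian: keep current guess
            x0 -= ( jzz * fx - jxz * fz) / det;  // p0 -= J^-1 * F
            z0 -= (-jxz * fx + jxx * fz) / det;
        }
        float height = 0f;
        foreach (var w in waves)
        {
            Evaluate(w, x0, z0, t, out _, out _, out float amp, out float s, out _);
            height += amp * s;
        }
        return height;
    }
}
```

Every call to Evaluate pays for one sin, one cos and two square roots, which is where the per-vertex cost piles up once it runs for every vertex on every tick.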

Burst exists for exactly this shape of code: a tight numeric loop over an array, no allocations, no managed types. So I moved it.

The job

Unity.Burst compiles a marked struct to LLVM-optimised native code. Combined with the Job System's IJobParallelFor, you get auto-vectorisation and multi-core dispatch from a single attribute and a single Schedule call.

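using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

[BurstCompile(FloatPrecision = FloatPrecision.Standard, FloatMode = FloatMode.Fast)]
public struct WaveSampleJob : IJobParallelFor
{
    // Inputs: mesh-space vertices, local-to-world matrix, wave parameters, time.
    [ReadOnly] public NativeArray<float3> localVerts;
    public float4x4 matrix;
    public float4 waveA, waveB, waveC;
    public float t;
    public int iterations; // Newton iterations per sample

    // Outputs, one element per vertex.
    [WriteOnly] public NativeArray<float3> worldVerts;
    [WriteOnly] public NativeArray<float>  heights;

    // Called once per vertex index, possibly on different worker threads.
    public void Execute(int i)
    {
        float3 lv = localVerts[i];
        float3 wv = math.transform(matrix, lv); // local -> world
        worldVerts[i] = wv;
        heights[i] = SampleHeight(wv.x, wv.z);
    }

    // ... Gerstner Newton iteration, identical to the managed version
}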

One thing with Burst is that it is built around Unity.Mathematics rather than Vector3 and Mathf.Sin: the job operates on float3/float4x4 and uses math.sin/math.sincos/math.sqrt, which are intrinsic-aware and vectorise cleanly. The body of the Newton loop is a one-to-one port of the managed version. No algorithmic change.
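To illustrate the flavour of that port, here is a sketch of a single Gerstner wave evaluation written against Unity.Mathematics. The class name and the float4 packing (xy = direction, z = steepness, w = wavelength) are assumptions for illustration, not necessarily how waveA/waveB/waveC are laid out in the job.

```csharp
using Unity.Mathematics;

public static class GerstnerMathSketch
{
    // Displacement of one wave at rest position p (world x/z) and time t.
    // Assumed packing: wave.xy = direction, wave.z = steepness, wave.w = wavelength.
    public static float3 Displacement(float4 wave, float2 p, float t)
    {
        float k = 2f * math.PI / wave.w;          // wave number
        float c = math.sqrt(9.81f / k);           // phase speed (replaces Mathf.Sqrt)
        float2 d = math.normalize(wave.xy);       // replaces Vector2.normalized
        float phase = k * (math.dot(d, p) - c * t);
        float a = wave.z / k;                     // amplitude

        // One call yields both sin and cos of the phase.
        math.sincos(phase, out float s, out float co);

        return new float3(d.x * a * co, a * s, d.y * a * co);
    }
}
```

The structure is line-for-line what the Mathf version would be; only the types and the math.* calls change.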

The world-transform got folded into the same job for free. The managed version did it in a separate loop pass; here it's one math.transform per element before the height sample, with no extra dispatch.

Wiring it into the controller

The existing BuoyancyController consumed managed arrays (Vector3[] worldVerts, float[] vertexHeights) downstream: in the DLL marshal, the AdaptiveClipper and the gizmo. Rather than refactor all of that to NativeArray, I kept the managed arrays as the public-ish interface and used the job purely as an accelerator:

job.Schedule(localVertsNA.Length, 32).Complete();

for (int i = 0; i < localVerts.Length; i++)
{
    float3 wv = worldVertsNA[i];
    worldVerts[i] = new Vector3(wv.x, wv.y, wv.z);
    vertexHeights[i] = heightsNA[i];
}

Schedule(length, batchSize) cuts the work into batches of 32 elements that the worker threads pick up. .Complete() is a synchronous wait: BuoyancyController.FixedUpdate needs the heights immediately for the rest of the pipeline, so there's no point trying to overlap with anything. The copy back is a straight loop over contiguous POD memory.

NativeArray allocation and disposal happen in Start/OnDestroy with Allocator.Persistent. The input array gets filled with localVerts once, and none of the arrays is ever reallocated.
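A minimal sketch of that lifecycle, assuming the vertices come from a MeshFilter on the same object (the class name and the vertex source are hypothetical; the array names follow the snippet above):

```csharp
using Unity.Collections;
using Unity.Mathematics;
using UnityEngine;

public class WaveBufferSketch : MonoBehaviour
{
    NativeArray<float3> localVertsNA;
    NativeArray<float3> worldVertsNA;
    NativeArray<float>  heightsNA;

    void Start()
    {
        // Assumption: the mesh is readable and its vertex count never changes.
        Vector3[] localVerts = GetComponent<MeshFilter>().sharedMesh.vertices;
        int n = localVerts.Length;

        // Persistent allocations live until Dispose is called explicitly.
        localVertsNA = new NativeArray<float3>(n, Allocator.Persistent);
        worldVertsNA = new NativeArray<float3>(n, Allocator.Persistent);
        heightsNA    = new NativeArray<float>(n, Allocator.Persistent);

        // Fill the input once; Vector3 converts implicitly to float3.
        for (int i = 0; i < n; i++)
            localVertsNA[i] = localVerts[i];
    }

    void OnDestroy()
    {
        // IsCreated guards against Start never having run.
        if (localVertsNA.IsCreated) localVertsNA.Dispose();
        if (worldVertsNA.IsCreated) worldVertsNA.Dispose();
        if (heightsNA.IsCreated)    heightsNA.Dispose();
    }
}
```

Allocating once keeps FixedUpdate free of allocations, at the price of having to remember the Dispose calls.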

The numbers

Before: around 50 fps with adaptive clipping on, refineSamples=4, on the bunny. After: 300–500 fps, same scene, same settings. That's a 6–10× wall-clock improvement on the same code path.

That ratio is roughly what you'd expect from Burst alone (5×–10× on this shape of code, mostly from SIMD packing of trig and from skipping the managed call overhead) plus a small multi-core kicker from IJobParallelFor.

What's still managed

One thing didn't get the Burst treatment yet: the adaptive clipper's interior SampleHeight calls. I figured this was enough. Burst was the entire performance story this session. One job, one attribute, one schedule call. The lesson: profile first, optimise the hot loop, leave everything else alone.