My concern is about performance. Here are the two extremes:
A- 2048 samples over the entire image
- pro: static tiling, arbitrary breakdown into separate independent tasks
- con: waste of samples on “easy” pixels
B- per-pixel “noise level” detection driving a per-pixel sample count
- pro: “nice image” (to be proven)
- con: pixel-wise tasks, not suitable for massive parallelism
While I think A’s cons are obvious, let me expand on B.
When thinking about a GPU with 1000 threads running in parallel, it doesn’t make sense to stop 800 threads and let just the 200 “bad pixel” threads run. The 800 threads might as well keep running and sample the easy pixels.
So trading wasted easy-pixel samples against wasted GPU waiting cycles - I wouldn’t recommend it.
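To make that concrete, here is a minimal CUDA sketch (toy code, not Cycles: `Pixel`, `sample_pixel()` and `error_estimate()` are made-up placeholders). A per-pixel early-out only frees hardware when a whole warp, and in practice a whole block, has converged; otherwise the “stopped” lanes simply idle while their neighbours keep tracing.

```cuda
#include <cuda_runtime.h>

// Toy types and helpers, not Cycles code.
struct Pixel {
  float sum;     // sum of sample values
  float sum_sq;  // sum of squared sample values
  int count;     // samples taken so far
};

__device__ float error_estimate(const Pixel &p)
{
  if (p.count < 2)
    return 1e30f;  // not enough data yet, treat as "noisy"
  float mean = p.sum / p.count;
  float var = p.sum_sq / p.count - mean * mean;
  return var / p.count;  // variance of the mean
}

// Stand-in for one path-traced sample: a cheap hash in [0, 1).
__device__ float sample_pixel(int x, int y, int s)
{
  unsigned int h = 73856093u * x ^ 19349663u * y ^ 83492791u * (unsigned int)s;
  return (h & 0xFFFFu) / 65536.0f;
}

__global__ void sample_pass(Pixel *pixels, int width, int height,
                            int samples_this_pass, float error_threshold)
{
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x >= width || y >= height)
    return;

  Pixel &p = pixels[y * width + x];

  // Early-out on a converged pixel frees hardware only if the whole
  // warp is converged; a single noisy pixel keeps its 31 warp-mates
  // busy doing nothing, so this saves less than it seems.
  if (error_estimate(p) < error_threshold)
    return;

  for (int s = 0; s < samples_this_pass; s++) {
    float v = sample_pixel(x, y, p.count + s);
    p.sum += v;
    p.sum_sq += v * v;
  }
  p.count += samples_this_pass;
}
```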
Talking about CPU rendering - totally different story. No massive parallelism, purely individual per-pixel treatment.
But IMO I would not concentrate too much on it. GPUs have become too cheap to ignore.
Given enough time, I totally agree with your approach:
1- (full) pixel-wise noise evaluation (observed difference / variance)
2- (pixel-wise) smoothing
3- pixel-wise sampling
Steps 1 and 2 will eat time, since they go through all pixels. And then step 3 adds overhead on GPU devices.
Not to mention the synchronized waiting for all tiles to finish, which adaptivity already introduces.
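As an illustration of step 2, a smoothing pass over the per-pixel error map might look roughly like this (a minimal sketch, assuming the error map from step 1 already lives in `err_in`; the 3x3 box filter is my own choice, not the actual implementation). Note that every neighbour read can cross a tile border, which is exactly the memory-access worry listed further down.

```cuda
// Sketch of step 2: smooth the per-pixel error map with a 3x3 box filter.
__global__ void smooth_error(const float *err_in, float *err_out,
                             int width, int height)
{
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x >= width || y >= height)
    return;

  float sum = 0.0f;
  int n = 0;
  for (int dy = -1; dy <= 1; dy++) {
    for (int dx = -1; dx <= 1; dx++) {
      int nx = x + dx, ny = y + dy;
      if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
        sum += err_in[ny * width + nx];  // neighbour reads can cross tile borders
        n++;
      }
    }
  }
  err_out[y * width + x] = sum / n;
}
```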
In the above blend I get only about a 25% speedup using that very simplistic tile nopping. TBH I expected more, considering the upper third of the image is just plain black.
Plus the “reference image” was calculated with my stuff disabled but still compiled in, so I assume a totally clean build will run faster, making the real speedup even smaller.
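For reference, the kind of tile-wise nopping I mean is roughly this (a sketch, reusing the toy `Pixel`/`sample_pixel` from above; `TILE_SIZE` and `tile_error` are assumptions, not the actual code): one block per tile, and a converged tile returns immediately as a no-op.

```cuda
#define TILE_SIZE 16

// Launch with dim3(TILE_SIZE, TILE_SIZE) threads per block, one block per tile.
__global__ void sample_tiles(Pixel *pixels, const float *tile_error,
                             int width, int height, int tiles_x,
                             int samples_this_pass, float error_threshold)
{
  int tile = blockIdx.y * tiles_x + blockIdx.x;

  // Whole tile converged -> whole block is a nop, freeing the SM for
  // other tiles. Coarse, but divergence-free.
  if (tile_error[tile] < error_threshold)
    return;

  int x = blockIdx.x * TILE_SIZE + threadIdx.x;
  int y = blockIdx.y * TILE_SIZE + threadIdx.y;
  if (x >= width || y >= height)
    return;

  Pixel &p = pixels[y * width + x];
  for (int s = 0; s < samples_this_pass; s++) {
    float v = sample_pixel(x, y, p.count + s);
    p.sum += v;
    p.sum_sq += v * v;
  }
  p.count += samples_this_pass;
}
```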
Problems I see with GPUs:
- full pixel-wise error estimation takes time (maybe run it on the CPU in parallel? or consider rendering on the CPU instead)
- the smoothing step (memory-access penalties from reading adjacent-pixel memory; what about tile boundaries?)
- pixel-wise sampling via x/y redirection: memory-access race conditions (see the sketch after this list)
- sample-count balancing when redirecting (one pixel not evaluated means another pixel is evaluated twice, to keep the overall count equal)
- copying GPU memory to the CPU for full-image evaluation (versus transferring just the RGBA image)
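Regarding the redirection and race-condition points, a sketch of how it might have to look (again reusing the toy `Pixel`; the `redirect[]` buffer and its construction are assumptions for illustration): once several threads can target the same pixel, the accumulation has to go through atomics, and the per-pixel sample count must be tracked explicitly so the overall budget stays balanced.

```cuda
// Each thread samples the pixel the redirect table points it at, so
// easy pixels can donate their budget to hard ones.
__global__ void sample_redirected(Pixel *pixels, const int *redirect,
                                  int num_threads, int width,
                                  int samples_this_pass)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= num_threads)
    return;

  int target = redirect[tid];  // pixel index this thread works on
  int x = target % width;
  int y = target / width;
  Pixel &p = pixels[target];

  float sum = 0.0f, sum_sq = 0.0f;
  for (int s = 0; s < samples_this_pass; s++) {
    float v = sample_pixel(x, y, tid * samples_this_pass + s);
    sum += v;
    sum_sq += v * v;
  }

  // Several threads may share 'target', so the writes must be atomic;
  // the per-pixel count is what keeps the overall sample budget honest
  // (one pixel skipped -> another done twice).
  atomicAdd(&p.sum, sum);
  atomicAdd(&p.sum_sq, sum_sq);
  atomicAdd(&p.count, samples_this_pass);
}
```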
So what drives me? Interactive rendering. We needed something that quickly produces nice images while the user tweaks the camera. After a few samples the environment looks “ok”, but the hard pixels still look very bad. So we redirected the samples to the hard parts - good for the eye, but bad for overall performance (interrupt rendering, copy memory from the GPU, evaluate the error, copy memory back to the GPU). It turns out a slight speedup remains with nopping, using short-cuts: partial tile evaluation, tile-wise nopping, tile-wise sampling.
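In code, the round trip I am complaining about looks roughly like this on the host side (a sketch only, assuming the `Pixel` struct from the first sketch; `user_moved_camera()`, `launch_sample_pass()` and `evaluate_tile_error()` are hypothetical placeholders, declared only to show the structure of the loop).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical placeholders standing in for the real machinery.
bool user_moved_camera();
void launch_sample_pass(Pixel *d_pixels, const float *d_tile_error,
                        int width, int height, float error_threshold);
void evaluate_tile_error(const Pixel *pixels, float *tile_error,
                         int width, int height, int num_tiles);

void interactive_render_loop(Pixel *d_pixels, float *d_tile_error,
                             int width, int height, int num_tiles,
                             float error_threshold)
{
  std::vector<Pixel> h_pixels(width * height);
  std::vector<float> h_tile_error(num_tiles);

  while (!user_moved_camera()) {
    // 1. short GPU sampling pass (tile-wise nopping happens inside the kernel)
    launch_sample_pass(d_pixels, d_tile_error, width, height, error_threshold);

    // 2. interrupt rendering and copy GPU -> CPU (the expensive round trip)
    cudaMemcpy(h_pixels.data(), d_pixels,
               h_pixels.size() * sizeof(Pixel), cudaMemcpyDeviceToHost);

    // 3. evaluate the per-tile error on the CPU
    evaluate_tile_error(h_pixels.data(), h_tile_error.data(),
                        width, height, num_tiles);

    // 4. copy the error map back so the next pass can nop converged tiles
    cudaMemcpy(d_tile_error, h_tile_error.data(),
               h_tile_error.size() * sizeof(float), cudaMemcpyHostToDevice);
  }
}
```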
I fear that if we put more intelligence into the adaptivity, it will come with little gain but a lot of penalty (the ‘full evaluation’ step alone is expensive).
So my hope rests on Lukas. If he makes Metropolis and adaptive sampling “tile-wise”, then this would fit together nicely.
And it still needs to be proven that all of this yields a speed-up while keeping the same image quality.
And then comes cinema with 10k resolution - you cannot keep the entire map on one GPU.
And then comes splitting the image up onto several GPUs.
Essentially that generates new “tile borders”.
…so we need to make sure our current efforts stay usable in the future.