Hm, I haven’t fully read the paper yet, but so far I fail to see the new contribution. It seems like the author just describes BPT and then presents his timing results.
Generally, BiDir on GPUs is itself not new. The current reference is probably http://cgg.mff.cuni.cz/~jaroslav/papers/2014-gpult/2014-gpult-paper.pdf (which even includes VCM, again a lot more advanced than BPT), but that stuff has been done for quite a few years now.
As BeerBaron said, the main problem is the memory required for subpath storage, but that can easily be traded off against speed (see Table II in the linked paper). Also, this storage of intermediate path data already applies to the current AMD kernel-split approach AFAIK, since it has to store all path states between kernel calls.
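To give a feel for the memory/speed trade-off, here is a back-of-the-envelope sketch of subpath storage cost. All numbers are illustrative assumptions (not measurements from the paper or from Cycles): path count, subpath length, and bytes per stored vertex would all depend on the actual implementation.

```python
# Rough estimate of GPU subpath-vertex storage for BPT.
# Every number below is an assumption for illustration only.

def subpath_storage_mb(paths_in_flight, max_vertices, bytes_per_vertex):
    """Memory needed to keep every subpath vertex resident, in MiB."""
    return paths_in_flight * max_vertices * bytes_per_vertex / (1024 ** 2)

# Assume 2^20 paths in flight, light subpaths up to 8 vertices,
# ~64 bytes per stored vertex (position, normal, throughput, BSDF data):
mb = subpath_storage_mb(2**20, 8, 64)
print(f"{mb:.0f} MiB")  # 512 MiB
```

Halving the number of paths in flight (or capping subpath length) halves this figure at some cost in throughput, which is exactly the kind of trade-off Table II of the linked paper quantifies.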
The way I see it (I don’t have much insight into production workflows, so correct me if I’m wrong), the problem with BPT isn’t speed (it’s quite fast with some tricks), the number of parameters (that’s one of its advantages, shared with pure path tracing, over methods like photon mapping), or implementing it on the GPU (some of the advanced Cycles features look quite a bit more complex than a simple GPU BPT).
The main reason for the huge gap between research and production systems (I mean, come on, BPT is 22 years old) is that the more advanced an algorithm gets, the more it relies on physical laws and principles. However, these are usually seen as a limit on creative use, so production systems try to overcome this limit with tons of tricks like NPR (the name says it all), ray visibility, the light path node, layer control, etc.
For path tracing, this is fine: PT is extremely robust and handles these tricks pretty well. BPT, however, is fundamentally based on Helmholtz reciprocity, the principle that a light path can be reversed without changing its contribution. This is what allows it to build paths from both the camera and the light, because in the end the result will be the same. It stops being the same once you use the flexibility that Cycles gives you. The result is weird artifacts, for example when subpaths from the light hit an object that the camera rays didn’t hit (say the path is “Light <-> Glossy <-> Object <-> Diffuse <-> Camera” and the object has visibility for diffuse rays disabled). Another example is shaders that don’t obey energy conservation (aka “reflection higher than 100%”): PT usually gets along with this stuff, but some more modern algorithms just go berserk on your image in these cases.
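The visibility example can be sketched in a few lines. This is a toy model, not actual Cycles code: `visible` is a hypothetical per-ray-type visibility check standing in for the real feature. The point is that the same geometric segment is classified by the type of the surface it *leaves*, so the two tracing directions disagree about whether the object exists.

```python
# Toy sketch: per-ray-type visibility breaks the symmetry BPT relies on.
# 'visible' is a hypothetical stand-in for a renderer's visibility flags.

def visible(obj, ray_type):
    # The artist disabled this object for diffuse rays only.
    return not (obj == "object" and ray_type == "diffuse")

# Path: Light <-> Glossy <-> Object <-> Diffuse <-> Camera.
# Traced from the camera, the segment toward the object leaves the
# diffuse surface, so it is a diffuse ray and the object is skipped:
camera_side_hit = visible("object", "diffuse")

# Traced from the light, the same segment leaves the glossy surface,
# so it is a glossy ray and the object IS hit:
light_side_hit = visible("object", "glossy")

print(camera_side_hit, light_side_hit)  # False True
```

The two directions evaluate the same path to different geometry, so the bidirectional estimator combines contributions that no longer agree, which shows up as artifacts.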
The way I see it, that’s the main reason why renderers that focus on design and architecture (like V-Ray or Indigo) are moving in the BPT direction nowadays, while renderers that focus on VFX and movies (like Arnold or appleseed), where you need these unphysical tricks far more, tend to stay with unidirectional methods. Cycles, according to its design documents, is focused on movie production, which would explain why BPT was and is no big target.
So, as a TL;DR: adding BPT to Cycles, even on GPUs, should be possible today. However, it would probably conflict with many features that Cycles has, which would IMO raise the question of why both PT and BPT were stuffed into the same engine if the integration was done badly. So it’s mostly design questions, not technical difficulties, limiting Cycles+BPT. A nice integration is most likely possible, but definitely not easy.
Again, this is of course my personal opinion and I’m not really an expert in the rendering industry, so I might be very wrong with my conclusions.