For the script generation, has anyone tried analyzing the "motion vectors" that are embedded in most video formats?
In some scenes, you can clearly see the motion vectors around the guy's dick going "Up, Up, Up, Down, Down, Down, Up, Up, ...".
In theory, you could get perfectly synchronized motion for those scenes just by letting the user select an area and having the application average the motion vectors inside it. With more analysis, the application might also be able to detect an 'insertion ratio' (e.g. only the top 60% of the area had movement).
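To make the idea a bit more concrete, here is a rough Python sketch of the averaging step, not a working tool. Everything in it is an assumption of mine: the `MotionVector` record, the function names, the 0..100 position scale, and the band count and threshold used for the 'insertion ratio'. It also assumes the vectors were already extracted per frame, and that a negative `dy` means "up on screen". Real codec vectors point at the reference frame, so the sign may need flipping.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class MotionVector:
    """Hypothetical record for one decoded motion vector, in pixels."""
    x: float   # block centre, horizontal
    y: float   # block centre, vertical (grows downward, like image coordinates)
    dx: float  # horizontal displacement
    dy: float  # vertical displacement (assumed: negative = up on screen)


Rect = Tuple[float, float, float, float]  # left, top, right, bottom


def _inside(v: MotionVector, area: Rect) -> bool:
    left, top, right, bottom = area
    return left <= v.x < right and top <= v.y < bottom


def average_dy(vectors: Sequence[MotionVector], area: Rect) -> float:
    """Mean vertical displacement of the vectors inside the selected area."""
    picked = [v.dy for v in vectors if _inside(v, area)]
    return sum(picked) / len(picked) if picked else 0.0


def positions_from_dy(per_frame_dy: Sequence[float]) -> List[float]:
    """Integrate per-frame vertical motion and rescale to 0..100 (100 = top)."""
    height, track = 0.0, []
    for dy in per_frame_dy:
        height -= dy  # minus sign: "up" is a negative dy in this sketch
        track.append(height)
    if not track:
        return []
    lo, hi = min(track), max(track)
    if hi == lo:
        return [50.0] * len(track)  # no movement at all: stay in the middle
    return [100.0 * (h - lo) / (hi - lo) for h in track]


def moving_fraction(vectors: Sequence[MotionVector], area: Rect,
                    bands: int = 10, threshold: float = 0.5) -> float:
    """Fraction of horizontal bands of the area whose mean |dy| exceeds threshold."""
    top, bottom = area[1], area[3]
    band_height = (bottom - top) / bands
    sums, counts = [0.0] * bands, [0] * bands
    for v in vectors:
        if _inside(v, area):
            i = min(int((v.y - top) / band_height), bands - 1)
            sums[i] += abs(v.dy)
            counts[i] += 1
    active = sum(1 for s, c in zip(sums, counts) if c and s / c > threshold)
    return active / bands
```

You would call `average_dy` once per frame with the user's rectangle, feed the resulting list to `positions_from_dy`, and use `moving_fraction` to scale the stroke length. The raw output would probably need smoothing before it is usable as a script.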
The best part is that, for VR videos, the point of view doesn't change often, so you could get a 'rough draft' for a whole video just by selecting an area for each position.
It might take some work to extract the motion vectors (ffmpeg libraries could be used) but I think it would be worth it. It would simplify the script generation.
Anyway, just an idea.
Command line to see it in action (keyboard shortcuts: 'right arrow' = skip ahead a few seconds, 's' = step frame by frame):
"...\ffmpeg\bin\ffplay" -flags2 +export_mvs "...\christoymack1gb.mp4" -vf codecview=mv=pf+bf+bb -ss 04:04