One of the fun and challenging parts of working in custom app development is bug hunting. These failures, called "bugs," happen when a program doesn't behave as expected. Over the past three weeks there has been a string of inconsistent failures.
It seemed to affect one user in particular, but not all of their streams. This is a programmer's nightmare scenario: an action that fails only some of the time. Making things even worse, simply re-running the script to generate the stream recap video would succeed.
In these cases logs are your best asset for sorting out the problem. As a reminder, m3u8 files are playlists of ten-second video clips, and this is how Twitch stores stream VODs. These have been the source of another, separate problem.
Pulling more and more of the pieces together now. Twitch changed the way they're doing .m3u8 files. This change includes as many as 5x duplicates of the pieces in it which is why I was seeing HUGE file sizes in the beginning.
— Make Echoes (@MakeEchoes) September 2, 2020
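Filtering out those duplicate segments before downloading is one way to cope with playlists like that. Here's a minimal sketch in Python; the parsing is deliberately simplified (real HLS playlists carry many more tags), and the function name is my own:

```python
def unique_segments(playlist_text):
    """Return segment URIs from an m3u8 playlist, dropping duplicates
    while preserving first-seen order."""
    seen = set()
    segments = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are tags/comments; anything else is a segment URI.
        if not line or line.startswith("#"):
            continue
        if line not in seen:
            seen.add(line)
            segments.append(line)
    return segments
```

With a playlist that repeats a clip five times, this still yields each clip once, which keeps the reconstructed file at its expected size.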
It looked like a few frames were missing from some of the last clips when the video was being reconstructed. Given that, paired with the fact that re-running the stream capture would work, I have to pin this one on Twitch. Specifically, I believe the full-resolution video hadn't finished being stored on Twitch's VOD storage servers.
So how does a programmer solve a problem like this? You build an automated kicking machine.
When the video construction fails, I put the worker performing that task in a timeout for 10 minutes. After the timeout finishes, that worker does three tasks. First, it removes the previously collected local data for that stream. Next, it attempts the download again. Finally, it runs the reconstruction again. If this second build fails, the system stops trying and I'll manually kick it the next time I check the back end of the system.
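The retry flow above can be sketched roughly like this. The helper names (`clear_local_data`, `download_stream`, `rebuild_video`) are hypothetical stand-ins for the real jobs, and the actual system schedules the delay through its worker queue rather than sleeping in place:

```python
import time

def build_with_retry(stream_id, clear_local_data, download_stream,
                     rebuild_video, timeout_seconds=600):
    """Try the build once; on failure, wait out the timeout, wipe local
    data, re-download, and rebuild once more. Returns True on success,
    False if the second attempt also fails (manual kick needed)."""
    if rebuild_video(stream_id):
        return True
    time.sleep(timeout_seconds)      # the 10-minute worker timeout
    clear_local_data(stream_id)      # 1. remove previously collected data
    download_stream(stream_id)       # 2. attempt the download again
    return rebuild_video(stream_id)  # 3. run the reconstruction again
```

Capping it at one automatic retry keeps a genuinely broken stream from hammering Twitch's servers in a loop.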
While this method doubles my inbound bandwidth on these streams, I'm nowhere near my host's allotment. As it only affects zero to two streams each day, it's an acceptable trade of resources for results.
I deployed this solution earlier this morning. Here's hoping that those affected will begin to see quick response times again.
Have a great week. I'm hoping to get a couple of new features built and deployed next week.