At what point does that vTuber realize they're literally just the first iteration of it? A human using an AI-animated "skin", then the AI gets better, then the AI starts not needing a human to do its talking.
Assuming that vTuber isn't already actually an AI. I will admit I couldn't stand the voice long enough to listen to the end of the video.
Compared to Gemini, where I just handed it a pic of a friend's cat and said "Give me a gothic style portrait of a vampire from the late 1800s but use this cat's head for it" and got this on the first try: