It's very impressive how the cadence and tone blend between words ("not you" crossing into "naw-chew", etc.). There are a few subtle inhuman inflections around long pauses, but it's a big step forward.
Between this, text generation, image/video synthesis, and 3D models, I bet we're less than 20...