I can't say this has been my experience, but it may be due to my workflow.
I follow a structured process: I first have the model fetch the ADO story and write it to `story.md`. The story already includes Given-When-Then scenarios, which provide a solid foundation. It then reads the story, investigates the relevant code, and documents its findings in `investigation.md`. I review this investigation to confirm it is on the right track. Next, I ask it to create a multi-step implementation plan with human-verifiable outcomes, which it writes to `plan.md`. Only then do I have it begin executing the plan step-by-step, reviewing each step myself. This is not a massive "go do it all and I hope it’s right at the end" approach.
This method keeps the model firmly anchored to the task and prevents it from wandering off course. I am not certain how efficient the process is overall, but I suspect I could use a less capable model for the actual implementation work, since the plan already spells out exactly what needs to be done.
The key benefit is that I avoid working continuously within the same context window or conversation history. By writing everything to Markdown files, I can easily resume with a completely fresh session later.
This is just something I personally came up with. No clue if I’m "doing it wrong."