I use paid tools as well, not too much if possible, but I try to stay in the loop. Anyway, they fail miserably at anything slightly complex. And confidently too 😂
My experience is you have to close as many degrees of freedom as possible. It's tedious as hell, but that's what it takes to get quality code out of them.
It's great at debugging if you require it to manage its context window by delegating tasks to scoped subagents, generate evidence with references, and verify that evidence with a minimal reproducible example. Expensive… I’ve seen them run for a solid 30 minutes before responding (not including the “thinking” log), but it usually finds the issue.
A similar technique can be used for code generation, but again it burns tokens and takes a while. Have it generate and verify isolated reference implementations for anything nontrivial. Those are much easier to review with the rest of your domain and the layered-on complexity stripped out. The “thinking” log is interesting to watch as it bangs its head against bad assumptions or documentation and has to start digging into dependency source code to work it out.
Only then apply the reference implementation to your project. It does take breaking the tasks down into small enough units, though, and closing those degrees of freedom.
Anecdote on degrees of freedom: This one didn’t require a reference implementation in particular. I was reviewing a PR (LLM-assisted, I wasn’t the authoring dev) to add signature validation to OAuth tokens. It duplicated the entire header/token parsing logic. It needed that path closed with a pointer to where the existing logic was and explicit requirements to enhance it. The refactor looked great on review and the PR shrank by more than half.
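To make the shape of that refactor concrete, here's a rough sketch and not the actual PR. Everything in it is an assumption for illustration: the names `parse_token`, `validate_token` and `sign_token` are made up, and I'm assuming a JWT-style token signed with HS256 and a shared secret, which may not match the real setup. The point is only that validation calls the one existing parser instead of re-splitting and re-decoding the token itself.

```python
import base64
import hashlib
import hmac
import json


def _b64url_decode(segment: str) -> bytes:
    # base64url segments in tokens drop their padding; restore it first.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def parse_token(token: str) -> tuple:
    """Hypothetical stand-in for the parser that already exists in the codebase."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        raise ValueError("token must have three dot-separated segments") from None
    header = json.loads(_b64url_decode(header_b64))
    payload = json.loads(_b64url_decode(payload_b64))
    # The signature covers the raw encoded segments, so expose them once here.
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    return header, payload, signing_input, _b64url_decode(signature_b64)


def validate_token(token: str, secret: bytes) -> dict:
    """The enhancement: reuse parse_token, add only the signature check."""
    header, payload, signing_input, signature = parse_token(token)
    if header.get("alg") != "HS256":  # assumption: HS256 only in this sketch
        raise ValueError("unsupported algorithm")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        raise ValueError("signature mismatch")
    return payload


def sign_token(payload: dict, secret: bytes) -> str:
    """Helper for trying the sketch out; not part of the anecdote."""
    header_b64 = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload_b64 = _b64url_encode(json.dumps(payload).encode())
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    signature = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return f"{header_b64}.{payload_b64}.{_b64url_encode(signature)}"
```

In prompt terms, "closing the path" meant something like: parsing lives in `parse_token`, do not add a second parser, extend that one if you need more out of it. With that stated up front, the only new code is the few lines in `validate_token`.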