Where to Use AI for Development: A Practical Guide for Engineers and Managers

There’s a lot of noise out there about AI generating code for you, creating tests, filing bugs, and possibly even conducting your sprint planning meetings. It’s easy to get swept up in the hype. You’ve probably seen claims like “AI can build your app from a prompt” or “Ship twice as fast with half the team”. Sounds appealing, doesn’t it? The reality, as always, is more nuanced.

In this article, I want to puncture some of the more inflated expectations and instead offer a grounded view of where AI can help in software development today. What works, what doesn’t, and what might be worth trying if you squint hard enough. The lens I want to use is risk. Because fundamentally, that’s what software development is: a careful game of managing risk.

Software Development Is (Still) Complicated

People like to describe programming as a science. In truth, it’s a profoundly human activity, one filled with ambiguity, interpretation, negotiation, and compromise. It starts with a conversation (or a vague Jira ticket) and ends, if you’re lucky, with something that’s not only functional, but also secure, performant, diagnosable, and supportable. And ideally delivered on time.

That’s before you even get into the legal constraints (GDPR, HIPAA, PCI-DSS), integration points, performance targets, platform requirements, accessibility expectations, internationalisation, and budgetary or staffing limitations. In short, developing anything more complex than a simple CRUD app is a process fraught with moving parts, conflicting demands, and sharp edges. Every decision is a trade-off. Every feature carries risk.

Risk Is the Heart of Software Development

At its core, software development is about risk management. You’re constantly weighing one thing against another: speed versus quality, coverage versus confidence, complexity versus maintainability. You never have enough time, tests, or people. You never really know how the system will behave until it’s in production, and sometimes not even then. Even the best testing regimes can’t prove a system is perfect. All we can do is gather enough evidence to make a release feel safe. That’s why we have QA. That’s why we have CI/CD pipelines. That’s why we automate, measure, and monitor. Not because we expect zero defects, but because we want to minimise the chance of failure and recover quickly if we make a mistake.

So if you want to use AI in software development, the right question isn’t “Can it code?” It’s “Does it reduce my risk?”

Why AI Should Be Evaluated Like Any Other Risk Control

Once you reframe AI as a tool for managing development risk, everything else falls into place. You start to ask more useful questions:

  • Does this AI tool make my team more productive or distract them with shiny toys?
  • Can it reduce the number of defects that escape detection?
  • Will it catch more edge cases in tests?
  • Is it cost-effective to integrate?
  • How much human oversight does it need?
  • What new risks does it introduce?

Because yes, AI tools come with risks of their own. Some are subtle, such as poor code suggestions that slip past review. Others are more obvious, such as a test generator that fails to distinguish between correct and incorrect outputs, inadvertently encoding bugs into your regression suite. And then there’s the legal risk, especially if you’re using AI trained on unprovenanced or proprietary data sources.

The LLM Illusion

Most of the AI-powered dev tools you see today are built on Large Language Models (LLMs). These are powerful autocomplete machines. They’re trained on vast amounts of code and text, and they excel at predicting what might come next based on context. But they don’t understand the code they’re generating. They don’t reason about correctness. They don’t simulate execution paths or anticipate runtime behaviour. They mimic the patterns they’ve seen most often.

As an aside, an LLM’s ability is directly connected to the amount of training data it’s given. So in general, any LLM-based code tool will ‘know’ far more about older software versions and popular programming languages than about the latest language features or the more arcane developer tools. That makes LLMs adept at generating boilerplate, suggestions, renaming, scaffolding, and stubbing out code that appears plausible and coherent. However, it also renders them unreliable for tasks that are safety-critical, performance-sensitive, or compliance-driven unless paired with a robust human review process.

That doesn’t mean they’re useless. It just means you need to know their limits. And if you want more reliable AI tooling, you’ll need to look beyond LLMs.

Other Forms of AI: Beyond Guesswork

Other types of AI do more than guess. Reinforcement Learning (RL), for instance, operates by simulating millions of interactions and learning from the outcomes. In testing, RL can be used to learn which input combinations break a system, or which execution paths are under-tested. Symbolic execution engines and constraint solvers can also help, especially in security-critical code. These approaches are slower to develop and harder to scale across arbitrary codebases, but they’re also more rigorous. They can never determine that your application’s behaviour is what’s required, but they can accurately characterise its actual behaviour, which is useful when you’re trying to prevent (or at least discover) regressions. Some tools already use these techniques to generate high-precision test suites, especially for Java and other strongly typed languages. They’re not as flashy as Copilot, but they’re often more trustworthy.

Where AI Can Help: A Risk-Oriented View of the SDLC

LLMs are potent tools for addressing the human aspects of software development; they are much less dependable when generating code, tests, and similar technical artefacts. That’s not to say you shouldn’t use them for those tasks – but do so in a way where the generated output is quickly reviewed and can be discarded or amended with little impact. Let’s break it down by stage. Here are the most common places in the SDLC where AI can (with caveats) reduce risk or effort.

Requirement / Issue / Problem Summarising

What AI Does Well:
AI, especially Large Language Models (LLMs), can distil customer interviews, support tickets, Slack threads, and other messy input into structured user stories or feature themes. This works because LLMs are trained on massive amounts of natural language and are good at summarisation and thematic clustering. Whether you’re summarising requirements or hunting for similar issues, LLMs can help surface the core elements.

Why It’s Useful:
LLMs are ‘born’ to understand human text and detect patterns in it. Tools like OpenAI GPT-4, Anthropic Claude, or even lightweight open-source models like Mistral can extract real intent from transcripts. Solutions like Kraftful even automatically turn customer feedback into feature suggestions. Teams can reduce early-cycle churn and achieve a clearer backlog more quickly.

Risks to Watch:
LLMs can hallucinate or strip away subtle intent. They might group unrelated concerns or misinterpret ambiguous phrasing; they have even been known to invert the intent of a message. Human validation is non-negotiable: keep a close eye on the output until you’re confident it’s as accurate as you need it to be, and even then maintain a review process in which humans sample what the LLM produces. Using AI is not a fire-and-forget situation. You must keep humans in the loop to check the output and keep it on track.
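One lightweight way to put that sampling into practice is to route a fixed fraction of AI-generated artefacts to a human reviewer. Here is a minimal sketch in Python; the `select_for_review` helper, the 20% rate, and the story names are illustrative assumptions, not any particular tool’s API:

```python
import random

def select_for_review(summaries, sample_rate=0.2, seed=None):
    """Return the subset of AI-generated summaries a human should check.

    sample_rate is the fraction routed to manual review; tune it down
    as confidence in the tool grows, but never to zero.
    """
    rng = random.Random(seed)  # seeded for a reproducible sample
    return [s for s in summaries if rng.random() < sample_rate]

# Example: roughly 20% of a batch of generated user stories get a reviewer.
batch = [f"story-{i}" for i in range(100)]
to_review = select_for_review(batch, sample_rate=0.2, seed=42)
print(f"{len(to_review)} of {len(batch)} summaries flagged for review")
```

Start with a high sampling rate while the tool is new, and ratchet it down as it earns trust.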

How to Measure:

  • Time saved during backlog grooming 
  • Reduction in post-release feature rework or misaligned stories
  • Fewer clarification loops with stakeholders

Design

What AI Does Well:
Use AI as a guide to common practices. It can suggest design patterns, generate architectural diagrams, or draw inspiration from similar historical designs. Tools like GitHub Copilot, Cursor.sh, and Codeium can serve as brainstorming aids at this phase.

Why It’s Useful:
LLMs and vector search can identify past design solutions similar to your current problem. This helps teams explore more options earlier and avoid “defaulting” to the same old approach. 

Risks to Watch:
Treat AI suggestions as a source of inspiration, not a blueprint. The risk is that teams adopt AI output without scrutiny, mistaking suggestion for solution. Because LLMs are trained on public data, their output defaults to whatever most commonly appears to fit your request; the latest ideas, trends, and tools won’t score highly enough in the model’s deliberations unless you explicitly steer it towards them.

How to Measure:

  • Number of design options considered before finalising
  • Reduction in repetitive architectural mistakes
  • Time to generate and validate an initial design proposal

Implementation

What AI Does Well:
Boilerplate code, method stubs, common idioms: AI can generate these quickly and contextually. This is where tools like Tabnine, Amazon CodeWhisperer, and GitHub Copilot shine.

Why It’s Useful:
These tools are pattern matchers trained on large corpora of real code. They excel at filling in the blanks for common constructs, freeing developers to focus on unique logic.

Risks to Watch:
Generated code might appear correct but be subtly wrong, semantically or contextually. Developers must review all generated output as if it were written by a junior teammate who’s eager but not always accurate; LLMs do not understand code and do not reason about it, so even boilerplate needs review. Keep usage down to a manageable amount of generated code per step, and never generate technical artefacts (code, tests, deployment YAML, etc.) that you can’t review with a professional eye.
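To make “subtly incorrect” concrete, here is a hypothetical example of the kind of plausible-looking helper an assistant might produce. The function names are invented for illustration; the bug is a classic off-by-one that a quick glance can miss:

```python
# Illustrative only: a pagination helper that reads cleanly but silently
# drops the final partial page whenever the item count isn't an exact
# multiple of page_size.
def paginate_buggy(items, page_size):
    pages = []
    for i in range(len(items) // page_size):   # bug: floor division loses the remainder
        pages.append(items[i * page_size:(i + 1) * page_size])
    return pages

# The corrected version a reviewer should insist on: step through the
# list by page_size and let slicing handle the short final page.
def paginate(items, page_size):
    return [items[i:i + page_size] for i in range(0, len(items), page_size)]

print(paginate_buggy(list(range(7)), 3))  # [[0, 1, 2], [3, 4, 5]] - item 6 is lost
print(paginate(list(range(7)), 3))        # [[0, 1, 2], [3, 4, 5], [6]]
```

Both versions pass a casual read and a happy-path test with a round number of items; only a reviewer thinking about boundaries catches the difference.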

How to Measure:

  • Code throughput (e.g., lines per day/week)
  • Time saved on boilerplate tasks

Testing

What AI Does Well:

LLMs can scaffold basic unit tests, but deeper benefits come from using Symbolic Execution or Reinforcement Learning (RL) to explore paths and inputs. Diffblue Cover (symbolic), Sapienz by Meta (search-based), or EvoSuite (evolutionary algorithms) demonstrate this in action. 

Why It’s Useful:
These tools can discover edge cases, increase test coverage, and produce regression safety nets far beyond what most teams can write manually.  

Risks to Watch:
LLM-generated tests often include meaningless assertions (“assertEquals(true, true)”) or miss logical branches. RL/symbolic tools can require a learning curve and setup time. 
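The `assertEquals(true, true)` pattern is easy to spot once you know to look for it. Here is a small Python illustration of the difference between a vacuous generated assertion and ones worth keeping; `apply_discount` is a made-up function standing in for your real code under test:

```python
def apply_discount(price, percent):
    """Function under test: reduce price by percent, rounded to 2 dp."""
    return round(price * (1 - percent / 100), 2)

# The kind of vacuous assertion an LLM sometimes emits: it always
# passes, regardless of whether apply_discount works at all.
assert True == True

# What a reviewer should demand: assertions tied to concrete inputs,
# expected outputs, and boundary cases.
assert apply_discount(100.0, 25) == 75.0
assert apply_discount(100.0, 0) == 100.0   # boundary: no discount
assert apply_discount(0.0, 50) == 0.0      # boundary: free item
```

Mutation testing is a good way to catch the vacuous kind automatically: if mutating the implementation never fails the test, the test is asserting nothing.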

How to Measure:

  • Increase in branch or mutation coverage
  • Reduction in escaped defects
  • Lower manual test-writing overhead

Code Review & Static Analysis

What AI Does Well:
AI is very good at identifying low-hanging fruit, such as styling violations, unreachable code, poor naming conventions, or anti-patterns. GitHub’s CodeQL, SonarCloud, and DeepCode are popular options in this regard.

Why It’s Useful:
This kind of code-level pattern matching and rule enforcement is where traditional static analysis meets AI. It reduces the burden on human reviewers to nitpick syntax or style.

Risks to Watch:
AI won’t understand project-specific context or architectural concerns. It can’t (yet) assess trade-offs or intent.

How to Measure:

  • Time saved during PR reviews
  • Number of issues resolved before human review
  • Fatigue reduction for senior reviewers

Deployment

What AI Does Well:
AI can predict load patterns, identify misconfigurations, and simulate the effects of rollouts. Tools like Argo Rollouts and Harness are increasingly utilising AI to monitor canaries and blue-green deployments.

Why It’s Useful:
With enough historical observability data (logs, metrics), AI can detect anomalies before users feel the impact.

Risks to Watch:
Garbage in, garbage out. If your telemetry data is incomplete or your failure modes are novel, AI predictions may not be helpful.

How to Measure:

  • Reduction in rollback incidents
  • Faster time-to-diagnose failed deploys
  • Lower configuration error rate

Post-Release Monitoring

What AI Does Well:
AI can cluster logs, surface new anomalies, and correlate across systems. New Relic Lookout, Dynatrace Davis AI, and Datadog Watchdog all apply machine learning to this space.

Why It’s Useful:
You get faster detection of the unknown unknowns, especially when dealing with distributed systems or microservices where symptoms manifest in non-obvious ways.

Risks to Watch:
AI might flag noisy or irrelevant anomalies. It also can’t explain why something broke. You still need to have humans who understand observability and what’s being revealed.

How to Measure:

  • Improvements in MTTD (Mean Time To Detect) and MTTR (Mean Time To Resolve)
  • Reduction in incident triage effort
  • Higher signal-to-noise in alerting

So… Can AI Build the Whole App?

In theory, maybe someday. In practice, not yet. And certainly not safely. Generating an entire application from a prompt is mostly a gimmick. What you often get is demo-quality scaffolding with unscalable architecture, no test coverage, poor security hygiene, and numerous hardcoded assumptions. Yes, it might be helpful for quick MVPs or throwaway prototypes. But if you want something maintainable, testable, secure, and integrated into a real business context, human developers are still essential.

Practical, Not Magical

AI can help in software development. But it won’t do your development for you. Used well, AI tools can act like skilled interns or a 24-hour assistant: aiding your team, handling the boring bits, surfacing ideas, and making some tasks faster or safer. Used poorly, they create tech debt, propagate bugs, and distract developers from real work.

So if you’re going to bring AI into your SDLC, do it deliberately. Ask:

  • What risk does this tool reduce?
  • What new risks does it create?
  • How will we know it’s helping?

The top rules of thumb to follow for LLM-based tools are:

A: Use it as a timesaver. Don’t ask the tool to generate anything you couldn’t write yourself: if you don’t understand it, you can’t review or debug it.

B: Keep the amount of generated material per prompt low. One prompt used to create 1,000 lines of code or text will keep you busy applying corrections for a considerable time. One prompt used to generate 10 or 20 lines, possibly as a super-autocomplete in the IDE, is quicker to review and has less impact if you discard it.

You can read more of my thoughts about using AI dev tools practically here.

Final Thoughts

It’s essential to realise that the fundamental objective of using AI in the development process is to act as a helper that reduces other risks. Your decision to use or not use AI, and how you use it, is itself a key aspect of risk management. You wouldn’t trust a junior developer to architect a high-frequency trading solution, but you might trust them to design a test plan. Treat your use of AI the same way. Use AI tools as risk mitigators and productivity enablers, not as replacements for human beings, and certainly not as autonomous entities. If your AI tools don’t have human oversight and a way to adjust their behaviour, then it’s inevitable that things will go wrong, and it may take you a long time to discover and recover.

Final, final thought? Always, always, trust your senior engineers over your AI tools. Because AI might generate code, but only humans write software.
