Most production software is not a greenfield microservices architecture. It is a monolith — often years old, written in Java, Python, or C#, handling critical business logic that nobody wants to rewrite. Adding AI capabilities to these systems requires a pragmatic approach that works with the existing codebase rather than against it.
The Sidecar Pattern for AI Integration
The cleanest way to add LLM capabilities to a monolith is through a sidecar service. Instead of embedding AI libraries directly into your application, you deploy a lightweight API service alongside it that handles all communication with the language model. Your monolith makes HTTP calls to the sidecar, which manages prompt construction, API keys, rate limiting, and response parsing.
This separation keeps your monolith's dependency tree clean. AI libraries and their dependencies evolve rapidly: new model versions, updated SDKs, changing API contracts. By isolating them in a separate service, you can update the AI integration without redeploying the entire monolith. The sidecar can also be written in a language with stronger AI ecosystem support, such as Python, even if your monolith is in Java.
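As a rough illustration of the monolith's side of this arrangement, the sketch below makes a plain HTTP call to a sidecar and degrades gracefully when the sidecar is unavailable. It is written in Python for brevity. The URL, the /v1/summarize route, and the JSON field names are hypothetical, since the sidecar's API is whatever you design it to be.

```python
import json
import urllib.request
from typing import Callable, Optional

# Hypothetical sidecar address and route; replace with your own contract.
SIDECAR_URL = "http://127.0.0.1:8081/v1/summarize"

# A transport takes (url, body, timeout) and returns the raw response bytes.
Transport = Callable[[str, bytes, float], bytes]


def http_post(url: str, body: bytes, timeout: float) -> bytes:
    """Default transport: a plain JSON POST using only the standard library."""
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def summarize(text: str, transport: Transport = http_post,
              timeout: float = 15.0) -> Optional[str]:
    """Ask the sidecar for a summary; return None if the call fails.

    The monolith imports no AI libraries. Prompt construction, API keys,
    and rate limiting all live behind the HTTP boundary.
    """
    body = json.dumps({"text": text}).encode("utf-8")
    try:
        raw = transport(SIDECAR_URL, body, timeout)
        return json.loads(raw)["summary"]
    except (OSError, ValueError, KeyError, TypeError):
        # Network errors, timeouts, invalid JSON, or an unexpected shape.
        return None
```

Passing the transport as a parameter is only a convenience for testing the calling code without a running sidecar. The caller treats a None result as "AI feature unavailable" and continues with its normal flow.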
Handling Latency with Async Processing
LLM inference is slow compared to typical database queries. A simple text generation request can take 2 to 10 seconds depending on the model and output length. If your monolith handles this synchronously, each call holds a request thread for the full duration, and under load the thread pool fills up and performance degrades for all users.
The solution is asynchronous processing. When a user triggers an AI feature, the monolith publishes a message to a queue (RabbitMQ, SQS, or Redis Streams), returns immediately with a job ID, and the AI sidecar processes the request in the background. The frontend polls for completion or uses WebSockets to receive the result when it is ready.
This pattern also provides natural backpressure. If the AI service is overwhelmed, messages queue up rather than causing timeouts in the monolith. You can scale the number of AI workers independently based on queue depth, and failed requests can be retried automatically without involving the main application.
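The submit, process, and poll flow described above can be sketched with an in-memory queue. This is a stand-in only: in production the queue would be RabbitMQ, SQS, or Redis Streams, and the job table would live in a database or cache. The function names, status values, and retry count here are illustrative assumptions.

```python
import queue
import threading
import uuid

# In-memory stand-ins for a real message broker and a persistent job table.
work_queue = queue.Queue(maxsize=100)   # bounded, so overload is visible
jobs = {}                               # job_id -> {"status": ..., "result": ...}
jobs_lock = threading.Lock()


def submit(prompt: str) -> str:
    """Called by the monolith: enqueue the work and return a job ID at once."""
    job_id = uuid.uuid4().hex
    with jobs_lock:
        jobs[job_id] = {"status": "pending", "result": None}
    try:
        work_queue.put_nowait((job_id, prompt))
    except queue.Full:
        # Backpressure: reject instead of blocking a request thread.
        with jobs_lock:
            jobs[job_id]["status"] = "rejected"
    return job_id


def get_status(job_id: str) -> dict:
    """Called by the polling endpoint the frontend hits."""
    with jobs_lock:
        return dict(jobs[job_id])


def worker(generate, max_attempts: int = 3) -> None:
    """Runs in the AI sidecar: pull jobs, call the model, retry on failure."""
    while True:
        job_id, prompt = work_queue.get()
        try:
            outcome = {"status": "failed", "result": None}
            for _ in range(max_attempts):
                try:
                    outcome = {"status": "done", "result": generate(prompt)}
                    break
                except Exception:
                    continue  # a real worker would back off before retrying
            with jobs_lock:
                jobs[job_id] = outcome
        finally:
            work_queue.task_done()
```

A worker would be started with something like `threading.Thread(target=worker, args=(call_model,), daemon=True).start()`, where `call_model` is whatever function wraps your LLM call. Scaling out means starting more workers against the same queue.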
Prompt Engineering in Production
In a production monolith, prompts are not ad-hoc strings — they are part of your business logic. Treat them like database queries: version them, test them, and review them in code review. Store prompt templates separately from application code so they can be updated without a full deployment cycle.
Context injection is where most of the engineering work happens. Before sending a prompt to the model, you need to gather relevant data from your monolith's database — customer information, transaction history, product details. This context gets inserted into the prompt template along with the user's input. The quality of the AI response depends directly on the quality and relevance of this context.
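A minimal sketch of a versioned template with context injection might look like the following. The template name, version label, and the shape of the context dictionary are hypothetical. In a real system the context would come from queries against the monolith's database.

```python
from string import Template

# Templates are keyed by (name, version) so that a prompt change is an
# explicit, reviewable event. These names and fields are illustrative only.
PROMPT_TEMPLATES = {
    ("order_summary", "v2"): Template(
        "You are a support assistant.\n"
        "Customer: $customer_name (tier: $tier)\n"
        "Recent orders:\n$orders\n"
        "Question: $question\n"
    ),
}


def render_prompt(name: str, version: str, context: dict,
                  user_input: str, max_orders: int = 5) -> str:
    """Fill a template with data gathered from the monolith.

    Only the most relevant slice of context is included (here, the first
    max_orders orders) to keep the prompt focused and the token count low.
    """
    orders = context["orders"][:max_orders]
    order_lines = "\n".join(f"- {o['id']}: {o['status']}" for o in orders)
    template = PROMPT_TEMPLATES[(name, version)]
    # substitute() raises KeyError if a placeholder has no value, so a
    # template/context mismatch fails loudly instead of reaching the model.
    return template.substitute(
        customer_name=context["customer_name"],
        tier=context["tier"],
        orders=order_lines,
        question=user_input,
    )
```

Because `string.Template` only scans the template itself for placeholders, a `$` inside the user's input is inserted literally rather than interpreted.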
Always validate and sanitize model outputs before using them in business logic. LLMs can generate malformed JSON, hallucinate data, or produce responses outside expected boundaries. Build parsing logic that handles these cases gracefully, with fallback paths that do not break the user experience when the AI produces unexpected output.
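One way to apply this is to parse defensively and return a safe default when anything looks wrong. The sketch below assumes a hypothetical ticket-classification feature in which the model is asked for a JSON object with `category` and `confidence` fields. The field names and allowed values are illustrative.

```python
import json
from typing import Optional

# Hypothetical schema for a ticket-classification feature.
ALLOWED_CATEGORIES = {"billing", "shipping", "other"}
FALLBACK = {"category": "other", "confidence": 0.0}


def parse_classification(raw: str) -> Optional[dict]:
    """Return a validated result, or None if the output cannot be trusted."""
    # Models often wrap JSON in extra prose, so take the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    category = data.get("category")
    confidence = data.get("confidence")
    # Reject values outside the expected boundaries (e.g. invented labels).
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        return None
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return {"category": category, "confidence": float(confidence)}


def classify_or_fallback(raw: str) -> dict:
    """Business logic calls this and always receives a well-formed dict."""
    return parse_classification(raw) or FALLBACK
```

Returning only the validated fields, rather than the whole parsed object, also keeps any unexpected extra keys out of downstream code.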
Cost Control and Monitoring
API-based LLM usage is billed per token, and costs can escalate quickly if not monitored. Implement token counting before sending requests so you can estimate costs in advance. Set per-user and per-tenant rate limits to prevent runaway usage. Cache responses to common queries, keyed on the exact prompt and model, to avoid paying for the same inference twice.
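These three controls can be sketched in a few dozen lines. Everything numeric here is an assumption: the four-characters-per-token ratio is a rough heuristic (use your provider's tokenizer for accurate counts), and the prices are placeholders, not real rates. The class and function names are illustrative.

```python
import hashlib
import time

CHARS_PER_TOKEN = 4            # rough heuristic, not a real tokenizer
PRICE_PER_1K_INPUT = 0.001     # placeholder price, not a real rate
PRICE_PER_1K_OUTPUT = 0.002    # placeholder price, not a real rate


def estimate_tokens(text: str) -> int:
    return max(1, len(text) // CHARS_PER_TOKEN)


def estimate_cost(prompt: str, max_output_tokens: int) -> float:
    """Upper-bound cost estimate computed before the request is sent."""
    input_cost = estimate_tokens(prompt) * PRICE_PER_1K_INPUT
    output_cost = max_output_tokens * PRICE_PER_1K_OUTPUT
    return (input_cost + output_cost) / 1000


class TenantBudget:
    """Fixed-window token budget per tenant."""

    def __init__(self, tokens_per_window: int, window_seconds: float = 60.0,
                 clock=time.monotonic):
        self.limit = tokens_per_window
        self.window = window_seconds
        self._clock = clock
        self._windows = {}  # tenant -> (window_start, tokens_used)

    def try_consume(self, tenant: str, tokens: int) -> bool:
        now = self._clock()
        start, used = self._windows.get(tenant, (now, 0))
        if now - start >= self.window:
            start, used = now, 0  # a new window begins
        if used + tokens > self.limit:
            self._windows[tenant] = (start, used)
            return False
        self._windows[tenant] = (start, used + tokens)
        return True


class ResponseCache:
    """Exact-match cache keyed on model and prompt."""

    def __init__(self):
        self._store = {}

    def get_or_call(self, model: str, prompt: str, call):
        key = hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = call(prompt)  # pay for inference only once
        return self._store[key]
```

An exact-match cache only helps when prompts repeat verbatim, so it works best for features whose prompts do not embed per-user context.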
Log every LLM interaction: the prompt, the response, the latency, and the token count. This data is essential for debugging quality issues, optimizing prompts, and projecting costs. It also provides the training data you might need later if you decide to fine-tune a smaller, cheaper model to replace API calls for common use cases.
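A thin wrapper around the model call is one way to capture these four fields consistently. The sketch below writes one JSON line per interaction using the standard logging module. The logger name, field names, and the default token-count heuristic are assumptions, and `call` stands for whatever function actually invokes the model.

```python
import json
import logging
import time

logger = logging.getLogger("llm.audit")  # hypothetical logger name


def logged_call(call, prompt: str, *, model: str,
                count_tokens=lambda s: max(1, len(s) // 4)):
    """Invoke call(prompt) and log prompt, response, latency, and tokens.

    count_tokens defaults to a rough characters/4 heuristic; pass your
    provider's tokenizer for accurate numbers.
    """
    start = time.perf_counter()
    response, error = None, None
    try:
        response = call(prompt)
        return response
    except Exception as exc:
        error = repr(exc)
        raise
    finally:
        # Runs on success and on failure, so errors are logged too.
        logger.info(json.dumps({
            "model": model,
            "prompt": prompt,
            "response": response,
            "error": error,
            "latency_ms": round((time.perf_counter() - start) * 1000, 1),
            "prompt_tokens": count_tokens(prompt),
            "response_tokens": count_tokens(response) if response else 0,
        }))
```

Emitting one JSON object per line keeps the log easy to load into whatever analysis tooling you already use for cost projection or prompt comparison.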
Adding AI to a monolith does not require a rewrite. With the right patterns — sidecar services, async processing, structured prompt management, and cost controls — you can deliver powerful AI features while keeping your existing codebase stable and maintainable.