Trading Systems
Building Reliable Trading Automation: From Script to System
A practical architecture for trading bots that survive retries, stale data, partial fills, restarts, and exchange outages.
The first version of a trading bot is usually a loop: fetch a price, evaluate a condition, and send an order. That version is useful because it proves the API credentials, symbol mapping, and basic strategy logic. It is not yet a reliable trading system.
Production failures rarely come from the line that calculates an indicator. They come from duplicated orders after a retry, a partially filled order that the database records as complete, a process restart between two writes, or an exchange response that arrives after the local timeout. The strategy can be correct and the account can still end up with the wrong inventory.
This guide describes the controls I use when turning an automation script into a system that can be inspected, stopped, and recovered. The examples use TypeScript-like pseudocode, but the design applies equally to Python workers, serverless jobs, and long-running services.
Separate decision-making from execution
A strategy should produce an intent, not call the exchange directly. An intent describes what the strategy wants: buy a quantity, reduce exposure, cancel an order, or do nothing. A separate execution layer decides whether that intent is still valid and how to submit it safely.
This boundary makes testing much easier. Historical tests can evaluate intents without mocking an exchange SDK. Production execution can enforce account-level limits without knowing the details of every strategy. It also prevents a small strategy change from bypassing controls such as maximum notional exposure.
- Strategy input: normalized market data, current inventory, open orders, and configured limits.
- Strategy output: a deterministic intent with a reason and the market-data timestamp used.
- Execution input: the intent plus fresh account and exchange state.
- Execution output: a durable record of submission, acknowledgement, fills, and errors.
type TradeIntent =
  | { type: "buy"; symbol: string; quantity: number; reason: string }
  | { type: "sell"; symbol: string; quantity: number; reason: string }
  | { type: "hold"; reason: string };

function decide(state: StrategyState): TradeIntent {
  // Never act on stale market data.
  if (state.marketDataAgeMs > 5_000) {
    return { type: "hold", reason: "market data is stale" };
  }
  const intent = calculateSignal(state);
  // Block only intents that add exposure (assuming a long-only position),
  // so that a sell can still reduce a position that is at its limit.
  if (intent.type === "buy" && state.positionNotional >= state.maxPositionNotional) {
    return { type: "hold", reason: "position limit reached" };
  }
  return intent;
}

Make order submission idempotent
Network calls have an uncomfortable property: a timeout does not tell you whether the remote action happened. If the exchange accepted an order but the response was lost, blindly retrying can create a second order.
Every actionable intent should have a stable idempotency key derived from information that identifies the decision. Store that key before submission. On retry, look for an existing local execution and query the exchange by client order ID before creating anything new.
Do not generate the key from the current clock inside the retry loop. The same decision must produce the same key across retries and process restarts. Useful inputs are the strategy ID, symbol, decision interval (for example, the candle open time), side, and planned sequence number.
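The `hash` helper used in this section is not defined in the guide. As a minimal sketch, assuming a hypothetical `DecisionKey` shape and a dependency-free FNV-1a hash (chosen only to keep the example self-contained; a production system would more likely use a cryptographic hash), a deterministic key could be built like this:

```typescript
// Hypothetical shape of the inputs that identify one decision.
interface DecisionKey {
  strategyId: string;
  symbol: string;
  candleOpenTime: number; // start of the decision interval, epoch milliseconds
  side: "buy" | "sell";
  sequence: number; // planned order number within the interval
}

// 32-bit FNV-1a over a string, using shifts and adds for the prime multiplication.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = (h + ((h << 1) + (h << 4) + (h << 7) + (h << 8) + (h << 24))) >>> 0;
  }
  return ("0000000" + h.toString(16)).slice(-8);
}

// Fields are joined in a fixed order so the key never depends on object key order
// or on the time at which the retry happens.
function clientOrderIdFor(key: DecisionKey): string {
  const canonical = [
    key.strategyId,
    key.symbol,
    String(key.candleOpenTime),
    key.side,
    String(key.sequence),
  ].join("|");
  return key.strategyId + "-" + fnv1a(canonical);
}
```

Because the function reads nothing except its argument, a worker that restarts in the middle of a retry recomputes the same client order ID for the same decision.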
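// Derive the key from the decision itself, never from the current clock.
const clientOrderId = hash({
  strategyId,
  symbol,
  candleOpenTime,
  side,
  sequence,
});

// Persist the key before any network call so that a restart can find it.
const execution = await executions.reserve(clientOrderId, intent);
if (execution.exchangeOrderId) {
  return execution;
}

// An earlier attempt may have reached the exchange even though no response was stored.
const existing = await exchange.findByClientOrderId(clientOrderId);
if (existing) {
  return executions.attachExchangeOrder(clientOrderId, existing);
}

// First submission: record the exchange order so the next retry stops at the local check.
const submitted = await exchange.submit({ ...intent, clientOrderId });
return executions.attachExchangeOrder(clientOrderId, submitted);

Treat orders as state machines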
An order is not simply pending or complete. It may be created locally, submitted, acknowledged, partially filled, fully filled, cancelled, rejected, or in an unknown state after a timeout. Model those states explicitly and restrict which transitions are legal.
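One way to make those states and transitions explicit is a lookup table that the rest of the code must go through. The sketch below is illustrative: the state names and the specific set of legal transitions are assumptions, and a real exchange may require extra edges (for example, a fill that arrives before the acknowledgement).

```typescript
type OrderState =
  | "created"
  | "submitted"
  | "acknowledged"
  | "partially_filled"
  | "filled"
  | "cancelled"
  | "rejected"
  | "unknown";

// Assumed transition table. Terminal states have no outgoing transitions.
const legalTransitions: Record<OrderState, OrderState[]> = {
  created: ["submitted"],
  submitted: ["acknowledged", "rejected", "unknown"],
  acknowledged: ["partially_filled", "filled", "cancelled", "unknown"],
  partially_filled: ["partially_filled", "filled", "cancelled", "unknown"],
  // "unknown" is resolved only by reconciliation, which may land on any observed state.
  unknown: ["acknowledged", "partially_filled", "filled", "cancelled", "rejected"],
  filled: [],
  cancelled: [],
  rejected: [],
};

// Returns the new state, or throws if the move is not in the table.
function transition(from: OrderState, to: OrderState): OrderState {
  if (legalTransitions[from].indexOf(to) === -1) {
    throw new Error("illegal order transition: " + from + " -> " + to);
  }
  return to;
}
```

Throwing on an illegal move is a deliberate choice in this sketch: a late or duplicated exchange message then surfaces as an error to investigate instead of silently rewriting a terminal state.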
Partial fills deserve special attention. If a ten-unit order fills six units and is then cancelled, the position changed by six units. A system that records only the final cancelled status loses the economically important event.
Store fills as append-only events whenever possible. Derive filled quantity and average price from those events. This gives you an audit trail and makes reconciliation less destructive.
- Never overwrite an acknowledged exchange order with a generic retry error.
- Record exchange timestamps as well as local receipt timestamps.
- Keep requested quantity, filled quantity, and remaining quantity separate.
- Allow an explicit unknown state that triggers reconciliation instead of guessing.
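The points above can be sketched as a fill log plus a pure function that derives the order summary from it. The `Fill` and `OrderSummary` shapes are hypothetical, and plain `number` arithmetic is used only for brevity; real systems usually need a decimal or integer-based representation for quantities and prices.

```typescript
// One append-only fill event. Both timestamps are kept, as recommended above.
interface Fill {
  orderId: string;
  quantity: number;
  price: number;
  exchangeTimestamp: number; // reported by the exchange
  receivedAt: number; // local receipt time
}

interface OrderSummary {
  requestedQuantity: number;
  filledQuantity: number;
  remainingQuantity: number;
  averagePrice: number | null; // null until at least one fill exists
}

// Derives the summary from the events instead of storing it as mutable state.
function summarize(requestedQuantity: number, fills: Fill[]): OrderSummary {
  let filled = 0;
  let notional = 0;
  for (const fill of fills) {
    filled += fill.quantity;
    notional += fill.quantity * fill.price;
  }
  return {
    requestedQuantity: requestedQuantity,
    filledQuantity: filled,
    remainingQuantity: requestedQuantity - filled,
    averagePrice: filled > 0 ? notional / filled : null,
  };
}

// The ten-unit order from the text: six units fill in two events, then it is cancelled.
const partialFills: Fill[] = [
  { orderId: "o-1", quantity: 4, price: 100, exchangeTimestamp: 1000, receivedAt: 1005 },
  { orderId: "o-1", quantity: 2, price: 103, exchangeTimestamp: 1200, receivedAt: 1210 },
];
const cancelledOrderSummary = summarize(10, partialFills);
```

Here the cancelled order still reports six filled units and four remaining, so the position change survives even though the final status is cancelled.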
Reconcile local state with the exchange
The exchange is the source of truth for orders and balances, while your database is the source of truth for why an action was attempted. Reliable automation needs both. A reconciliation worker should periodically compare the exchange's open orders, recent fills, and balances against the local records.
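The comparison itself can be a pure function, which keeps it easy to test. The following is a minimal sketch under stated assumptions: `OrderView`, the report fields other than `unknownOrders` and `balanceMismatch`, and the single-balance `tolerance` parameter are all hypothetical, and "unknown orders" is taken here to mean orders the exchange reports that have no local record.

```typescript
// Hypothetical, simplified view of an order on either side.
interface OrderView {
  clientOrderId: string;
  filledQuantity: number;
}

interface ReconciliationReport {
  unknownOrders: string[]; // reported by the exchange, no local record
  missingOrders: string[]; // recorded locally, not reported by the exchange
  fillMismatches: string[]; // present on both sides with different filled quantity
  balanceMismatch: boolean;
}

function indexById(orders: OrderView[]): Record<string, OrderView | undefined> {
  const index: Record<string, OrderView | undefined> = {};
  for (const order of orders) {
    index[order.clientOrderId] = order;
  }
  return index;
}

// Compares both views and reports differences without trying to repair them.
function reconcile(
  local: OrderView[],
  remote: OrderView[],
  localBalance: number,
  exchangeBalance: number,
  tolerance: number
): ReconciliationReport {
  const localById = indexById(local);
  const remoteById = indexById(remote);
  const report: ReconciliationReport = {
    unknownOrders: [],
    missingOrders: [],
    fillMismatches: [],
    balanceMismatch: Math.abs(localBalance - exchangeBalance) > tolerance,
  };
  for (const order of remote) {
    const known = localById[order.clientOrderId];
    if (known === undefined) {
      report.unknownOrders.push(order.clientOrderId);
    } else if (known.filledQuantity !== order.filledQuantity) {
      report.fillMismatches.push(order.clientOrderId);
    }
  }
  for (const order of local) {
    if (remoteById[order.clientOrderId] === undefined) {
      report.missingOrders.push(order.clientOrderId);
    }
  }
  return report;
}
```

The function only describes the mismatch. Deciding whether to pause, alert, or repair is left to the caller.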
Reconciliation should also run at startup before strategies are allowed to trade. A process may have stopped after an exchange accepted an order but before the local database was updated. Starting the strategy immediately can compound the mismatch.
When a mismatch is found, prefer pausing the affected strategy and creating an operator-visible incident over automatically inventing a correction. Automatic repair is appropriate only when the expected state transition is unambiguous.
async function startStrategy(strategyId: string) {
  const report = await reconcileStrategy(strategyId);
  if (report.unknownOrders.length > 0 || report.balanceMismatch) {
    await strategies.pause(strategyId, "reconciliation required");
    await alerts.send(report);
    return;
  }
  await scheduler.enable(strategyId);
}

Put risk controls outside the strategy
A strategy-level position limit is useful, but account-level controls must live in the execution boundary. Otherwise two strategies can each remain below their own limit while exceeding the account limit together.
The execution layer should reject or resize intents that violate maximum order notional, total symbol exposure, account drawdown, order frequency, or stale-data rules. These checks should run again immediately before submission because account state may have changed since the strategy made its decision.
A kill switch should stop new orders without preventing reconciliation and cancellation. Shutting down the entire process can make the system less safe because it removes the workers needed to observe and unwind existing orders.
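A minimal sketch of that behaviour, assuming a hypothetical `KillSwitch` record that would be loaded from durable storage, is a gate that distinguishes the kind of action instead of stopping everything:

```typescript
// Actions the execution layer can take against the exchange (assumed names).
type Action = "submit" | "cancel" | "reconcile";

// In a real system this record would live in durable storage so that it
// stays engaged across deploys and restarts.
interface KillSwitch {
  engaged: boolean;
  reason: string;
}

// When engaged, only new submissions are blocked. Cancellation and
// reconciliation stay available so existing orders can be observed and unwound.
function isAllowed(action: Action, killSwitch: KillSwitch): boolean {
  if (!killSwitch.engaged) {
    return true;
  }
  return action !== "submit";
}
```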
- Maximum order and position notional by symbol.
- Maximum aggregate exposure across strategies.
- Daily loss or drawdown threshold.
- Maximum number of submissions per minute.
- Freshness limits for prices, balances, and strategy inputs.
- Manual pause that remains active after deploys and restarts.
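Several of these limits can be combined into one pre-submission check that returns accept, resize, or reject. The sketch below covers only a subset (freshness, order frequency, order notional, and symbol exposure) and assumes long-only exposure measured in notional terms. Every type and field name is hypothetical.

```typescript
interface RiskLimits {
  maxOrderNotional: number;
  maxSymbolExposure: number; // aggregate across strategies
  maxDataAgeMs: number;
  maxSubmissionsPerMinute: number;
}

interface RiskInput {
  side: "buy" | "sell";
  quantity: number;
  price: number;
  currentSymbolExposure: number; // fresh value, read just before submission
  dataAgeMs: number;
  submissionsLastMinute: number;
}

type RiskDecision =
  | { outcome: "accept"; quantity: number }
  | { outcome: "resize"; quantity: number; reason: string }
  | { outcome: "reject"; reason: string };

function checkRisk(input: RiskInput, limits: RiskLimits): RiskDecision {
  if (input.dataAgeMs > limits.maxDataAgeMs) {
    return { outcome: "reject", reason: "stale data" };
  }
  if (input.submissionsLastMinute >= limits.maxSubmissionsPerMinute) {
    return { outcome: "reject", reason: "order frequency limit reached" };
  }
  // Buys are also capped by remaining exposure headroom; sells reduce exposure
  // under the long-only assumption, so only the per-order cap applies to them.
  let allowedNotional = limits.maxOrderNotional;
  if (input.side === "buy") {
    const headroom = limits.maxSymbolExposure - input.currentSymbolExposure;
    allowedNotional = Math.min(allowedNotional, headroom);
  }
  if (allowedNotional <= 0) {
    return { outcome: "reject", reason: "symbol exposure limit reached" };
  }
  const requestedNotional = input.quantity * input.price;
  if (requestedNotional <= allowedNotional) {
    return { outcome: "accept", quantity: input.quantity };
  }
  return {
    outcome: "resize",
    quantity: allowedNotional / input.price,
    reason: "reduced to fit notional limits",
  };
}
```

Returning a reason with every resize or reject gives the execution record something concrete to store, which also helps answer why an order was skipped or changed.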
Design observability around decisions
Infrastructure metrics such as CPU and memory are not enough. The most useful operational question is: why did the system place, skip, resize, or cancel this order?
Attach a decision ID to the strategy evaluation, risk checks, execution record, exchange order, and fill events. Structured logs can then reconstruct one decision without searching by approximate timestamps.
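As a rough illustration, assuming a hypothetical `DecisionEvent` shape with one JSON object per log line, reconstructing a decision becomes a filter on the ID followed by a sort:

```typescript
// Stages that share one decision ID (assumed names).
type Stage = "strategy" | "risk" | "execution" | "exchange" | "fill";

interface DecisionEvent {
  decisionId: string;
  stage: Stage;
  message: string;
  at: number; // epoch milliseconds
}

// One JSON object per line so that log tooling can filter on decisionId.
function toLogLine(event: DecisionEvent): string {
  return JSON.stringify(event);
}

// Rebuilds the history of one decision from parsed events, oldest first.
// filter() returns a new array, so the input is not reordered.
function timeline(events: DecisionEvent[], decisionId: string): DecisionEvent[] {
  return events
    .filter((event) => event.decisionId === decisionId)
    .sort((a, b) => a.at - b.at);
}
```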
Alerts should describe an action an operator can take. “Worker failed” is less useful than “BTC strategy paused because exchange balance differs from local balance by 0.012 BTC; reconciliation incident 184 is open.”
- Decision latency from market-data timestamp to order acknowledgement.
- Count of skipped decisions by reason.
- Orders in unknown state and time spent in that state.
- Difference between expected and exchange-reported balances.
- Realized fees and slippage compared with strategy assumptions.
Test failures, not only signals
A backtest answers whether a strategy rule would have produced attractive historical decisions. It does not answer whether the execution service behaves correctly when the exchange times out.
Add tests that inject failures between durable steps: after reserving an execution but before submission, after submission but before storing the order ID, and after a partial fill but before processing the event. Restart the worker and verify that it converges to the correct state without duplicating the order.
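The first two crash points can be exercised deterministically with in-memory stand-ins. Everything in this sketch is hypothetical (`FakeExchange`, `ExecutionStore`, `submitOnce`, and the `CrashPoint` names), and it is synchronous only to keep the example short. A "restart" is modelled as a second call against the same durable store and the same exchange.

```typescript
interface ExchangeOrder {
  exchangeOrderId: string;
}

// Stand-in exchange that remembers orders by client order ID.
class FakeExchange {
  submitCount = 0;
  private orders: Record<string, ExchangeOrder | undefined> = {};

  submit(clientOrderId: string): ExchangeOrder {
    this.submitCount += 1;
    const order = { exchangeOrderId: "ex-" + this.submitCount };
    this.orders[clientOrderId] = order;
    return order;
  }

  findByClientOrderId(clientOrderId: string): ExchangeOrder | undefined {
    return this.orders[clientOrderId];
  }
}

// Stand-in for the durable execution table; it survives the simulated crash.
class ExecutionStore {
  private records: Record<string, { exchangeOrderId?: string } | undefined> = {};

  reserve(clientOrderId: string): { exchangeOrderId?: string } {
    let record = this.records[clientOrderId];
    if (record === undefined) {
      record = {};
      this.records[clientOrderId] = record;
    }
    return record;
  }

  attach(clientOrderId: string, exchangeOrderId: string): void {
    this.reserve(clientOrderId).exchangeOrderId = exchangeOrderId;
  }
}

type CrashPoint = "none" | "after_reserve" | "after_submit";

// Idempotent submission with a fault injected between durable steps.
function submitOnce(store: ExecutionStore, exchange: FakeExchange, id: string, crashAt: CrashPoint): string {
  const record = store.reserve(id);
  if (record.exchangeOrderId !== undefined) {
    return record.exchangeOrderId;
  }
  if (crashAt === "after_reserve") {
    throw new Error("simulated crash after reserve");
  }
  const existing = exchange.findByClientOrderId(id);
  const order = existing !== undefined ? existing : exchange.submit(id);
  if (crashAt === "after_submit") {
    throw new Error("simulated crash after submit");
  }
  store.attach(id, order.exchangeOrderId);
  return order.exchangeOrderId;
}

// Crash once at the given point, then "restart" and retry the same decision.
function runCrashScenario(crashAt: CrashPoint): { submitCount: number; storedOrderId: string | undefined } {
  const store = new ExecutionStore();
  const exchange = new FakeExchange();
  try {
    submitOnce(store, exchange, "order-1", crashAt);
  } catch (_err) {
    // The worker died here; the store and the exchange keep their state.
  }
  submitOnce(store, exchange, "order-1", "none");
  return { submitCount: exchange.submitCount, storedOrderId: store.reserve("order-1").exchangeOrderId };
}
```

In both crash scenarios the retry should leave exactly one order on the exchange and the local record should point at it. The third case from the text, a crash between a partial fill and its processing, needs a fill-event source and is not covered here.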
A small paper-trading environment is also valuable, but it should not replace deterministic failure tests. Sandbox exchanges often behave more cleanly than production and may not reproduce delayed or out-of-order events.
Conclusion
Reliable trading automation is mostly state management under uncertainty. The strategy determines what you would like to do; the surrounding system determines whether that action can be performed once, within limits, and with a traceable result.
Before adding another indicator, make sure the system can answer five questions: What decision was made? Which limits were checked? Was an order submitted exactly once? What did the exchange actually fill? Can the strategy be paused and reconciled after a restart? If those answers are durable, the automation is becoming a system rather than a script.