The Supervision Tree That Watches the Market While You Sleep

This system has been trading real capital for about eight months. It's been running on a DigitalOcean droplet, unattended, for about five of those. It opens positions, manages stops, and reconciles with the broker every 30 seconds. The data feeds fail regularly and recover automatically.

This post covers the supervision tree that makes that work. It also covers what I'd build differently, because the current structure has a real flaw that's worth understanding.

What "unattended" actually requires

"Stays alive" isn't sufficient. The actual requirements are:

Failures must be isolated. A crash in one component shouldn't take down the whole session. If the market data feed dies, the order manager should keep running. If the order manager restarts, it shouldn't lose the current position.

Crashes must be recoverable — automatically. At 2am, there's nobody to restart a process. The system has to do it itself, with the same configuration it started with.

State must survive restarts. A process that restarts with no memory of what it was doing is dangerous in a trading context. The broker is the source of truth. The system needs to reconcile with it after any recovery.

Silent failure is unrecoverable. A crash you know about is a problem you can fix. A feed that stops delivering data while the system assumes it's running is how you end up holding a position you don't know about.

The BEAM's OTP supervision model addresses all of these — but how you structure your tree determines how well.

The static tree

Here's the relevant portion of what starts when the application boots:

children = [
  # ...infrastructure (Repo, caches, HTTP client, token stores)...
  {Registry, keys: :unique, name: MyApp.Registry},
  {Registry, keys: :duplicate, name: MyApp.Traders},
  {Registry, keys: :duplicate, name: MyApp.Testers},
  {DynamicSupervisor, name: MyApp.Servers, strategy: :one_for_one},
  MyApp.Trading.EnsureSessionsAreRunning,
  # ...web endpoint, auth supervisor...
]

opts = [strategy: :one_for_one, name: MyApp.Supervisor]
Supervisor.start_link(children, opts)

The infrastructure underneath — database, caches, HTTP client, token stores — starts under the same one_for_one supervisor and isn't relevant to the trading process lifecycle. The parts that matter here are the three Registries, the DynamicSupervisor, and EnsureSessionsAreRunning.
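As a rough sketch of what the unique Registry makes possible, a process can register under a session-scoped key so callers never hold a pid that goes stale after a restart. The `NamedWorker` module and the `{:trader, session_id}` key shape are assumptions for illustration; the real key scheme isn't shown in this post.

```elixir
# Hypothetical worker that registers itself under a session-scoped key.
defmodule NamedWorker do
  use GenServer

  # Builds a :via tuple so callers address the process by session id, not by pid.
  def via(session_id), do: {:via, Registry, {MyApp.Registry, {:trader, session_id}}}

  def start_link(session_id) do
    GenServer.start_link(__MODULE__, session_id, name: via(session_id))
  end

  def session_id(id), do: GenServer.call(via(id), :session_id)

  @impl true
  def init(session_id), do: {:ok, session_id}

  @impl true
  def handle_call(:session_id, _from, session_id), do: {:reply, session_id, session_id}
end

{:ok, _} = Registry.start_link(keys: :unique, name: MyApp.Registry)
{:ok, pid} = NamedWorker.start_link(42)

# Looking up the key returns the registered pid.
[{^pid, _value}] = Registry.lookup(MyApp.Registry, {:trader, 42})
```

Because the key is unique, a second start for the same session id fails with `{:error, {:already_started, pid}}` instead of creating a duplicate.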

The dynamic layer — what starts when a trading session begins

When a user starts a trading session, RealtimeTrader.new/1 fires:

def new(%TradingSession{} = session) do
  case DynamicSupervisor.start_child(MyApp.Servers, {__MODULE__, session}) do
    {:ok, _} -> {:ok, session.id}
    :ignore  -> {:error, :ignore}
    {:error, error} -> {:error, error}
  end
end

That kicks off a startup chain via handle_continue:

def handle_continue(%TradingSession{} = session, _state) do
  broker_account = session.broker_account
  # Resolve the broker-specific AccountManager module and build its config.
  # Each broker implements its own AM behind a behavior — this is where
  # the session becomes broker-aware.
  am_module = broker_account.broker.account_manager()
  am_config = struct(Module.concat(am_module, :Config), broker_account: broker_account)

  case am_module.new(session.id, am_config) do
    {:ok, _pid} -> account_manager_started(session, am_module, am_config)
    {:error, :normal} -> account_manager_started(session, am_module, am_config) # already running
    error -> error
  end
end

The AccountManager starts, then the DataSource starts, then the feeds start. In total, a single trading session spawns something like this inside MyApp.Servers:

MyApp.Servers (DynamicSupervisor, one_for_one)
├── RealtimeTrader          (restart: :transient)
├── TradeStationAccountManager  (restart: :transient)
├── TradeStationDataSource  (restart: :transient)
├── TSSession               (restart: :transient)
├── Bars (data feed)        (restart: :transient)
├── Orders (account feed)   (restart: :transient)
└── Positions (account feed)(restart: :transient)

All siblings. All under the same flat DynamicSupervisor. Run two concurrent sessions and there are as many as fourteen dynamic processes, with no structural relationship between the ones that belong together.

This flat layout works, but it has consequences worth understanding.

What `:transient` actually does

use GenServer, restart: :transient

:transient means the supervisor restarts the process only if it terminates abnormally, that is, with any exit reason other than :normal, :shutdown, or {:shutdown, term}. It is not restarted on a clean exit.

This is the right choice for trading session processes. When the market closes and a session ends cleanly, you don't want the supervisor restarting it. But if TSSession crashes — or if a data feed gets killed unexpectedly — the supervisor will start it again automatically with the same arguments.

That's the tree doing its job.
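Both halves of that behavior are easy to check in isolation. The following is a minimal sketch, not code from this system: `DemoFeed`, `TransientDemo`, and `DemoServers` are hypothetical names used only for the demonstration.

```elixir
# Hypothetical worker, used only to demonstrate :transient restart semantics.
defmodule DemoFeed do
  use GenServer, restart: :transient

  def start_link(name), do: GenServer.start_link(__MODULE__, :ok, name: name)

  @impl true
  def init(:ok), do: {:ok, %{}}
end

defmodule TransientDemo do
  # Polls until the registered name maps to a pid other than `old_pid`, or gives up.
  def await_restart(name, old_pid, attempts \\ 50)
  def await_restart(_name, _old_pid, 0), do: nil

  def await_restart(name, old_pid, attempts) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        Process.sleep(10)
        await_restart(name, old_pid, attempts - 1)
    end
  end
end

{:ok, _sup} = DynamicSupervisor.start_link(name: DemoServers, strategy: :one_for_one)
{:ok, first} = DynamicSupervisor.start_child(DemoServers, {DemoFeed, :demo_feed})

# Abnormal exit: the supervisor starts a replacement with the same arguments.
Process.exit(first, :kill)
second = TransientDemo.await_restart(:demo_feed, first)

# Clean exit with reason :normal: the child is removed rather than restarted.
:ok = GenServer.stop(second, :normal)
Process.sleep(50)
remaining = DynamicSupervisor.which_children(DemoServers)
```

After the kill, `second` is a new pid registered under the same name. After the clean stop, `remaining` is empty.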

The feed that dies constantly — and recovers

The TradeStation data feeds aren't WebSockets. They're chunked HTTP streams. This turns out to matter a lot.

TradeStation will idle-timeout a stream if it's quiet for too long. They'll also occasionally terminate a connection outright — with a literal "go away" message in the response body. They don't hide it.

When a feed process receives this, it exits with a non-normal reason. Since it's :transient, the DynamicSupervisor restarts it immediately with the same configuration it was started with. No intervention required. On startup, the feed fetches the two most recent candles, and the RealtimeTrader deduplicates as it receives them — so a restart mid-session doesn't produce a gap in the candle history.

The restart is automatic, the session keeps running, and there's a log entry to review the next morning.
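The deduplication step could look something like the sketch below. It assumes candles are maps keyed by a `ts` field and that a refetched candle should overwrite the stored one; the `CandleMerge` module and the candle shape are illustrative assumptions, not the RealtimeTrader's actual code.

```elixir
defmodule CandleMerge do
  # Merges newly received candles into the history, keyed by timestamp.
  # A candle whose timestamp is already present replaces the stored one,
  # so a bar refetched after a restart never appears twice.
  def merge(history, incoming) do
    (history ++ incoming)
    |> Map.new(fn candle -> {candle.ts, candle} end)
    |> Map.values()
    |> Enum.sort_by(& &1.ts)
  end
end

history = [%{ts: 1, close: 100.0}, %{ts: 2, close: 101.0}]

# After a restart the feed refetches the two most recent candles; one overlaps.
refetched = [%{ts: 2, close: 101.5}, %{ts: 3, close: 102.0}]

merged = CandleMerge.merge(history, refetched)
```

Here `merged` holds three candles, with the `ts: 2` entry taken from the refetch. Letting the newer copy win matters if the last bar was still forming when the feed died.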

The TSSession has an additional mechanism because it is a shared resource. It's not one-per-session; it's one per broker credential, so multiple trading sessions can share a single TSSession. To avoid shutting down a TSSession that some trading session still needs, it uses a heartbeat:

def handle_info(:still_needed?, state) do
  cutoff = DateTime.utc_now() |> DateTime.add(@seconds_unneeded_to_shutdown * -1)

  if DateTime.compare(cutoff, state.last_needed) == :gt do
    debug("No longer needed, goodbye!")
    state = shutdown_all_feeds(state)
    {:stop, :normal, Map.put(state, :data_feeds, %{})}
  else
    data_feeds = prune_stale_data_feeds(state.data_feeds, cutoff)
    Process.send_after(self(), :still_needed?, 30_000)
    {:noreply, Map.put(state, :data_feeds, data_feeds)}
  end
end

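Each AccountManager and DataSource sends a :still_needed cast every 15 seconds. If TSSession hasn't heard from anyone in 60 seconds, the next :still_needed? check stops it cleanly (:normal, so no restart). Because that check runs every 30 seconds, the actual shutdown lands somewhere between 60 and 90 seconds after the last heartbeat. If TSSession crashes instead, the :transient restart brings it back.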
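The receiving half of the heartbeat isn't shown above. A minimal sketch of it, assuming the state tracks a single `last_needed` timestamp, might look like this; `SharedSession` and `expired?/3` are hypothetical names, and the expiry rule is pulled into a pure function so it can be checked without waiting on timers.

```elixir
defmodule SharedSession do
  use GenServer, restart: :transient

  @seconds_unneeded_to_shutdown 60

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  # Dependents call this on their own timer to say "I still need you".
  def still_needed(server), do: GenServer.cast(server, :still_needed)

  # True once more than `limit` seconds have passed since the last heartbeat.
  def expired?(last_needed, now, limit \\ @seconds_unneeded_to_shutdown) do
    DateTime.diff(now, last_needed) > limit
  end

  @impl true
  def init(:ok), do: {:ok, %{last_needed: DateTime.utc_now()}}

  @impl true
  def handle_cast(:still_needed, state) do
    {:noreply, %{state | last_needed: DateTime.utc_now()}}
  end
end

now = ~U[2024-01-01 00:01:30Z]
stale = SharedSession.expired?(~U[2024-01-01 00:00:00Z], now)
fresh = SharedSession.expired?(~U[2024-01-01 00:01:00Z], now)
```

With 90 seconds of silence `stale` is true; with 30 seconds `fresh` is false.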

Position reconciliation — the broker is the source of truth

The broker is the source of truth. The RealtimeTrader enforces this with a periodic reconciliation loop.

Every 30 seconds, it polls the broker for the current position and compares it to internal state. If they disagree, it logs a warning and updates to match the broker. It doesn't try to automatically fix a discrepancy by placing orders — that way lies a bad time. The divergence log is the signal; the human decision about what to do with it comes separately.

Process.send_after(self(), :reconcile_position, 30_000)

This is also what makes restarts safe. If RealtimeTrader crashes and comes back up, it immediately reconciles with the broker. It doesn't assume its in-memory state was correct. A later post covers the reconciliation code in detail — and a production bug it took weeks to fully diagnose.
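Until that post, here is a stripped-down sketch of the shape of the loop. It is not the production code: `ReconcilingTrader` is a hypothetical module, the position is an arbitrary map, and the broker call is replaced by an injected zero-arity function so the example runs on its own.

```elixir
defmodule ReconcilingTrader do
  use GenServer
  require Logger

  @interval 30_000

  # `fetch_position` is a zero-arity function standing in for the broker call.
  def start_link(fetch_position), do: GenServer.start_link(__MODULE__, fetch_position)

  def position(server), do: GenServer.call(server, :position)

  # The broker always wins; the tag records whether local state had drifted.
  def reconcile(nil, broker), do: {:initial, broker}
  def reconcile(local, broker) when local == broker, do: {:in_sync, broker}

  def reconcile(local, broker) do
    Logger.warning("Position mismatch: local=#{inspect(local)} broker=#{inspect(broker)}")
    {:diverged, broker}
  end

  @impl true
  def init(fetch_position) do
    # Reconcile right away so a restarted process never trusts stale memory.
    send(self(), :reconcile_position)
    {:ok, %{position: nil, fetch_position: fetch_position}}
  end

  @impl true
  def handle_info(:reconcile_position, state) do
    {_status, position} = reconcile(state.position, state.fetch_position.())
    Process.send_after(self(), :reconcile_position, @interval)
    {:noreply, %{state | position: position}}
  end

  @impl true
  def handle_call(:position, _from, state), do: {:reply, state.position, state}
end

broker_position = %{symbol: "ES", quantity: 2}
{:ok, trader} = ReconcilingTrader.start_link(fn -> broker_position end)
current = ReconcilingTrader.position(trader)
```

The point of the sketch is that a mismatch only logs and adopts the broker's view; it never places an order.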

What `EnsureSessionsAreRunning` is actually doing

The static tree includes a process called EnsureSessionsAreRunning. It does something the supervision tree doesn't handle: recovering sessions after a node restart.

When the droplet reboots, MyApp.Servers starts empty. Active trading sessions that were running before the restart don't come back on their own — the DynamicSupervisor doesn't persist its children. EnsureSessionsAreRunning queries the database for sessions that should be active and starts them.

This is necessary. It's not a workaround for a supervision problem — it's the right place to put application-level startup logic that needs database access. The supervision tree handles process lifecycle; this handles application-level intent.
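The real module isn't shown here, but the idea can be sketched as a one-shot task in the static tree. Everything below is an assumption made for illustration: the `EnsureRunningSketch` name, and the two injected functions that stand in for the database query and for `RealtimeTrader.new/1`.

```elixir
defmodule EnsureRunningSketch do
  # One-shot task: runs once at boot and is not restarted after a normal finish.
  use Task, restart: :transient

  # `opts` carries two injected functions so the sketch runs without a database:
  #   :list_active   - returns the sessions that should be running
  #   :start_session - starts one session
  def start_link(opts), do: Task.start_link(__MODULE__, :run, [opts])

  def run(opts) do
    list_active = Keyword.fetch!(opts, :list_active)
    start_session = Keyword.fetch!(opts, :start_session)

    for session <- list_active.() do
      start_session.(session)
    end
  end
end

parent = self()

children = [
  {EnsureRunningSketch,
   list_active: fn -> [%{id: 1}, %{id: 2}] end,
   start_session: fn session -> send(parent, {:started, session.id}) end}
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)
```

Ordering matters in a layout like this: supervisors start children in list order, so the task has to come after the DynamicSupervisor it populates, which matches the child list shown earlier.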

What I'd do differently

The flat DynamicSupervisor works, but running it taught me something. Processes that belong together have no structural relationship — and that matters when one of them crashes.

When TSSession restarts, it comes back fresh. The TradeStationAccountManager and TradeStationDataSource that were using it don't restart — they're still running, holding configuration that referenced the old session. They'll fail on their next call, which triggers their own restarts. Eventually the group recovers, but it takes a few failure cycles to get there.

The right structure is a per-session supervisor. Setting aside TSSession and its feeds for a moment — they're shared across sessions and belong under a per-credential supervisor of their own — the core session processes would look like this:

# What it could look like
SessionSupervisor (one_for_all)
├── TradeStationAccountManager
├── TradeStationDataSource
└── RealtimeTrader

one_for_all here means if any one of the three crashes, all three restart together. The session comes back as a unit, in a known good state, rather than staggering back to coherence through a chain of timeouts.
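A minimal sketch of that structure, with stub workers standing in for the three real modules, might look like the following. `StubWorker`, `SessionSupervisorSketch`, and `SketchServers` are hypothetical names, and the sketch leaves out the hard part, which is migrating live sessions.

```elixir
# Stand-in for the three session processes; the real modules are not shown here.
defmodule StubWorker do
  use GenServer

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg)

  @impl true
  def init(arg), do: {:ok, arg}
end

defmodule SessionSupervisorSketch do
  use Supervisor

  def start_link(session_id), do: Supervisor.start_link(__MODULE__, session_id)

  @impl true
  def init(session_id) do
    children =
      for role <- [:account_manager, :data_source, :realtime_trader] do
        # Each child needs a distinct id because all three use the same stub module.
        Supervisor.child_spec({StubWorker, {role, session_id}}, id: role)
      end

    Supervisor.init(children, strategy: :one_for_all)
  end
end

# The per-session supervisor still lives under a flat DynamicSupervisor.
{:ok, _} = DynamicSupervisor.start_link(name: SketchServers, strategy: :one_for_one)
{:ok, session_sup} = DynamicSupervisor.start_child(SketchServers, {SessionSupervisorSketch, 1})

child_pids = fn ->
  for {_id, pid, _type, _modules} <- Supervisor.which_children(session_sup), do: pid
end

old_pids = child_pids.()

# Crash one member; one_for_all terminates and restarts the other two as well.
Process.exit(hd(old_pids), :kill)
Process.sleep(100)
new_pids = child_pids.()
```

After the crash, none of the pids in `new_pids` appear in `old_pids`: the whole group was replaced, not only the process that died.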

I haven't made this change yet. Refactoring process supervision in a running system that manages real capital requires careful sequencing. The current design is stable enough that the refactor isn't the highest priority right now. But it's the right next step, and it'll happen.

The tree is running, it's managing real capital, and now you've seen what it looks like.

The next three posts cover what happened when I tried to integrate IBKR — Interactive Brokers — as the primary broker. It starts with a crypto primitive that Elixir couldn't handle, runs through a Python port I built to work around it, and ends with a class of silent failures in production that took weeks to fully understand. By the end, I had walked away from the integration entirely.

It's a complete arc: a language bridge that worked exactly as designed, a broker that didn't, and the decision to start over with something better.

The supervision refactor — migrating a live flat DynamicSupervisor to per-session supervisors without a full shutdown — is the next architectural problem I'm working through. If you've done it, I'd like to hear how.


