A polite scraper hits a flaky source and waits 30 seconds for a TCP timeout. It does this 100 times an hour. That is 50 minutes of wall-clock time per hour spent waiting on one dead source. Meanwhile every other adapter is healthy and the dashboard looks fine. The fix is a circuit breaker, and in Elixir it is about 80 lines of GenServer wrapped around a few counters.
# The three states
A circuit breaker is a tiny state machine:
- closed: traffic flows, failures are counted.
- open: traffic is rejected immediately with {:error, :circuit_open}.
- half-open: one probe request is allowed; success closes, failure re-opens.
The trick is the transition rules. Too sensitive and you flap. Too tolerant and you are back to 50 minutes per hour of timeouts.
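The three states and their transitions can be written down as a pure function, which makes the rules easy to test before any process is involved. This is a minimal sketch: the module name `BreakerStates` and the event names (`:trip`, `:cooldown_elapsed`, `:probe_ok`, `:probe_failed`) are assumptions made for illustration, not part of the breaker shown in this post.

```elixir
defmodule BreakerStates do
  @moduledoc "Hypothetical sketch of the three-state machine as a pure function."

  @type state :: :closed | :open | :half_open
  @type event :: :trip | :cooldown_elapsed | :probe_ok | :probe_failed

  @spec next(state(), event()) :: state()
  # closed -> open when the failure rules trip
  def next(:closed, :trip), do: :open
  # open -> half-open once the cool-down has elapsed
  def next(:open, :cooldown_elapsed), do: :half_open
  # half-open -> closed on a successful probe, back to open on a failed one
  def next(:half_open, :probe_ok), do: :closed
  def next(:half_open, :probe_failed), do: :open
  # any other pairing leaves the state unchanged
  def next(state, _event), do: state
end
```

Keeping the table this small is deliberate: the hard part is deciding when to emit `:trip`, not moving between states.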
# The rules that actually work
- Open on: 5+ failures in a 30-second rolling window OR 3 consecutive failures.
- Cool-down: exponential, starting at 10s, doubling up to 5 minutes.
- Half-open probe: exactly one in-flight probe. Concurrent probes defeat the purpose.
- Close on: probe success.
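The open and cool-down rules above can be sketched as two pure functions. The module name `BreakerRules`, the function names, and the choice to track failures as a list of millisecond timestamps are all assumptions; the thresholds are the ones listed above.

```elixir
defmodule BreakerRules do
  @moduledoc "Hypothetical helpers for the open and cool-down rules."

  @window_ms 30_000
  @window_limit 5
  @consecutive_limit 3
  @base_cooldown_ms 10_000
  @max_cooldown_ms 300_000

  # failure_times: millisecond timestamps of recent failures.
  # consecutive: failures seen since the last success.
  def trip?(failure_times, consecutive, now_ms) do
    in_window = Enum.count(failure_times, fn t -> now_ms - t < @window_ms end)
    in_window >= @window_limit or consecutive >= @consecutive_limit
  end

  # opens: how many times in a row the breaker has opened (1 for the first).
  def cooldown_ms(opens) when opens >= 1 do
    # Cap the exponent; 10s * 2^5 already exceeds the 5-minute ceiling.
    exp = min(opens - 1, 5)
    min(@base_cooldown_ms * Integer.pow(2, exp), @max_cooldown_ms)
  end
end
```

With these numbers the cool-down runs 10s, 20s, 40s, 80s, 160s and then stays at 300s. This sketch assumes the `opens` counter is reset when a probe succeeds; the rules above do not say when the backoff resets, so treat that as a design choice to make explicitly.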
```elixir
defmodule PrismaticOsint.Breaker do
  use GenServer

  # Runs `fun` through the breaker registered as `name`.
  # The GenServer callbacks, record/2 and probe/2 are omitted from this excerpt.
  def call(name, fun) do
    case GenServer.call(name, :state) do
      :open -> {:error, :circuit_open}
      :closed -> record(name, safely(fun))
      :half_open -> probe(name, fun)
    end
  end

  # Converts any raise, throw or exit inside `fun` into an error tuple.
  defp safely(fun) do
    try do
      {:ok, fun.()}
    catch
      kind, reason -> {:error, {kind, reason}}
    end
  end
end
```

The try/catch wraps the user function so a crash never leaves the breaker in an inconsistent state. The :error path feeds the counter; the :ok path resets it.
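The excerpt reads the state in one call and acts on it afterwards, so the "exactly one in-flight probe" rule has to be enforced wherever the probe slot is handed out. One way to do that is to claim the slot inside a single GenServer call, since a GenServer handles its messages one at a time. The following is a sketch under that assumption; `ProbeGate`, `acquire/1` and `report/2` are hypothetical names, not the internals of the breaker above.

```elixir
defmodule ProbeGate do
  @moduledoc "Hypothetical sketch: hand out at most one half-open probe at a time."
  use GenServer

  def start_link(mode \\ :closed), do: GenServer.start_link(__MODULE__, mode)

  # Returns :run when closed, :probe to exactly one caller while half-open,
  # and :reject to everyone else.
  def acquire(server), do: GenServer.call(server, :acquire)

  # The probing caller reports its outcome, which settles the state.
  def report(server, result) when result in [:ok, :error],
    do: GenServer.call(server, {:report, result})

  @impl true
  def init(mode), do: {:ok, %{mode: mode, probing?: false}}

  @impl true
  def handle_call(:acquire, _from, %{mode: :closed} = s), do: {:reply, :run, s}
  def handle_call(:acquire, _from, %{mode: :open} = s), do: {:reply, :reject, s}

  # The check and the claim happen in the same message, so they cannot interleave.
  def handle_call(:acquire, _from, %{mode: :half_open, probing?: false} = s),
    do: {:reply, :probe, %{s | probing?: true}}

  def handle_call(:acquire, _from, %{mode: :half_open, probing?: true} = s),
    do: {:reply, :reject, s}

  def handle_call({:report, :ok}, _from, s),
    do: {:reply, :ok, %{s | mode: :closed, probing?: false}}

  def handle_call({:report, :error}, _from, s),
    do: {:reply, :ok, %{s | mode: :open, probing?: false}}
end
```

This sketch leaves out one case that matters in production: if the probing process dies before calling `report/2`, the slot stays claimed. Monitoring the prober with `Process.monitor/1` and releasing the slot on `:DOWN` is a common way to close that gap.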
# Wire it into the pipeline
Every OSINT adapter gets wrapped at the boundary:
```elixir
def search(query, _opts) do
  # breaker_for/1 and url/1 are adapter helpers not shown here.
  Breaker.call(breaker_for(__MODULE__), fn ->
    HTTPoison.get!(url(query), [], recv_timeout: 5_000)
  end)
end
```

That is it. The adapter does not know about the breaker's internal state. The pipeline does not know about the adapter's flakiness. Both stay boring, which is the goal.
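One detail worth checking when wiring an adapter: `HTTPoison.get!` raises on transport errors such as timeouts, but an HTTP 5xx comes back as an ordinary response, so the breaker would count it as a success. The sketch below shows one possible way to handle that, plus a mapping from the breaker's result shapes to pipeline decisions. `AdapterOutcome`, `ensure_success!/1`, `normalize/1` and the `:skip` / `:failed` tags are hypothetical names chosen for illustration.

```elixir
defmodule AdapterOutcome do
  @moduledoc "Hypothetical helpers around a breaker-wrapped adapter call."

  # Raise on 5xx so server errors count as failures inside the wrapped function.
  # Matches any map or struct that carries a :status_code field.
  def ensure_success!(%{status_code: code} = response) when code < 500, do: response
  def ensure_success!(%{status_code: code}), do: raise("upstream returned #{code}")

  # Map the breaker's result shapes onto pipeline decisions.
  def normalize({:ok, response}), do: {:ok, response}
  def normalize({:error, :circuit_open}), do: {:skip, :circuit_open}
  def normalize({:error, {kind, reason}}), do: {:failed, {kind, reason}}
end
```

Inside the wrapped function this would read as `url(query) |> HTTPoison.get!([], recv_timeout: 5_000) |> AdapterOutcome.ensure_success!()`. Whether 4xx responses should also count as failures depends on the source; this sketch treats them as the caller's problem, not the upstream's.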
# Telemetry is non-negotiable
Every state transition emits telemetry:
```elixir
:telemetry.execute([:osint, :breaker, :open], %{count: 1}, %{adapter: name})
:telemetry.execute([:osint, :breaker, :half_open], %{count: 1}, %{adapter: name})
:telemetry.execute([:osint, :breaker, :closed], %{count: 1}, %{adapter: name})
```

A breaker that opens and never closes is a source that died. A breaker that flaps is a source with intermittent issues (and possibly a wrong cool-down). Without observability you cannot tell which is which.
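Emitting events only helps if something is listening. A minimal sketch of a handler that logs each transition, assuming the three event names shown here and the `:telemetry` dependency; the module name `BreakerTelemetry` and the handler id `"osint-breaker-logger"` are made up for this example.

```elixir
defmodule BreakerTelemetry do
  @moduledoc "Hypothetical handler that logs breaker state transitions."
  require Logger

  @events [
    [:osint, :breaker, :open],
    [:osint, :breaker, :half_open],
    [:osint, :breaker, :closed]
  ]

  def events, do: @events

  # Matches the 4-arity handler shape that :telemetry expects:
  # event name, measurements, metadata, handler config.
  def handle_event([:osint, :breaker, state], %{count: count}, %{adapter: adapter}, _config) do
    Logger.warning("breaker #{inspect(adapter)} -> #{state} (count=#{count})")
  end

  # Call once at application start; needs the :telemetry dependency loaded.
  def attach do
    :telemetry.attach_many("osint-breaker-logger", @events, &__MODULE__.handle_event/4, nil)
  end
end
```

A log line per transition is enough to separate the two failure shapes by eye: a dead source shows one `open` followed by repeated `half_open` and `open` pairs at growing intervals, while a flapping source alternates `open` and `closed`. For dashboards, the same events can feed a metrics reporter instead of the logger.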
# Where to go next
- Academy: OTP Fundamentals (the GenServer under the breaker)
- Glossary: Fault Tolerance, Retry, Rate Limiting, Telemetry, Observability
Fail fast. Recover quietly. Let the healthy sources do the work.