Turnbull, James. The Art of Monitoring. June 2016. Ebook via O’Reilly Queue.
Selective reading to focus on overall approach and application monitoring, since those are the parts relevant to me.
Approach:
- Proposes an approach that copes well with dynamic hosts at large scale (one-way flow of data from host to collector, rather than Nagios-style polling of hosts).
- Replaces Booleans with numbers - down becomes “no new measurement in the last 5 seconds”, up becomes “here is some actually useful data”. Static thresholds can be replaced by anomaly detection. (Sketch below.)
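A minimal sketch of that push model in Python (the names `record` and `stale_hosts` are mine, not the book’s):

```python
import time

# Collector side: timestamp of the last metric received from each host.
last_seen = {}

def record(host, name, value, ts=None):
    """Hosts push metrics one-way; the collector just records them."""
    last_seen[host] = ts if ts is not None else time.time()
    # ... store (host, name, value) in the time-series backend ...

def stale_hosts(max_age_seconds=5):
    """'Down' re-expressed as 'no new measurement in the last N seconds'."""
    now = time.time()
    return [h for h, ts in last_seen.items() if now - ts > max_age_seconds]
```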
Used a bunch of tooling (Riemann, collectd, StatsD, Graphite/Whisper/Grafana, ELK), but the author has a newer book out on Prometheus, since that has emerged as especially popular lately.
App metrics: Tech events, performance, etc. Help guide devs or ops.
Business metrics: Often the same events but different measures - sum order values rather than just counting orders. Help guide business decisions and support the case that IT creates revenue.
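Roughly what that looks like with a StatsD client, assuming the Python `statsd` package and a placeholder endpoint (the metric names are made up for illustration):

```python
from statsd import StatsClient

statsd = StatsClient(host="localhost", port=8125)  # placeholder host/port

def on_order_placed(order_total_cents, checkout_ms):
    # App metrics: counts and timings for devs/ops.
    statsd.incr("orders.placed")
    statsd.timing("orders.checkout_ms", checkout_ms)
    # Business metric from the same event: accumulate order value,
    # not just order count.
    statsd.incr("orders.revenue_cents", order_total_cents)
```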
New to me is the idea of recording metrics (think StatsD counters or timers) that mirror your structured logging. I’m not sure what this buys you if you already have structured logs, except that perhaps the logs may be pruned due to log level or other config, while the metrics will not. (Compare analytics vs logging - people will religiously mute ALL their logs in Release “because logs kill kittens”, while letting Firebase or Fabric do scads of work, including regularly pinging on a timer. Mumble mumble letter over spirit of guidance mumble mumble.)
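One way to read the “metrics mirror your structured logs” idea, again a sketch with invented names rather than the book’s code:

```python
import json
import logging

from statsd import StatsClient

log = logging.getLogger("app")
statsd = StatsClient("localhost", 8125)  # placeholder endpoint

def log_event(event, **fields):
    # Structured log entry - may be muted by log level or other config...
    log.info(json.dumps({"event": event, **fields}))
    # ...mirrored by a counter, which survives even if the logs are pruned.
    statsd.incr(f"events.{event}")

# log_event("payment.failed", order_id="1234", reason="card_declined")
```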
Another neat idea: in non-PROD environments, send logs/metrics straight to a central host instead of a local sidecar daemon - less setup trouble for DEV.
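In practice that could be as small as an environment switch (hostnames and the env var here are placeholders, not the book’s config):

```python
import os

from statsd import StatsClient

# In PROD, talk to the local sidecar daemon; everywhere else, go straight
# to a shared central host so developers don't have to run the daemon.
if os.environ.get("APP_ENV") == "production":
    statsd = StatsClient("localhost", 8125)
else:
    statsd = StatsClient("metrics.dev.example.com", 8125)
```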
Easy-to-miss events:
- Deployments: You want to see these alongside other metrics!
- Maintenance: You want to shut up alerts when you intentionally bring the system down (see the sketch after this list).
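A toy illustration of both - everything here (names, in-memory storage) is my own, not from the book:

```python
import time

deployments = []          # (timestamp, version) records to overlay on graphs
maintenance_until = 0.0   # epoch seconds; alerts are muted until this time

def record_deployment(version):
    # Emit deployments as events so they show up alongside other metrics.
    deployments.append((time.time(), version))

def start_maintenance(minutes):
    # Intentionally silence alerts while the system is brought down.
    global maintenance_until
    maintenance_until = time.time() + minutes * 60

def should_alert():
    return time.time() >= maintenance_until
```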
It’s unfortunate there’s no standard for structured logging, but I can’t say I’m surprised - it’s too “thin” a concept to support a standard. No one feels they need it.