Datadog: Show Metric Usage Warning From HPA Metrics

Problem

When editing a metric in the Datadog UI (e.g. on /metrics/summary), a warning is shown if the metric is in use (e.g. by a dashboard or monitor). But if the metric is used by a Kubernetes HorizontalPodAutoscaler, no such warning is shown.

Solution

Generate a dashboard with one widget for every query an HPA uses, so the normal in-use warning shows up when someone edits one of those metrics.

require 'kennel'

class HpaDashboard
  SOURCE_METRIC = "datadog.cluster_agent.external_metrics.delay_seconds".freeze
  attr_reader :id

  def initialize(id, timeframe:)
    @id = id
    @api = Kennel::Api.new
    @from = Time.now.to_i - timeframe
  end

  # see https://docs.datadoghq.com/api/latest/metrics/#get-active-metrics-list
  # this has an undocumented limit of 250000 metrics so we can't just use super old @from
  # also tried /api/v2/metrics which returns similar results but is even slower (filtering it with 'queried' + big window did not help)
  def available_metrics
    @api.send(
      :request, :get, "/api/v1/metrics",
      params: { from: @from }
    ).fetch(:metrics).to_set
  end

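  # the cluster agent reports SOURCE_METRIC tagged with metric:<external metric name>
  # for everything an HPA queries, so grouping by {metric} tells us which queries are in use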
  def queries_used_by_any_hpa
    @api.send(
      :request, :get, "/api/v1/query",
      params: {
        query: "avg:#{SOURCE_METRIC}{*} by {metric}",
        from: @from,
        to: Time.now.to_i
      }
    ).fetch(:series).map do |data|
      data.fetch(:scope).split(",").to_h { |t| t.split(":", 2) }["metric"]
    end.uniq
  end

  # convert fallout from query normalization back to actual metric names
  # for example default_zero(foo{a:b}) is converted to "default_zero_foo_a:b"
  # this ignores when multiple metrics are in a single query for example a / b * 100
  # since a and b are usually the same
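  #
  # for example (hypothetical queries):
  #   extract_metrics(["default_zero_foo_a:b", "ewma_20_bar_baz", "thing.total_100"])
  #   # => #<Set: {"bar_baz", "foo_a:b", "thing.total"}>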
  def extract_metrics(queries)
    queries = queries.dup
    queries.each do |query|
      query.sub!(/\.total_\d+$/, ".total") # math leftover *.total_100 -> *.total
      query.sub!(/^_*(ewma_\d+|default_zero)_*/, "") # remove math
    end
    queries.uniq!
    queries.sort! # for debug printing and to keep the dashboard stable
    queries.to_set
  end

  # since available_metrics is not reliable (hits limit or just has old data)
  # we verify each potentially unknown metric 1-by-1 by hitting this cheap endpoint
  # https://docs.datadoghq.com/api/latest/metrics/?code-lang=curl#get-metric-metadata
  def slow_filter_unknown!(unknown)
    unknown.select! do |metric|
      print "Verifying potentially unknown metric #{metric} ..."
      not_found = @api.send(:request, :get, "/api/v1/metrics/#{metric}", ignore_404: true)[:error]
      print "#{not_found ? "not found" : "found"}\n"
      not_found # keep the truly not found
    end
  end

  def update(used_metrics)
    attributes = {
      title: "HPA metrics used",
      description: <<~DESC,
        1 widget for each metric used in compute maintained kubernetes clusters (anything that reports #{SOURCE_METRIC})
        Automatically filled by a `rake hpa_dashboard` cron from kennel GHA.
        Last updated: #{Time.now} #{$stdout.tty? ? "manually" : RakeHelper.ci_url}
      DESC
      layout_type: "ordered",
      reflow_type: "auto",
      tags: ["team:compute", "team:compute-accelerate"],
      widgets: used_metrics.map do |m|
        {
          definition: {
            title: m,
            type: "timeseries",
            requests: [
              {
                response_format: "timeseries",
                queries: [
                  {
                    name: "query1",
                    data_source: "metrics",
                    query: "avg:#{m}{*}"
                  }
                ],
                display_type: "line"
              }
            ]
          }
        }
      end
    }
    @api.update("dashboard", @id, attributes)
  end
end

desc "Update hpa dashboard to track all currently used external metrics people that change metrics in the UI see that they are used"
task hpa_dashboard: "kennel:environment" do
  dashboard = HpaDashboard.new(DASHBOARD_ID, timeframe: 24 * 60 * 60)

  available_metrics = dashboard.available_metrics
  puts "Found #{available_metrics.size} available metrics"

  used_queries = dashboard.queries_used_by_any_hpa
  puts "Found #{used_queries.size} used queries"

  used_metrics = dashboard.extract_metrics(used_queries)
  puts "Found #{used_metrics.size} used metrics"

  # validate we found everything
  unknown = used_metrics - available_metrics
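  # skip the slow one-by-one verification when there are too many unknowns (most likely a parsing bug)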
  dashboard.slow_filter_unknown! unknown if unknown.size < 100
  if unknown.any?
    $stdout.flush # otherwise mixes with stderr in GHA
    abort <<~MSG
    #{unknown.size} unknown metrics found, these would not be displayable on the dashboard, improve the parsing code
    usually that means some part of the metric name got mangled and it cannot be found in datadog
    see https://datadoghq.com/metric/summary to find valid metrics

    #{unknown.join("\n")}
    MSG
  end

  dashboard.update used_metrics
  puts "Updated dashboard https://datadoghq.com/dashboard/#{dashboard.id}"
rescue Exception # rubocop:disable Lint/RescueException
  unless $stdout.tty? # do not spam slack when debugging
    send_to_slack <<~MSG
      HPA dashboard update failed #{RakeHelper.ci_url}, fix it!
    MSG
  end
  raise
end
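
The task references a few helpers that are defined elsewhere in our setup. Hypothetical stand-ins (the names appear in the task above, but the values and implementations here are made up for illustration) could look roughly like this:

require "net/http"
require "json"

DASHBOARD_ID = "abc-123-def" # id of the dashboard the task keeps overwriting

module RakeHelper
  # link to the current GitHub Actions run, built from env vars GHA provides
  def self.ci_url
    "#{ENV["GITHUB_SERVER_URL"]}/#{ENV["GITHUB_REPOSITORY"]}/actions/runs/#{ENV["GITHUB_RUN_ID"]}"
  end
end

def send_to_slack(message)
  # post the message to a slack incoming webhook (SLACK_WEBHOOK_URL is a made-up env var name)
  Net::HTTP.post(URI(ENV.fetch("SLACK_WEBHOOK_URL")), { text: message }.to_json, "Content-Type" => "application/json")
end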

Datadog: Show Brittle Monitors

With the help of Datadog's unofficial search_events endpoint we can see which of our monitors fail the most, which is a great place to start when trying to reduce alert spam.
(using Kennel)

rake brittle TAG=team:foo
analyzing 104 monitors ... 10.9s
Foo too high🔒
https://app.datadoghq.com/monitors/12345
Frequency: 18.95/h
success: 56x
warning: 44x


desc "Show how brittle selected teams monitors are TAG="
task brittle: "kennel:environment" do
  monitors = Kennel.send(:api).list("monitor", with_downtimes: false, monitor_tags: [ENV.fetch("TAG")])
  abort "No monitors found" if monitors.empty?

  hour = 60 * 60
  interval = 7 * 24 * hour
  now = Time.now.to_i
  max = 100

  data = Kennel::Progress.progress "analyzing #{monitors.size} monitors" do
    Kennel::Utils.parallel monitors do |monitor|
      events = Kennel.send(:api).list("monitor/#{monitor[:id]}/search_events", from_ts: now - interval, to_ts: now, count: max, start: 0)
      next if events.empty?

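      # date_detected is in ms; measuring from the last returned event keeps the rate
      # meaningful even when we hit the `max` cap (assuming the API returns newest events first)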
      duration = now - (events.last.fetch(:date_detected) / 1000)
      amount = events.size
      frequency = amount * (hour / duration.to_f)
      [monitor, frequency, events]
    end.compact
  end

  # spammy first
  data.sort_by! { |_, frequency, _| -frequency }

  data.each do |m, frequency, events|
    groups = events.group_by { |e| e.fetch(:alert_type) }
    groups.sort_by(&:first) # sort by alert_type

    puts m.fetch(:name)
    puts "https://zendesk.datadoghq.com/monitors/#{m.fetch(:id)}"
    puts "Frequency: #{frequency.round(2)}/h"
    groups.each do |type, grouped_events|
      puts "#{type}: #{grouped_events.size}x"
    end
    puts
  end
end

Listing Unmuted Datadog Alerts

Datadog's UI for alerting monitors also shows muted ones, with no option to filter them out, which leads to overhead and confusion when trying to track down what exactly is down.

So we added an alerts task to kennel that lists all unmuted alerts and how long they have been alerting. It also shows alerts that have no-data warnings, even though the Datadog UI shows them as not alerting.

bundle exec rake kennel:alerts TAG=team:my-team
Downloading ... 5.36s
Foo certs will expire soon🔒
https://app.datadoghq.com/monitors/123
Ignored cluster:pod1,server:10.215.225.122 39:22:00
Ignored cluster:pod12,server:10.218.176.123 31:41:00


Foobar Errors (Retry Limit Exceeded)🔒
https://app.datadoghq.com/monitors/1234
Alert cluster:pod2 19:05:16
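
For illustration, here is a minimal sketch of the idea behind such a task (hypothetical task name alerts_sketch; it relies on the v1 monitor API's group_states parameter and the silenced/state fields, and the real kennel:alerts task is more thorough):

desc "List unmuted alerting monitor groups TAG="
task alerts_sketch: "kennel:environment" do
  monitors = Kennel.send(:api).list(
    "monitor",
    monitor_tags: [ENV.fetch("TAG")],
    group_states: "alert,no data" # also surface No Data groups, which the UI hides
  )

  monitors.each do |monitor|
    silenced = monitor.dig(:options, :silenced) || {}
    next if silenced.key?(:"*") # the whole monitor is muted

    groups = monitor.dig(:state, :groups) || {}
    groups = groups.reject { |scope, _| silenced.key?(scope) } # drop muted groups
    next if groups.empty?

    puts monitor.fetch(:name)
    puts "https://app.datadoghq.com/monitors/#{monitor.fetch(:id)}"
    groups.each do |scope, group|
      seconds = Time.now.to_i - group.fetch(:last_triggered_ts)
      puts "#{group.fetch(:status)} #{scope} #{format("%02d:%02d:%02d", seconds / 3600, (seconds % 3600) / 60, seconds % 60)}"
    end
    puts
  end
end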