Datadog: Show Brittle Monitors

With the help of Datadog's unofficial search_events endpoint we can see which of our monitors fail the most, which is a great place to start when trying to reduce alert spam.
(using Kennel)

rake brittle TAG=team:foo
analyzing 104 monitors ... 10.9s
Foo too high🔒
https://app.datadoghq.com/monitors/12345
Frequency: 18.95/h
success: 56x
warning: 44x

 

desc "Show how brittle selected teams monitors are TAG="
task brittle: "kennel:environment" do
  monitors = Kennel.send(:api).list("monitor", with_downtimes: false, monitor_tags: [ENV.fetch("TAG")])
  abort "No monitors found" if monitors.empty?

  hour = 60 * 60
  interval = 7 * 24 * hour
  now = Time.now.to_i
  max = 100

  data = Kennel::Progress.progress "analyzing #{monitors.size} monitors" do
    Kennel::Utils.parallel monitors do |monitor|
      # unofficial endpoint that lists the monitor's recent alert/warning/success events
      events = Kennel.send(:api).list("monitor/#{monitor[:id]}/search_events", from_ts: now - interval, to_ts: now, count: max, start: 0)
      next if events.empty?

      # date_detected is in milliseconds; measure from the oldest returned event to now
      duration = now - (events.last.fetch(:date_detected) / 1000)
      amount = events.size
      frequency = amount * (hour / duration.to_f) # events per hour within that window
      [monitor, frequency, events]
    end.compact
  end

  # spammy first
  data.sort_by! { |_, frequency, _| -frequency }

  data.each do |m, frequency, events|
    groups = events.group_by { |e| e.fetch(:alert_type) }
    groups = groups.sort_by(&:first) # sort by alert_type

    puts m.fetch(:name)
    puts "https://zendesk.datadoghq.com/monitors/#{m.fetch(:id)}"
    puts "Frequency: #{frequency.round(2)}/h"
    groups.each do |type, grouped_events|
      puts "#{type}: #{grouped_events.size}x"
    end
    puts
  end
end

Ruby Capture Stdout Without $stdout

Rake’s sh, for example, cannot be captured with the usual $stdout = StringIO.new trick, but some old ActiveSupport code that reopens STDOUT onto a Tempfile does the job.

require 'tempfile'

# StringIO does not work with Ruby's `system` call that `sh` uses under the hood, so use Tempfile + reopen
def capture_stdout
  old_stdout = STDOUT.dup
  Tempfile.open('tee_stdout') do |capture|
    STDOUT.reopen(capture)
    yield
    capture.rewind
    capture.read
  end
ensure
  STDOUT.reopen(old_stdout)
end
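
A quick usage sketch, assuming this is called inside a Rake task where sh is available:

output = capture_stdout do
  sh "echo hello" # sh shells out via Kernel#system, so the child writes to the redirected file descriptor
end
output.include?("hello") # => true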

Listing Unmuted Datadog Alerts

Datadog's UI for alerting monitors also shows muted ones without an option to filter them out, which leads to overhead and confusion when trying to track down what exactly is down.

So we added an alerts task to kennel that lists all unmuted alerts and how long they have been alerting. It also shows alerts that have no-data warnings, even though the Datadog UI shows them as not alerting.

bundle exec rake kennel:alerts TAG=team:my-team
Downloading ... 5.36s
Foo certs will expire soon🔒
https://app.datadoghq.com/monitors/123
Ignored cluster:pod1,server:10.215.225.122 39:22:00
Ignored cluster:pod12,server:10.218.176.123 31:41:00


Foobar Errors (Retry Limit Exceeded)🔒
https://app.datadoghq.com/monitors/1234
Alert cluster:pod2 19:05:16
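
For reference, here is a minimal sketch of what such a task could look like, assuming the v1 monitor API's group_states parameter and its state.groups / options.silenced fields (the real kennel task handles more edge cases):

desc "Show unmuted alerting monitors TAG="
task alerts: "kennel:environment" do
  monitors = Kennel.send(:api).list(
    "monitor",
    monitor_tags: [ENV.fetch("TAG")],
    group_states: "alert,warn,no data"
  )

  now = Time.now.to_i
  monitors.each do |monitor|
    muted = monitor.dig(:options, :silenced) || {} # group name => muted-until timestamp
    next if muted.key?(:"*") # whole monitor is muted

    groups = (monitor.dig(:state, :groups) || {}).reject { |name, _| muted.key?(name) }
    next if groups.empty?

    puts monitor.fetch(:name)
    puts "https://app.datadoghq.com/monitors/#{monitor.fetch(:id)}"
    groups.each do |name, group|
      hours = (now - group.fetch(:last_triggered_ts)) / 3600.0 # how long this group has been alerting
      puts "#{group.fetch(:status)} #{name} #{hours.round(1)}h"
    end
    puts
  end
end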

Finding latest AWS ECR image in all repositories

We had a lot of ECR repos that needed to be taken down, so I ran this little aws-cli + Ruby + jq script to sanity-check that all the images in them are old.

require 'json'

# list all repository names
repos = `aws ecr describe-repositories | jq .repositories[].repositoryName --raw-output`.split("\n")

# for each repo, find when its newest image was pushed
pushed = repos.map do |repo|
  out = `aws ecr describe-images --repository-name #{repo} --output json --query 'sort_by(imageDetails, &imagePushedAt)[*]'`
  print '.' # progress indicator
  next unless image = JSON.parse(out).last # list is sorted ascending, so last is the newest; skips empty repos
  Integer(image["imagePushedAt"])
end.compact

# most recent push across all repos
puts Time.at(pushed.max)

Ruby: Waiting for one of multiple threads to finish

We built a small project that watches multiple metrics until one of them finds something. I found ThreadsWait in the stdlib, and it was easy to use. I also added error re-raising so the threads do not die silently, plus cleanup of the remaining threads.

require 'thwait'

def wait_for_first_block_to_complete(*blocks)
  threads = blocks.map do |block|
    Thread.new do
      block.call
    rescue StandardError => e
      e
    end
  end
  waiter = ThreadsWait.new(*threads)
  value = waiter.next_wait.value # block until the first thread finishes, then grab its result
  threads.each(&:kill) # stop the remaining threads
  raise value if value.is_a?(StandardError) # re-raise if the winning block failed
  value
end

wait_for_first_block_to_complete(
  -> { sleep 5 }, -> { sleep 1 }, -> { sleep 2 }
) # will stop after 1 second
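
And since errors are re-raised, a failing block that finishes first surfaces its exception in the caller:

wait_for_first_block_to_complete(
  -> { sleep 2 },
  -> { raise ArgumentError, "boom" }
) # raises ArgumentError right away, the sleeping thread gets killed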