Datadog: Show Brittle Monitors

With the help of datadogs unofficial search_events endpoint we can see which of our monitors fail the most, a great place to start when trying to reduce alert spam.
(using Kennel)

rake brittle TAG=team:foo
analyzing 104 monitors ... 10.9s
Foo too highđź”’
https://app.datadoghq.com/monitors/12345
Frequency: 18.95/h
success: 56x
warning: 44x

 

desc "Show how brittle selected teams monitors are TAG="
task brittle: "kennel:environment" do
  monitors = Kennel.send(:api).list("monitor", with_downtimes: false, monitor_tags: [ENV.fetch("TAG")])
  abort "No monitors found" if monitors.empty?

  hour = 60 * 60
  interval = 7 * 24 * hour
  now = Time.now.to_i
  max = 100

  data = Kennel::Progress.progress "analyzing #{monitors.size} monitors" do
    Kennel::Utils.parallel monitors do |monitor|
      events = Kennel.send(:api).list("monitor/#{monitor[:id]}/search_events", from_ts: now - interval, to_ts: now, count: max, start: 0)
      next if events.empty?

      duration = now - (events.last.fetch(:date_detected) / 1000)
      amount = events.size
      frequency = amount * (hour / duration.to_f)
      [monitor, frequency, events]
    end.compact
  end

  # spammy first
  data.sort_by! { |_, frequency, _| -frequency }

  data.each do |m, frequency, events|
    groups = events.group_by { |e| e.fetch(:alert_type) }
    groups.sort_by(&:first) # sort by alert_type

    puts m.fetch(:name)
    puts "https://zendesk.datadoghq.com/monitors/#{m.fetch(:id)}"
    puts "Frequency: #{frequency.round(2)}/h"
    groups.each do |type, grouped_events|
      puts "#{type}: #{grouped_events.size}x"
    end
    puts
  end
end

Ruby Capture Stdout Without $stdout

Rake’s sh for example cannot be captured with the normal $stdount = StringIO.new trick, but using some old ActiveSupport code that uses reopen with a Temfile seems to do the trick.

# StringIO does not work with rubies `system` call that `sh` uses under the hood, so using Tempfile + reopen
# https://grosser.it/2018/11/23/ruby-capture-stdout-without-stdout
def capture_stdout
  old_stdout = STDOUT.dup
  Tempfile.open('tee_stdout') do |capture|
    STDOUT.reopen(capture)
    STDOUT.sync = true
    yield
    capture.rewind
    capture.read
  end
ensure
  STDOUT.reopen(old_stdout)
end

Finding latest AWS ECR image in all repositories

We have a lot of ECR repos that needed to be taken down, so I ran this little aws-cli + ruby + jq script to sanity check if all the images are old

require 'json'
repos = `aws ecr describe-repositories | jq .repositories[].repositoryName --raw-output`.split("\n")
pushed = repos.map do |repo|
  out = `aws ecr describe-images --repository-name #{repo} --output json --query 'sort_by(imageDetails,& imagePushedAt)[*]'`
  print '.'
  next unless image = JSON.parse(out).first
  Integer(image["imagePushedAt"])
end.compact
puts Time.at(pushed.max)

Ruby: Waiting for one of multiple threads to finish

We build a small project that watches multiple metrics until one of them finds something, I found ThreadsWait in the stdlib and it was easy to use it. Also added error re-raising so the threads do not die silently and cleanup.

require 'thwait'

def wait_for_first_block_to_complete(*blocks)
  threads = blocks.map do |block|
    Thread.new do
      block.call
    rescue StandardError => e
      e
    end
  end
  waiter = ThreadsWait.new(*threads)
  value = waiter.next_wait.value
  threads.each(&:kill)
  raise value if value.is_a?(StandardError)
  value
end

wait_for_first_block_to_complete(
  -> { sleep 5 }, -> { sleep 1 }, -> { sleep 2 }
) # will stop after 1 second

 

Reading journald kernel logs from inside a kubernetes pod

We wanted a watcher that alerts us when bad kernel things happen and were able to deploy that as a DaemonSet using Kubernetes 🙂

  • Use a Debian base image (for example ruby:2.5-stretch)
  • Run as root user or as user that can read systemd logs like systemd-journal
  • Mount /run/log/journal
    spec:
      containers:
      - name: foo
        ...
        volumeMounts:
        - name: runlog
          mountPath: /run/log/journal
          readOnly: true
      volumes:
      - name: runlog
        hostPath:
          path: /run/log/journal
  • Use systemd-journal to read the logs
    require 'systemd/journal'
    journal = Systemd::Journal.new
    journal.seek(:tail)
    journal.move_previous
    journal.filter(syslog_identifier: 'kernel')
    journal.watch { |entry| puts entry.message }