Listing Unmuted Datadog Alerts

Datadog's UI for alerting monitors also shows muted ones, with no option to filter them out, which adds overhead and confusion when trying to track down what exactly is down.

So we added an alerts task to kennel that lists all unmuted alerts and how long they have been alerting. It also shows alerts that have no-data warnings, even though the Datadog UI shows them as not alerting.

bundle exec rake kennel:alerts TAG=team:my-team
Downloading ... 5.36s
Foo certs will expire soon🔒
https://app.datadoghq.com/monitors/123
Ignored cluster:pod1,server:10.215.225.122 39:22:00
Ignored cluster:pod12,server:10.218.176.123 31:41:00
Ignored cluster:pod12,server:10.218.176.123 31:41:00


Foobar Errors (Retry Limit Exceeded)🔒
https://app.datadoghq.com/monitors/1234
Alert cluster:pod2 19:05:16
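
The selection logic can be sketched roughly like this. This is not kennel's actual code: the MonitorStatus struct, its fields, and the state names are a simplified stand-in made up for illustration, not Datadog's real payload.

```ruby
# Hypothetical, simplified monitor record; not Datadog's real payload.
MonitorStatus = Struct.new(:name, :state, :muted, :alerting_since, keyword_init: true)

# States treated as "alerting" in this sketch; "No Data" is included on purpose.
ALERTING_STATES = ["Alert", "No Data"].freeze

# Keep only monitors that are alerting and not muted.
def unmuted_alerts(monitors)
  monitors.select { |m| !m.muted && ALERTING_STATES.include?(m.state) }
end

# Formats a duration in whole seconds as HH:MM:SS (hours may exceed 24).
def format_duration(seconds)
  format("%02d:%02d:%02d", seconds / 3600, seconds / 60 % 60, seconds % 60)
end

now = Time.now
monitors = [
  MonitorStatus.new(name: "Foobar Errors", state: "Alert", muted: false, alerting_since: now - 68_716),
  MonitorStatus.new(name: "Muted thing", state: "Alert", muted: true, alerting_since: now - 60),
  MonitorStatus.new(name: "Healthy thing", state: "OK", muted: false, alerting_since: nil)
]

unmuted_alerts(monitors).each do |m|
  puts "#{m.name} #{format_duration((now - m.alerting_since).to_i)}"
end
```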

Finding latest AWS ECR image in all repositories

We had a lot of ECR repos that needed to be taken down, so I ran this little aws-cli + ruby + jq script to sanity-check that all the images are old:

require 'json'
repos = `aws ecr describe-repositories | jq .repositories[].repositoryName --raw-output`.split("\n")
pushed = repos.map do |repo|
  out = `aws ecr describe-images --repository-name #{repo} --output json --query 'sort_by(imageDetails,& imagePushedAt)[*]'`
  print '.'
  # images are sorted oldest-first, so the most recently pushed one is last
  next unless image = JSON.parse(out).last
  Integer(image["imagePushedAt"])
end.compact
puts Time.at(pushed.max)
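
One caveat: Integer(image["imagePushedAt"]) relies on the CLI printing the timestamp as epoch seconds. As far as I know, AWS CLI v2 prints ISO 8601 strings by default, and Integer() would raise on those. A small sketch that accepts both (parse_pushed_at is a made-up helper name):

```ruby
require 'time'

# Hypothetical helper: turns either epoch seconds (number) or an
# ISO 8601 string into a Time.
def parse_pushed_at(value)
  value.is_a?(Numeric) ? Time.at(value) : Time.iso8601(value)
end

puts parse_pushed_at(1523987678.0).utc
puts parse_pushed_at("2018-04-17T17:54:38+00:00").utc
```

In the script above this would replace the Integer(...) call, and the last line would become puts pushed.max.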

Ruby: Waiting for one of multiple threads to finish

We built a small project that watches multiple metrics until one of them finds something. I found ThreadsWait in the stdlib, and it was easy to use. I also added error re-raising, so the threads do not die silently, and cleanup of the remaining threads.

require 'thwait'

def wait_for_first_block_to_complete(*blocks)
  threads = blocks.map do |block|
    Thread.new do
      block.call
    rescue StandardError => e
      e
    end
  end
  waiter = ThreadsWait.new(*threads)
  value = waiter.next_wait.value
  threads.each(&:kill)
  raise value if value.is_a?(StandardError)
  value
end

wait_for_first_block_to_complete(
  -> { sleep 5 }, -> { sleep 1 }, -> { sleep 2 }
) # will stop after 1 second
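
As far as I know, thwait stopped shipping with Ruby in 2.7 and now has to be installed as a gem. Here is a sketch of the same idea using only the built-in Queue. It is a swapped-in technique rather than ThreadsWait, and first_result_via_queue is a made-up name:

```ruby
# Runs all blocks in threads and returns the value of the first one to
# finish, re-raising its error if it failed. Remaining threads are killed.
def first_result_via_queue(*blocks)
  results = Queue.new
  threads = blocks.map do |block|
    Thread.new do
      results << [:ok, block.call]
    rescue StandardError => e
      results << [:error, e]
    end
  end
  status, value = results.pop # blocks until the first thread reports back
  raise value if status == :error
  value
ensure
  threads&.each(&:kill)
end

puts first_result_via_queue(-> { sleep 2; :slow }, -> { sleep 0.1; :fast })
```

The queue carries a status tag along with the value, so a block that legitimately returns an exception object is not mistaken for a failure.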

 

Reading journald kernel logs from inside a kubernetes pod

We wanted a watcher that alerts us when bad things happen in the kernel, and were able to deploy it as a Kubernetes DaemonSet 🙂

  • Use a Debian base image (for example ruby:2.5-stretch)
  • Run as root or as a user that can read systemd logs (for example a member of the systemd-journal group)
  • Mount /run/log/journal
    spec:
      containers:
      - name: foo
        ...
        volumeMounts:
        - name: runlog
          mountPath: /run/log/journal
          readOnly: true
      volumes:
      - name: runlog
        hostPath:
          path: /run/log/journal
  • Use the systemd-journal gem to read the logs
    require 'systemd/journal'
    journal = Systemd::Journal.new
    journal.seek(:tail)
    journal.move_previous
    journal.filter(syslog_identifier: 'kernel')
    journal.watch { |entry| puts entry.message }
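
To turn the watcher into an alerter, the messages need some kind of filter. A minimal sketch, assuming a few illustrative patterns (these are examples, not the list we actually use):

```ruby
# Illustrative patterns only; tune them to what matters on your hosts.
BAD_KERNEL_PATTERNS = [
  /out of memory/i,
  /oom-killer/i,
  /segfault/i,
  /I\/O error/i
].freeze

# True when a kernel log message matches any of the patterns above.
def bad_kernel_message?(message)
  BAD_KERNEL_PATTERNS.any? { |pattern| pattern.match?(message) }
end

puts bad_kernel_message?("Out of memory: Kill process 1234 (ruby)")
puts bad_kernel_message?("eth0: link becomes ready")

# Inside the watch block this could be used as (alert is a hypothetical method):
#   journal.watch { |entry| alert(entry.message) if bad_kernel_message?(entry.message) }
```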

Running multiple commands in docker in parallel

Went through foreman/goreman/forego, and each of them was missing at least one of:
– an option to not print the process name
– killing all commands when one finishes
– sending signals to all children

But this does:

## Install parallel with `done` support
RUN \
  curl -sL http://ftp.gnu.org/gnu/parallel/parallel-20180422.tar.bz2 > /tmp/parallel.tar.bz2 && \
  cd /tmp && tar -xvjf /tmp/parallel.tar.bz2 && cd parallel* && \
  ./configure && make install && rm -rf /tmp/parallel*

# stream output and stop all commands if any of them finish/fail
parallel --no-notice --ungroup --halt 'now,done=1' {1} ::: 'sleep 10' 'sleep 20'
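
If Ruby is already in the image, the "stop everything when one finishes" part can also be sketched without parallel. This is a different technique from the one above (run_until_first_exit is a made-up name), and it does not forward signals from the parent to the children:

```ruby
# Starts every command in its own process group, waits for the first one
# to exit, then terminates the rest. Returns the first exit status.
def run_until_first_exit(*commands)
  pids = commands.map { |command| Process.spawn(command, pgroup: true) }
  _finished_pid, status = Process.wait2 # waits for any child process
  status
ensure
  (pids || []).each do |pid|
    begin
      Process.kill("-TERM", pid) # negative signal: target the whole group
    rescue Errno::ESRCH, Errno::EPERM
      # already gone
    end
    begin
      Process.wait(pid)
    rescue Errno::ECHILD
      # already reaped
    end
  end
end

status = run_until_first_exit("sleep 0.2", "sleep 5")
puts status.success?
```

Output is streamed as-is because the children inherit stdout and stderr.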