Kubernetes Changelog from Audit log

Often we want to ask “what exactly changed about this resource?”, especially during or after an incident.
The usual answer is “check the audit log”.
But the audit log is verbose and hard to scan, so here is a Ruby rake task that parses it and prints a readable diff. (Customize it to read from the log source of your choice.)
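
Stripped of the log plumbing, the idea is to flatten each version of the object into dotted keys and compare them. Here is a minimal sketch of that idea using only the standard library. `flatten_keys` and `simple_diff` are hypothetical names for illustration; the full task below uses Hashdiff and Kennel for the real diffing and printing.

```ruby
# flatten nested hashes into dotted keys, e.g. {spec: {replicas: 2}} => {"spec.replicas" => 2}
# (arrays are left as-is, same as in the full task)
def flatten_keys(hash, prefix = nil)
  hash.each_with_object({}) do |(key, value), flat|
    path = [prefix, key].compact.join(".")
    if value.is_a?(Hash)
      flat.merge!(flatten_keys(value, path))
    else
      flat[path] = value
    end
  end
end

# hypothetical helper: [path, old, new] for every key whose value differs
# (a missing key and a nil value look the same here)
def simple_diff(before, after)
  old_flat = flatten_keys(before)
  new_flat = flatten_keys(after)
  (old_flat.keys | new_flat.keys).sort.filter_map do |path|
    [path, old_flat[path], new_flat[path]] if old_flat[path] != new_flat[path]
  end
end

before = { metadata: { name: "web" }, spec: { replicas: 2, paused: false } }
after  = { metadata: { name: "web" }, spec: { replicas: 3 } }
simple_diff(before, after).each do |path, old, new|
  puts "#{path}: #{old.inspect} -> #{new.inspect}"
end
# spec.paused: false -> nil
# spec.replicas: 2 -> 3
```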

require 'uri'
require 'cgi'
require 'time'
require 'json'
require 'hashdiff' # gem install hashdiff
require 'kennel' # gem install kennel

class Logs
  class << self
    # does not flatten arrays, but we don't need this here
    def flatten_hash(hash)
      hash.each_with_object({}) do |(k, v), h|
        if v.is_a? Hash
          flatten_hash(v).each do |h_k, h_v|
            h["#{k}.#{h_k}".to_sym] = h_v
          end
        else
          h[k] = v
        end
      end
    end

    def clean_for_diff(object, ignore_status:)
      # datadog turns labels like metadata.labels.foo.bar into a nested foo: bar hash
      object.replace flatten_hash object

      # general
      object.delete :"metadata.annotations.deployment.kubernetes.io/revision"
      object.delete :"metadata.annotations.kubectl.kubernetes.io/last-applied-configuration"
      object.delete :"metadata.generation"
      object.delete :"metadata.managedFields"
      object.delete :"metadata.resourceVersion"
      object.delete :"spec.template.metadata.creationTimestamp"

      # status
      if ignore_status
        object.delete_if { |k, _| k.start_with? "status" }
      else
        object.delete :"status.observedGeneration"
      end

      # always return the cleaned object (Hash#delete would return the deleted value)
      object
    end
  end
end

namespace :logs do
  desc "show change history for a given resource by parsing the audit log CLUSTER= RESOURCE= [NAMESPACE=] NAME= [DAYS=7] [STATUS=ignore|include]"
  task :changes do # task name is a placeholder, pick your own
    cluster = ENV.fetch("CLUSTER")
    resource = ENV.fetch("RESOURCE")
    name = ENV.fetch("NAME")
    namespace = ENV["NAMESPACE"]
    ignore_status = ((ENV["STATUS"] || "ignore") == "ignore")
    days = Integer(ENV["DAYS"] || "7")

    # get current version to be able to diff the latest update
    result = `kubectl --context #{cluster} get #{resource} #{name} #{"-n #{namespace}" if namespace} -o json --ignore-not-found`
    raise "kubectl failed" unless $?.success?
    if result == ""
      warn "Resource not found, assuming it was deleted"
      current = nil
    else
      current = Logs.clean_for_diff(JSON.parse(result, symbolize_names: true), ignore_status:)
    end

    # build log url
    url = <whatever your log system is>

    # say what we are looking at
    warn "Inspecting #{days} days of logs #{ignore_status ? "ignoring" : "including"} status changes."
    warn url

    # produce diff from logs
    verb_colors = { "update" => :yellow, "delete" => :red, "patch" => :cyan, "create" => :green }
    printer = Kennel::AttributeDiffer.new
    list_logs(url) do |line| # build this method for whatever your log system is
      status = line.dig(:attributes, :http, :status_code)
      next if status >= 300

      # print what happened
      verb = line.dig(:attributes, :verb)
      time = line.dig(:attributes, :requestReceivedTimestamp).sub(/\..*/, "")
      user = line.dig(:attributes, :user, :username)
      puts(Kennel::Console.color(verb_colors.fetch(verb), "#{time} #{verb} by #{user}"))
      next if verb == "delete"

      # print diff
      previous = Logs.clean_for_diff(line.dig(:attributes, :responseObject), ignore_status:)
      unless current # support looking at deleted resources
        current = previous
        next
      end
      diff = Hashdiff.diff(previous, current, use_lcs: false, strict: false, similarity: 1)
      diff.each { |l| puts printer.format(*l) }
      current = previous
    end
  end
end

Running the task prints one colored line per change (time, verb, user), followed by a diff of the attributes that changed.
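
The task leaves `list_logs` to the reader. As a starting point, here is a minimal sketch that reads newline-delimited JSON events from a local file instead of a log system. The file format, the `path` argument, and the newest-first ordering are all assumptions: the loop in the task starts from the current state and works toward older entries, which is how I read it, but the post does not say so explicitly.

```ruby
require 'json'

# Hypothetical stand-in for list_logs: reads one JSON audit event per line
# from a local file and yields the events newest first.
def list_logs(path)
  events = File.readlines(path, chomp: true).reject(&:empty?).map do |line|
    JSON.parse(line, symbolize_names: true)
  end

  # assumes all timestamps share the same UTC RFC 3339 format,
  # so they sort correctly as plain strings
  events.sort_by! { |event| event.dig(:attributes, :requestReceivedTimestamp).to_s }
  events.reverse_each { |event| yield event }
end
```

With this in place, `url` in the task would simply be the path of an exported log file.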

Verify Pagerduty reaches On-Call by Cron

We had a few incidents where on-call devs missed their calls because of various spam-blocking setups or “do not disturb” settings.
We now run a small service that test-notifies everyone once a month to make sure notifications go through. Notifications go out shortly before their ‘do not disturb’ stops so we do not wake them in the middle of the night, but still have a realistic situation.
Our setup has more logging/stats etc, but it goes something like this:

# configure user schedule
require 'yaml'
users = YAML.load <<~YAML
- name: "John Doe"
  id: ABCD
#  cron: "* * * * * America/Los_Angeles" # every minute ... for local testing
  cron: "55 6 * * 2#1 America/Los_Angeles" # every first Tuesday of the month at 6:55am
# ... more users here
YAML

# code to notify users
require 'json'
require 'faraday'
def create_test_incident(user)
  connection = Faraday.new
  response = nil
  2.times do
    response = connection.post do |req|
      req.url "https://api.pagerduty.com/incidents"
      req.headers['Content-Type'] = 'application/json'
      req.headers['Accept'] = 'application/vnd.pagerduty+json;version=2'
      req.headers['From'] = 'realusers@email.com' # incident owner 
      req.headers['Authorization'] = "Token token=#{ENV.fetch("PAGERDUTY_TOKEN")}"
      req.body = {
        incident: {
          type: "incident",
          title: "Pagerduty Tester: Incident for #{user.fetch("name")}, press resolve",
          service: {
            id: ENV.fetch("SERVICE_ID"),
            type: "service_reference"
          },
          assignments: [{
            assignee: {
              id: user.fetch("id"),
              type: "user_reference"
            }
          }]
        }
      }.to_json
    end
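    # stop after the first answer that is not rate-limited, so a success never creates a second incident
    break unless response.status == 429 # pagerduty rate-limits to 6 incidents/min/service
    sleep 60
  end
  raise "Request failed #{response.status} -- #{response.body}" if response.status >= 300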
  JSON.parse(response.body).fetch("incident").fetch("id")
end

# run on a schedule (no threading / forking)
require 'serial_scheduler'
require 'fugit'
scheduler = SerialScheduler.new
users.each do |user|
  scheduler.add("Notify #{user.fetch("name")}", cron: user.fetch("cron"), timeout: 10) do
    user_id = user.fetch("id")
    incident_id = create_test_incident(user)
    puts "Created incident for #{user_id} https://#{ENV.fetch('SUBDOMAIN')}.pagerduty.com/incidents/#{incident_id}"
  rescue StandardError => e
    puts "Creating incident for #{user_id} failed #{e}"
  end
end
scheduler.run
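
The rate-limit handling in `create_test_incident` boils down to "retry once after a pause, then fail loudly". Here is a minimal sketch of that pattern on its own, so it can be tried without hitting the API. `with_rate_limit_retry`, its parameters, and the fake responses are hypothetical and only need to respond to `status` and `body`, like the Faraday response above.

```ruby
# Hypothetical helper: calls the block up to `attempts` times,
# pausing between attempts while the answer is 429 (rate-limited).
def with_rate_limit_retry(attempts: 2, pause: 60)
  response = nil
  attempts.times do |i|
    response = yield
    break unless response.status == 429
    sleep pause if i < attempts - 1 # no point waiting after the final attempt
  end
  raise "Request failed #{response.status} -- #{response.body}" if response.status >= 300
  response
end

# simulate one rate-limited answer followed by a success
fake_response = Struct.new(:status, :body)
answers = [fake_response.new(429, "slow down"), fake_response.new(201, "{}")]
response = with_rate_limit_retry(pause: 0) { answers.shift }
puts response.status # 201
```

Breaking out of the loop on the first answer that is not a 429 matters: without it, a successful first request would be followed by a second one and page the user twice.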

Rails Sum ActiveSupport Instrument Times

We wanted to show the sum of multiple ActiveSupport notifications during a long process, so here is a tiny snippet to do that. A more advanced version is used in Samson.

# sum activesupport notification duration for given metrics
def time_sum(metrics, &block)
  sum = Hash.new(0.0)
  add = ->(m, s, f, *) { sum[m] += 1000 * (f - s) }
  metrics.inject(block) do |inner, m|
    -> { ActiveSupport::Notifications.subscribed(add, m, &inner) }
  end.call
  sum
end

time_sum(["sql.active_record"]) { 10.times { User.first } }
# {"sql.active_record" => 10.3}
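
The `inject` call is the subtle part: it wraps the block in one subscription per metric, so all subscriptions are active while the block runs. Here is a minimal sketch of that nesting without Rails. The `subscribed` lambda is a hypothetical stand-in for `ActiveSupport::Notifications.subscribed` that only records when it starts and stops.

```ruby
log = []

# hypothetical stand-in: "subscribe", run the inner block, then "unsubscribe"
subscribed = lambda do |name, &inner|
  log << "subscribe #{name}"
  inner.call
  log << "unsubscribe #{name}"
end

work = -> { log << "work" }

# same shape as time_sum: each step wraps the previous callable in another layer
nested = %w[a b].inject(work) do |inner, name|
  -> { subscribed.call(name, &inner) }
end
nested.call

puts log
# subscribe b
# subscribe a
# work
# unsubscribe a
# unsubscribe b
```

The last metric ends up as the outermost layer, which does not matter for summing durations since every layer is active during the work.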