AWS: Rolling Replace Kubernetes Nodes without App Errors or Downtime

We do a few things to keep the apps on our clusters from crashing during rolling instance replacements, which keeps the overall noise and deployment errors low. I wanted to write them up since some of them are not very intuitive, and the combination of all of them makes for really safe cluster updates.

  • Set up a PodDisruptionBudget for every deployment so it stays available no matter which combination of nodes we take down
  • When the AWS AutoScalingGroup wants to update, before it takes down an instance, an autoscaling lifecycle hook sends a message to SQS and a separate little app we built (very simple overall) drains the instance, then signals the ASG to continue (see the sketch after this list)
  • When a new instance is booted, we taint it so no pod lands on a potentially broken node (infrastructure pods with tolerations, like the CNI, also start quicker since the node only has a few pods on it to begin with)
  • The AWS RollingUpdate waits for the “all good” signal that we send once the node is finally ready, and only then do we untaint it (a nice side effect is that nodes that never become ready stop the RollingUpdate and cause a rollback)
  • We determine readiness with custom plugins in node-problem-detector (see below), which has a few issues but overall works alright
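
A minimal sketch of what that drain app can look like, assuming the lifecycle hook publishes EC2_INSTANCE_TERMINATING events to an SQS queue; QUEUE_URL and instance_to_node (instance id to node name lookup) are placeholders for your own setup:

require "aws-sdk-sqs"
require "aws-sdk-autoscaling"
require "json"

# QUEUE_URL and instance_to_node are placeholders, fill in for your setup
sqs = Aws::SQS::Client.new
asg = Aws::AutoScaling::Client.new

loop do
  sqs.receive_message(queue_url: QUEUE_URL, wait_time_seconds: 20).messages.each do |message|
    event = JSON.parse(message.body)
    if event["LifecycleTransition"] == "autoscaling:EC2_INSTANCE_TERMINATING"
      # translate the EC2 instance id into the kubernetes node name (setup specific)
      node = instance_to_node(event["EC2InstanceId"])
      system("kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data")
      # tell the ASG it can now terminate the drained instance
      asg.complete_lifecycle_action(
        lifecycle_hook_name: event["LifecycleHookName"],
        auto_scaling_group_name: event["AutoScalingGroupName"],
        lifecycle_action_token: event["LifecycleActionToken"],
        lifecycle_action_result: "CONTINUE"
      )
    end
    sqs.delete_message(queue_url: QUEUE_URL, receipt_handle: message.receipt_handle)
  end
end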
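
The readiness checks themselves are node-problem-detector custom plugins, which are just small scripts it runs periodically: exit 0 reports healthy, exit 1 reports a problem. A check can be as tiny as this (verifying kubelet via systemd is only an illustrative example):

#!/usr/bin/env ruby
# node-problem-detector custom plugin: exit 0 = healthy, exit 1 = problem
if system("systemctl", "is-active", "--quiet", "kubelet")
  puts "kubelet is active"
else
  puts "kubelet is not active"
  exit 1
end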

Rails Sum ActiveSupport Instrument Times

We wanted to show the sum of multiple ActiveSupport notifications during a long process, so here is a tiny snippet to do that; a more advanced version is used in Samson.

# sum ActiveSupport notification durations (in ms) for the given metrics
def time_sum(metrics, &block)
  sum = Hash.new(0.0)
  # subscriber callback receives (name, start, finish, id, payload), times in seconds
  add = ->(m, s, f, *) { sum[m] += 1000 * (f - s) }
  # wrap the block in one subscription per metric, then run it
  metrics.inject(block) do |inner, m|
    -> { ActiveSupport::Notifications.subscribed(add, m, &inner) }
  end.call
  sum
end

time_sum(["sql.active_record"]) { 10.times { User.first } }
# {"sql.active_record" => 10.3}

Validating ActiveRecord Backlinks exist

Whenever a new association is added, we usually also need the opposite association to ensure things get cleaned up properly during deletion.
To never forget this, and to audit the current state, these two tests can help.
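
For illustration, here is the pattern the tests enforce, using the project/stage example from the error message below:

  class Project < ActiveRecord::Base
    has_many :stages, dependent: :destroy, inverse_of: :project
  end

  class Stage < ActiveRecord::Base
    belongs_to :project, inverse_of: :stages
  end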

  def all_models
    models = Dir["app/models/**/*.rb"].grep_v(/\/concerns\//)
    models.size.must_be :>, 20 # sanity check that the glob still finds models
    models.each { |f| require f }
    ActiveRecord::Base.descendants
  end

  it "explicity defines what should happen to dependencies" do
    bad = all_models.flat_map do |model|
      model.reflect_on_all_associations.map do |association|
        next if association.is_a?(ActiveRecord::Reflection::BelongsToReflection)
        next if association.options.key?(:through)
        next if association.options.key?(:dependent)
        "#{model.name} #{association.name}"
      end
    end.compact
    assert(
      bad.empty?,
      "These associations need a :dependent defined (most likely :destroy or nil)\n#{bad.join("\n")}"
    )
  end

  it "links all dependencies both ways so dependencies get deleted reliably" do
    bad = all_models.flat_map do |model|
      model.reflect_on_all_associations.map do |association|
        next if association.name == :audits
        next if association.options.fetch(:inverse_of, false).nil? # disabled on purpose
        next if association.inverse_of
        "#{model.name} #{association.name}"
      end
    end.compact
    assert(
      bad.empty?,
      <<~TEXT
        These associations need an inverse association.
        For example project has stages and stage has project.
        If automatic connection does not work, use `:inverse_of` option on the association.
    If the inverse association is missing AND the inverse should not be destroyed when the dependency is destroyed, use `inverse_of: nil`.
        #{bad.join("\n")}
      TEXT
    )
  end