Splitting 1 big CSV file into multiple smaller ones without parsing it

How to turn a 300 MB CSV into 3 x 100 MB?
Cut and slice with head/tail and add the header on top of each part! This works on raw lines, so it assumes that no quoted field contains a line break.
Code

require 'rake' # for `sh` helper
require 'shellwords' # for String#shellescape

# split giga-csv into n smaller files, each starting with the original header
def split_csv(original, file_count)
  header_lines = 1
  lines = Integer(`wc -l < #{original.shellescape}`) - header_lines # data rows only
  lines_per_file = (lines / file_count.to_f).ceil
  header = `head -n #{header_lines} #{original.shellescape}`

  start = header_lines + 1 # 1-based line number of the first data row
  file_count.times.map do |i|
    file = "#{original}-#{i}.csv"

    File.write(file, header)
    # `tail -n +N` prints from line N onward, `head` keeps one chunk of it
    sh "tail -n +#{start} #{original.shellescape} | head -n #{lines_per_file} >> #{file.shellescape}"

    start += lines_per_file
    file
  end
end
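If head/tail are not available, the same cut-by-raw-lines idea can be sketched in plain Ruby. This is only an illustration under the same assumptions (one header line, no line breaks inside quoted fields); the name `split_csv_ruby` is made up for this example and is not part of the original snippet.

```ruby
# Hypothetical pure-Ruby variant: split a CSV into file_count parts by raw lines.
def split_csv_ruby(original, file_count)
  header = nil
  data_lines = 0
  # First pass: remember the header and count data rows without loading the file.
  File.foreach(original).with_index do |line, index|
    if index.zero?
      header = line
    else
      data_lines += 1
    end
  end
  lines_per_file = [(data_lines / file_count.to_f).ceil, 1].max

  files = Array.new(file_count) { |i| "#{original}-#{i}.csv" }
  outputs = files.map { |file| File.open(file, 'w') }
  begin
    outputs.each { |out| out.write(header) }
    # Second pass: data row k (0-based) goes to part k / lines_per_file.
    File.foreach(original).with_index do |line, index|
      next if index.zero?
      outputs[(index - 1) / lines_per_file].write(line)
    end
  ensure
    outputs.each(&:close)
  end
  files
end
```

Calling `split_csv_ruby('big.csv', 3)` would write `big.csv-0.csv` to `big.csv-2.csv` next to the original and return their names. It reads the file twice, so the shell version is probably still the quicker choice for really large files.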

Ruby String Naive Split because split is too clever

Problem

"aaa".split('a') == []
"aaa".split('a').join('a') == ""

Standard split is often ‘clever’, but neither logical nor symmetric to join. To fix this, here is a naive alternative that behaves ‘dumb’ but logically.
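The asymmetry comes from split dropping trailing empty strings by default. A small sketch of that behavior, using only the built-in limit argument of String#split (a negative limit keeps every field):

```ruby
text = "aaa"

# Default split drops trailing empty strings, so the round trip loses data.
default_parts = text.split('a')
# A negative limit keeps every field, including the trailing empty ones.
all_parts = text.split('a', -1)

p default_parts               # => []
p all_parts                   # => ["", "", "", ""]
p all_parts.join('a') == text # => true
```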

Solution

class String
  # https://grosser.it/2011/08/28/ruby-string-naive-split-because-split-is-to-clever/
  # "    ".split(' ') == []
  # "    ".naive_split(' ') == ['','','','','']
  # "".split(' ') == []
  # "".naive_split(' ') == ['']
  def naive_split(pattern)
    pattern = /#{Regexp.escape(pattern)}/ unless pattern.is_a?(Regexp)
    result = split(pattern, -1)
    result.empty? ? [''] : result
  end
end

Ruby Hash leaves (leafs)

Get all leaves of a Hash (like recursive values).

Usage
{:x => 1, :y => {:z => 2}}.leaves == [1,2]

Code

class Hash
  # {'x'=>{'y'=>{'z'=>1,'a'=>2}}}.leaves == [1,2]
  def leaves
    leaves = []

    each_value do |value|
      value.is_a?(Hash) ? value.leaves.each{|l| leaves << l } : leaves << value
    end

    leaves
  end
end
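If reopening Hash is not wanted, the same recursion can be sketched as a standalone method. The name `hash_leaves` is hypothetical and only used for this illustration; like the version above, it treats anything that is not a Hash (including Arrays) as a leaf.

```ruby
# Hypothetical helper: same idea as Hash#leaves, but without reopening Hash.
def hash_leaves(hash)
  hash.each_value.flat_map do |value|
    # Wrap non-hash values so flat_map does not flatten Array leaves.
    value.is_a?(Hash) ? hash_leaves(value) : [value]
  end
end

nested = { x: 1, y: { z: 2, list: [3, 4] }, empty: {} }
p hash_leaves(nested) # => [1, 2, [3, 4]]
```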

Remove default SSH host keys before publishing an AMI

AMIs that share SSH host key pairs with other public AMIs will be made private by Amazon to prevent man-in-the-middle attacks, so always remove the SSH host key pairs before publishing. They will be regenerated with new unique keys automatically.

rm /etc/ssh/ssh_host_dsa_key
rm /etc/ssh/ssh_host_dsa_key.pub
rm /etc/ssh/ssh_host_key
rm /etc/ssh/ssh_host_key.pub
rm /etc/ssh/ssh_host_rsa_key
rm /etc/ssh/ssh_host_rsa_key.pub
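When the cleanup is part of a Ruby build script, the same removal can be sketched as below. The helper name `remove_host_keys` and its `dir` parameter are assumptions made for this example; it only touches the six files listed above, and newer OpenSSH versions may create additional key types that would need to be added.

```ruby
require 'fileutils'

# Key files named in the list above; newer OpenSSH versions may create others.
HOST_KEY_FILES = %w[
  ssh_host_dsa_key ssh_host_dsa_key.pub
  ssh_host_key ssh_host_key.pub
  ssh_host_rsa_key ssh_host_rsa_key.pub
].freeze

# Hypothetical helper: delete the host keys in dir and return the removed paths.
def remove_host_keys(dir = '/etc/ssh')
  HOST_KEY_FILES.map { |name| File.join(dir, name) }
                .select { |path| File.exist?(path) }
                .each { |path| FileUtils.rm_f(path) }
end
```

Run as root, `remove_host_keys` with no argument would work on /etc/ssh; passing another directory makes it easy to try out on a copy first.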

Ruby Array.diff(other) difference between 2 Arrays

A symmetric difference (^) is defined on Set, but not on Array, so we patch it in… (thanks to reto)

Usage
[1,2] ^ [2,3,4] == [1,3,4]

Code

class Array
  def ^(other)
    result = dup
    other.each{|e| result.include?(e) ? result.delete(e) : result.push(e) }
    result
  end unless method_defined?(:^)
  alias diff ^ unless method_defined?(:diff)
end

puts ([] ^ [1]).inspect          # [1]
puts ([1] ^ []).inspect          # [1]
puts ([1] ^ [2]).inspect         # [1, 2]
puts ([] ^ []).inspect           # []
puts ([1,1] ^ [1,1,2,2]).inspect # [1]

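For arrays without duplicates, the same could be done with (self | other) - (self & other), but that allocates two intermediate arrays and may be less performant for small inputs. With duplicates the two approaches give different results.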
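The set-operator form mentioned above can be sketched as a standalone helper, which also makes the duplicate handling visible. The name `symmetric_difference` is hypothetical; only the built-in Array#|, Array#& and Array#- are used.

```ruby
# Hypothetical helper: symmetric difference built from Array#|, Array#& and Array#-.
def symmetric_difference(a, b)
  (a | b) - (a & b)
end

p symmetric_difference([1, 2], [2, 3, 4])    # => [1, 3, 4]
# Array#| and Array#& drop duplicates first, so this is [2],
# whereas the patched ^ above returns [1] for the same input.
p symmetric_difference([1, 1], [1, 1, 2, 2]) # => [2]
```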