Varnish died in peak hours, so we investigated and found out that varnish died only when the esi backend died. We reproduced this scenario by hitting our test-server, that had a simple sleep-10 ttl-0 esi action, with ab.
Error
And look what we found in /var/log/syslog …
Child (10211) Panic message: Missing errorhandling code in HSH_Prepare(), cache_hash.c line 188: Condition((p) != 0) not true. thread = (cache-worker)sp = 0x7fed9e312008 { fd = 14, id = 14, xid = 177614113, client = 127.0.0.1:36835, step = STP_LOOKUP, handling = hash, ws = 0x7fed9e312078 { overflow id = “sess”, {s,f,r,e} = {0x7fed9e312808,,+16364,(nil),+16384}, }, worker = 0x47002bc0 { }, vcl = { srcname = { “input”, “Default”, }, }, },
Searching aroud the interwebs suggested that session workspace should be increased, which helped a bit. Decreasing esi timeout (first_byte_timeout, not connection_timeout) and increasing the threads(min or pool, just getting more) also improved the situation.
Semi-Solution
This is for 2.0.4, there are some esi changes/fixes in trunk, so you might want to retest before simply using these settings in a newer version (e.g. sess_workspace is pretty high and will use more ram than necessary) (do not use 2.0.5 with esi)
Solution
Using SSI infront of Varnish, we could keep almost all configuration the same and instantly every problem was solved!
(we could also remove some unnecessary logic from Varnishs VCL, that normally handled adding/removing cookies after/before ESI)