I work at a mostly AWS shop, and while we still have services on raw EC2, nearly all of our new development runs on Amazon ECS in Docker. I like Docker because it provides a unified unit of operation (a container) that makes it easy to build shared tooling regardless of language or application. It also lets you run your applications locally in the same environment they run in remotely, and containers start and deploy fast.

However, many services share an ECS node in a cluster. Tools like Chaos Monkey can run around turning whole nodes off, but during working hours it’d be nice to have a little less impact while still being able to stress our recovery and alerting.

This is actually pretty easy with a little Docker container we call The Beast. The Beast runs as an ECS scheduled task every 15-30 minutes from 10am to 3pm PST (we have teams on both the east and west coasts) and kills a random container on whatever cluster node it lands on. It doesn’t do a lot of damage, but it does test your fault tolerance.

Here’s The Beast:

  
    #!/usr/bin/env ruby

    require 'json'

    class Hash
      # Remove the given keys from this hash and return them as a new hash.
      def extract_subhash(*extract)
        h2 = select { |key, _value| extract.include?(key) }
        delete_if { |key, _value| extract.include?(key) }
        h2
      end
    end

    puts "UNLEASH THE BEAST!"

    ignore_image_regex = ENV["IGNORED_REGEX"]

    # `docker ps` emits one JSON object per line; join them into a JSON array.
    raw = "[#{`docker ps --format '{{json .}}'`.lines.join(',')}]"

    # Keep only the container ID and image name for each running container.
    running_services = JSON.parse(raw).map { |val| val.extract_subhash("ID", "Image") }

    puts running_services

    puts "Ignoring regex #{ignore_image_regex}"

    # Drop every container whose image matches the ignore regex.
    if ignore_image_regex && !ignore_image_regex.empty?
      running_services.delete_if { |value| /#{ignore_image_regex}/ === value["Image"] }
    end

    if running_services.empty?
      puts "No services to kill"
      exit(0)
    end

    puts "Bag of services to kill: #{running_services.inspect}"

    to_kill = running_services.sample

    puts "Killing #{to_kill.inspect}"

    `docker kill #{to_kill["ID"]}`

    quips = [
      "Dont fear the reaper",
      "BEAST MODE",
      "You been rubby'd",
      "Pager doody"
    ]

    puts quips.sample

The Beast supports a regex of ignored images, so critical images (like the ecs_agent and The Beast itself) can be marked as ignored. You can also update the regex to have The Beast skip other services temporarily.
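To make the ignore behavior concrete, here is a minimal standalone sketch of the same filtering step run against canned data instead of a live Docker daemon. The container IDs, the image names, and the `killable` helper are made up for illustration; only the regex value comes from our task definition.

```ruby
require 'json'

# Sample lines shaped like `docker ps --format '{{json .}}'` output.
# The IDs and image names here are invented for the example.
SAMPLE_PS_OUTPUT = <<~LINES
  {"ID":"a1b2c3","Image":"example/ecs_agent:latest"}
  {"ID":"d4e5f6","Image":"example/the-beast:1.0"}
  {"ID":"0718ab","Image":"example/web-api:42"}
LINES

# Hypothetical helper: returns only the containers whose image does NOT
# match the ignore pattern. A nil or empty pattern ignores nothing.
def killable(containers, ignored_regex)
  return containers if ignored_regex.nil? || ignored_regex.empty?

  pattern = Regexp.new(ignored_regex)
  containers.reject { |container| container["Image"] =~ pattern }
end

containers = SAMPLE_PS_OUTPUT.lines.map { |line| JSON.parse(line) }
candidates = killable(containers, ".*ecs_agent.*|.*the-beast.*")

puts candidates.map { |container| container["Image"] }.inspect
# prints ["example/web-api:42"]
```

Running something like this against the image names you actually see in `docker ps` is a cheap way to check that your regex protects what you think it protects before you enable the schedule.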

We deploy The Beast with Terraform. The general task definition looks like:

  
    [
      {
        "name": "the-beast",
        "image": "${image}:${version}",
        "cpu": 10,
        "memory": 50,
        "essential": true,
        "logConfiguration": {
          "logDriver": "awslogs",
          "options": {
            "awslogs-group": "${log_group}",
            "awslogs-region": "${region}",
            "awslogs-stream-prefix": "the-beast"
          }
        },
        "environment": [
          {
            "name": "IGNORED_REGEX", "value": ".*ecs_agent.*|.*the-beast.*"
          }
        ],
        "mountPoints": [
          { "sourceVolume": "docker-socket", "containerPath": "/var/run/docker.sock", "readOnly": true }
        ]
      }
    ]

And the Terraform:

  
    resource "aws_ecs_task_definition" "beast_rule" {
      family                = "beast-service"
      container_definitions = "${data.template_file.task_definition.rendered}"

      volume {
        name      = "docker-socket"
        host_path = "/var/run/docker.sock"
      }
    }

    data "template_file" "task_definition" {
      template = "${file("${path.module}/files/task-definition.tpl")}"

      vars {
        version   = "${var.beast-service["version"]}"
        region    = "${var.region}"
        image     = "${data.terraform_remote_state.remote_env_state.docker_namespace}/the-beast"
        log_group = "${var.log-group}"
      }
    }

    resource "aws_cloudwatch_event_target" "beast_scheduled_job_target" {
      target_id = "${aws_ecs_task_definition.beast_rule.family}"
      rule      = "${aws_cloudwatch_event_rule.beast_scheduled_job.name}"
      arn       = "${data.aws_ecs_cluster.default_cluster.id}"
      role_arn  = "${data.aws_iam_role.ecs_service_role.arn}"

      ecs_target {
        task_count          = 1
        task_definition_arn = "${aws_ecs_task_definition.beast_rule.arn}"
      }
    }

    resource "aws_cloudwatch_event_rule" "beast_scheduled_job" {
      name                = "${aws_ecs_task_definition.beast_rule.family}"
      description         = "Beast kills a container every 30 minutes from 10AM to 3PM PST Mon-Thu"
      schedule_expression = "cron(0/30 18-23 ? * MON-THU *)"
      is_enabled          = false
    }

    resource "aws_cloudwatch_log_group" "beast_log_group" {
      name = "${var.log-group}"
    }
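CloudWatch Events cron expressions are evaluated in UTC, so the `18-23` hour range is what maps onto our Pacific working hours. As a rough sanity check, this small sketch converts the schedule to local run times. The `local_run_times` helper is made up for illustration, and the `-8` offset assumes standard time (during daylight time the offset is `-7`, which shifts every run an hour later on the wall clock).

```ruby
# Hours and minutes taken from cron(0/30 18-23 ? * MON-THU *).
UTC_HOURS = (18..23).to_a
MINUTES = [0, 30]
PST_OFFSET_HOURS = -8 # assumption: standard time; PDT would be -7

# Hypothetical helper: every (hour, minute) pair shifted by the UTC offset
# and formatted as a zero-padded HH:MM wall-clock string.
def local_run_times(utc_hours, minutes, offset_hours)
  utc_hours.product(minutes).map do |hour, minute|
    format("%02d:%02d", (hour + offset_hours) % 24, minute)
  end
end

runs = local_run_times(UTC_HOURS, MINUTES, PST_OFFSET_HOURS)
puts "First run: #{runs.first}, last run: #{runs.last}, runs per day: #{runs.length}"
# prints First run: 10:00, last run: 15:30, runs per day: 12
```

Note that the last run lands at 15:30 rather than 15:00, because the `18-23` range includes both half-hour slots of the 23:00 UTC hour.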

We log to CloudWatch, so we can also correlate back whether a service was killed by The Beast. It’s important to note that you need to mount the Docker socket for The Beast to work, since it uses the docker CLI to talk to the host’s Docker daemon. A sample Dockerfile looks like:

  
    FROM ubuntu:xenial

    # Ruby to run the script, plus the docker CLI it shells out to.
    RUN apt-get update && apt-get install -y ruby-full docker.io build-essential

    RUN gem install json

    COPY beast.rb /app/beast.rb

    RUN chmod +x /app/beast.rb

    ENTRYPOINT ["/app/beast.rb"]
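Since everything depends on the socket mount described above, one optional guard is to check for the socket before doing anything else. This is only a sketch of the idea and is not part of the script as we run it. The `docker_socket_available?` helper is a made-up name, and the path is the `containerPath` from the task definition.

```ruby
# Path taken from the task definition's mountPoints entry.
DOCKER_SOCKET = "/var/run/docker.sock"

# Hypothetical helper: true only when the path exists and is a Unix socket.
def docker_socket_available?(path = DOCKER_SOCKET)
  File.socket?(path)
end

unless docker_socket_available?
  warn "Docker socket not found at #{DOCKER_SOCKET}; is the volume mounted?"
end
```

A warning like this in the CloudWatch logs is easier to diagnose than a failed `docker ps` call further down the script.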

It’s bare bones, but it works, and the stupid quips at the end always make me chuckle.