Chaos Engineering is about thoughtful, planned experiments.

An example might be a Game Day, where we get a group of engineers in a room and whiteboard out what can go wrong, and then run experiments to test our assumptions, such as what happens when your log files fill up the disk.

In general the advice is to minimise your blast radius: start with the smallest thing you can do that will teach you something about your system - a single instance, a single container - and once you’ve got faith in that, scale up and repeat.

Gremlin, which is based on the idea of chaos engineering or “resilience as a service”, is now available as a product, with early customers including Twilio, Expedia, Confluent and Remind. There are open source alternatives, but it’s the first enterprise product to our knowledge.

Examples of the Gremlins you can inject include things that consume resources (for example memory, CPU, disk space, IO), things that change the state of the host or the VM (killing processes, time travel such as a leap second or clock skew, rebooting hosts and containers), and things that change the network (DNS can’t be resolved, the service is slow, and so on).

The Gremlin client is written in Rust. Rust does deterministic memory management, and failure is a first-class citizen in Rust: fallible functions declare a result in their signatures. The service is written in Java running on AWS, and the web interface is written in Ember.

1:45 Yesterday was our big launch day - we publicly spoke about what we’ve been building.

2:00 Gremlin - our failure as a service - is now publicly available, and is enterprise ready.

2:15 We closed our series A round earlier this year - 7.5 million dollars.

3:30 Failure as a service resonates with engineers, but it is a means to an end.

3:40 What we’re really focussed on is helping companies be more resilient to failure.

4:05 We want to make software stronger.

4:30 It’s a bit like inoculation against a virus - we cause a little bit of harm, but similar to what would happen normally.

4:55 It’s about thoughtful, planned experiments that teach us how things can go wrong.

5:10 We love Chaos Monkey - it was a pioneering software release.

5:15 However, because of Chaos Monkey people think that the only way to do chaos engineering is to randomly break stuff.

5:20 We don’t think that’s the right way to go - thoughtful experiments are much better, both for reviewing where the system is weak and for training your teams.

5:35 It’s not just the technology - it’s the people as well.

5:55 A game day is a group of engineers in a room thinking about what could go wrong.

6:15 We have engineers whiteboard out their service.

6:20 Often the first iteration doesn’t have all the details - where the configuration is stored, for example.

6:30 We draw out all of the network connections and how they connect to each other.

6:35 We can then look at what would happen if we break a connection, and predict how it would behave.

6:45 Just knowing what could go wrong, and what you think could happen, helps you learn a lot.

6:50 The next step is to execute the experiment.

7:00 At Gremlin we provide the tooling and planning for the game day, but we let the engineers drive.

7:15 Often what we see is that our assumptions were incorrect and there are some subtleties as to what happens.

So you could run an experiment on, say, log files filling up a drive?

7:45 We wrote a blog post about that because we ran into this exact problem ourselves.

8:05 So we ran a game day and filled up our disks, and made sure that the log rotation happened and that things continued to operate as expected.

8:35 There’s a couple of ways this can be done.
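
To make the log-files-fill-the-disk experiment and the "minimise your blast radius" advice concrete, here is a rough Python sketch of a bounded disk-fill experiment. It is an illustration only, not Gremlin's implementation (the notes say the Gremlin client is written in Rust). The function `run_disk_fill_experiment`, the `ExperimentResult` type and the parameters `max_bytes`, `min_free_bytes`, `chunk_size` and `observe` are all hypothetical names introduced for this example.

```python
import os
import shutil
import tempfile
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExperimentResult:
    bytes_written: int       # how much filler data was actually written
    aborted: bool            # True if the free-space floor stopped the run early
    free_after_cleanup: int  # free bytes on the volume once the filler is removed


def run_disk_fill_experiment(
    target_dir: str,
    max_bytes: int,
    min_free_bytes: int,
    chunk_size: int = 1024 * 1024,
    observe: Optional[Callable[[str], None]] = None,
) -> ExperimentResult:
    """Consume disk space under target_dir with two hypothetical safety limits.

    max_bytes caps the blast radius; min_free_bytes is a floor below which the
    experiment refuses to push the volume. The filler is always removed.
    """
    scratch = tempfile.mkdtemp(prefix="diskfill-", dir=target_dir)
    filler = os.path.join(scratch, "filler.bin")
    written = 0
    aborted = False
    try:
        with open(filler, "wb") as fh:
            while written < max_bytes:
                n = min(chunk_size, max_bytes - written)
                # Stop before this chunk would take free space below the floor.
                if shutil.disk_usage(target_dir).free - n < min_free_bytes:
                    aborted = True
                    break
                # Note: a compressing filesystem may store these zeros in less space.
                fh.write(b"\0" * n)
                fh.flush()
                os.fsync(fh.fileno())
                written += n
        # While the space is consumed, check the hypothesis
        # (for example: did log rotation happen?).
        if observe is not None:
            observe(filler)
    finally:
        # Roll back no matter what happened above.
        shutil.rmtree(scratch, ignore_errors=True)
    return ExperimentResult(written, aborted, shutil.disk_usage(target_dir).free)


if __name__ == "__main__":
    # Smallest useful experiment: a few MiB in a throwaway directory.
    with tempfile.TemporaryDirectory() as sandbox:
        result = run_disk_fill_experiment(
            sandbox,
            max_bytes=4 * 1024 * 1024,
            min_free_bytes=100 * 1024 * 1024,
            observe=lambda path: print("filler size:", os.path.getsize(path)),
        )
        print(result)
```

The two limits are the design point: start with a small `max_bytes` on a single instance or container, confirm the system behaves as predicted, then scale up and repeat.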