Phil Dibowitz, systems engineer at Facebook, led its Chef team through a three-year process to rethink configuration management to scale on par with the Web-scale company's data centers. SearchDataCenter.com spoke with Dibowitz after all of Facebook's infrastructure, as well as its backend IT, moved to Chef and the team turned off CFEngine servers.
Dibowitz and his team helped individual service owners take on their own cookbooks and guided the company-wide conversion to Chef. It took three years, Dibowitz said, not because of the technological change, but because of the structural change, with software engineers taking on ownership of the ops side.
With the Chef DevOps migration complete, Dibowitz and his team are working on operating systems at Facebook. The problem is that OSes are set up once and never truly owned -- Dibowitz wants to change how OSes are installed and managed. He calls it a "natural dovetail with Chef." The tools, models and workflows hammered out with Chef will come into play for OS speed and management improvements.
What tools for configuration management and automation have you added, or gotten rid of, as your DevOps model matures?
Phil Dibowitz: We do exclusively systems configuration on Chef. App configuration is a different system. We had to draw a hard line and say app stuff stays here; systems stuff stays here. We already had great systems for deploying and configuring apps that were Facebook-aware. Apps have different requirements and different testing needs. While Chef can do one, the other, or both, it was better for us to target one and do it really well.
We released a bunch of tools [in early 2014] on GitHub and RubyGems: Taste Tester for testing; [a rewritten version of] Grocery Delivery to distribute cookbooks to Chef Servers; and Chef Server Stats [a small utility to pull monitoring information from Chef servers]. They're for how we use Chef and show why we do things that way.
I was also the first external committer to the Chef code base. Chef now offers maintainership to the community. So, I became an official Chef maintainer. I contributed more stuff for Chef Client, such as a feature to install multiple packages in one resource.
We released a few cookbooks that Facebook uses internally. They're there as a good example for people to touch and play with. I'd like to release a bunch more as soon as I can get them cleaned up and ready for the world.
How does Facebook benefit from open source community contributions?
Dibowitz: For Grocery Delivery and Taste Tester, we got bug fixes and feature enhancements from the community. That's really helpful; we gained additional support features, for example. When we talk to the community, there's a shared language. You can discuss ideas.
Facebook has become a fairly big name amongst Chef users, and I try to make it really clear that the things we do are applicable to shops of any size. We had to solve these problems, and you can use what we learned to solve them at smaller scale for you too. We work with Yahoo!, but also tons of small companies, banks, big enterprises, Web 2.0 startups, Chef enthusiasts with home deployments. ... We have the perspective of several different ways to use Chef and do configuration management.
Editor's Note: In June 2014, Facebook announced plans to open source the forwarding agent for FBOSS, its SDN switch operating system, and Wedge, its bare-metal switching hardware.
How has the concurrent evolution of Open Compute Project shaped how you do things at Facebook?
Dibowitz: Early on, when talking about the possibility of building our own switching stack, our pie-in-the-sky scenario was being able to treat them as much like a server as possible, but with a real switch and as fast as that. How do we handle all the stuff that isn't the switch itself, so it all just works? How do we make networking work like it hasn't before?
If there was going to be a part of the device that looks like a Linux box, then you have to maintain all the stuff, like syslog. So with Chef on those boxes immediately that problem is solved.
Since the dawn of time, the problem with managing network devices is that it's really hard to automate configuration. The OS fundamentally assumes a lot of state in the admin's head. If you're adding six rules between firewalls, your interim state might not be safe, might lock you out, etc. It's slow, hard and error-prone. Then there's the typical Linux process where you write out the config file, it reads it and you're done. ... That's hard to do in a legacy environment, given how networking has always worked.
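Editor's Note: The contrast Dibowitz draws, between applying rules one at a time and writing out a complete config file, can be illustrated with a minimal Python sketch. This is not Facebook's code or anything from FBOSS. The function names, the rule strings and the file name are hypothetical, and the sketch assumes a POSIX filesystem where renaming a file within one directory is atomic.

```python
import os
import tempfile


def render_rules(rules):
    """Render the full desired rule set as one config file body."""
    return "".join(f"{rule}\n" for rule in rules)


def write_config_atomically(path, rules):
    """Write the complete config to a temp file, then swap it into place.

    A reader of `path` sees either the old file or the new one, never a
    half-applied mix of the two (hypothetical helper, for illustration).
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on
    # one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-rules-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(render_rules(rules))
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # swap the whole file in one step
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    # Hypothetical rule strings; a real device would have its own syntax.
    write_config_atomically("rules.conf", ["allow tcp 22", "deny all"])
```

The point of the sketch is that the whole desired state is rendered first and swapped in as a unit, so there is no interim state in which only some of the rules exist.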
We don't use Chef for the actual switch to push commands down to the ASICs, but the concept is very similar to the way Chef works -- more open source to come there. We built this the same way we build everything else: Start with a small base that works and iterate. FBOSS and Wedge provide a base. We now have a basic set of utilities for routing and
At least meet the minimum, and strive for the ideal. This is a DevOps model. ... For the Chef migration, we supported the commonality, and people sent feature requests. We prioritized the most standard projects then did harder or unique and smaller migrations. That same model works no matter what you're building.
How are the IT organizations joining you in DevOps different today than they were 2 years ago?
Dibowitz: There are more companies, and different kinds of companies. I couldn't have imagined three years ago, at my first ChefConf, that there'd be massive banks asking about continuous delivery. There are a lot of banks, investment firms, old-world traditional companies, aerospace companies and so on.
I still get the same technical questions, such as "How do you implement a cookbook to do X?" Or, "How do you train people?" But I also get crazy new ones, like "How does Chef interplay with audits?" And, "How does your PCI auditor feel about it?"
Now there's also a lot of talking about the people side. How do you build the right team, foster the right attitude and ensure that people are being responsible and reliable while making all these changes? This might be because more big companies are embracing DevOps.
What are the avoidable mistakes you wish you'd known when starting to work as a DevOps company?
Dibowitz: I'd break them down into technical and cultural mistakes.
Cultural: Everyone should do code reviews. Someone has to review and accept your code before you can commit it. Big companies already buy into that model. GitHub users buy into it as well. But tons of companies say, "Oh, we're too small" or "We don't have time." And what happens is that you spend far more time backing out of things and working around processes. With code reviews, you collaborate more and you gain better perspective on the simpler or more repeatable path. Use GitHub or Phabricator to do that.
Technical: Use a correctness tool to find errors and syntax problems. Foodcritic is one correctness tool, or use the RuboCop linter. With a human checking syntax and style, there are too many rounds of reviews and missed errors. Those tools find the nitpicky things, so you can focus just on making the change you want to make. Humans can focus on approach and overall code quality. That feedback is what humans are good at, and people are excited to receive it.
I wish we had used those sooner. I would have delayed the project to do it, because it makes the experience so much better.