Every network experiences outages, and Facebook is no exception. Vice President of Infrastructure Engineering Jay Parikh spoke with CNET about how the social network handles outages and requests for more servers.
On the topic of outages, he told CNET:
First and foremost, we take this stuff extremely seriously. We want to be fast, and we always want to be up and running. That’s our No. 1, No. 2, No. 3 priority … When we do have a misstep, we spend a lot of time internally really understanding what happened, how we’re not going to let it happen to us again.
I run a meeting every week where we go through all of the issues that happened … We go over the timeline of what happened, the impact to users, what the root cause was, how we fixed it. And we spend the bulk of the time going over what do we need to do to correct it … A lot of companies come up with great ideas and they go into a folder that no one ever sees again. We’re very diligent on the follow up.
The important thing here is that we do emphasize needing to focus on impact and moving fast. Because of this, we’re going to take on some risk … For us, moving fast is the most important thing, and we just try to minimize or buffer ourselves from the risk of what that means. This doesn’t mean that just because we have 900 million users, that we need to slow down or we’re going to take less risk.
On the topic of requests for new servers, Parikh told CNET:
We have a capacity and performance team that is responsible for working with all of the engineering teams in trying to understand their needs. Now, this team isn’t just a bunch of analysts and businesspeople pushing spreadsheets around — they’re actual engineers who know code, who know performance, who know distributed systems and architectures. So they really have a thorough understanding of how a system is working and how a system is supposed to be working.
You don’t just come to this team when you need more servers. You can get more capacity, but that doesn’t always mean getting more servers. We had a funny story where one of our engineer teams came to the capacity team and said, “We need 8,000 servers.” The capacity team said, “OK, why?” and then spent some time understanding what the engineer team was trying to do. After they got through a series of discussions and review, they only needed 16 servers instead of 8,000.