Netflix finishes its massive migration to the Amazon cloud

After move to Amazon, only the DVD business still uses a traditional data center.

Netflix has been moving huge portions of its streaming operation to Amazon Web Services (AWS) for years now, and it says it has finally completed its giant shift to the cloud. “We are happy to report that in early January of 2016, after seven years of diligent effort, we have finally completed our cloud migration and shut down the last remaining data center bits used by our streaming service,” Netflix said in a blog post that it plans to publish at noon Eastern today. (The blog should go up at this link.)

Netflix operates “many tens of thousands of servers and many tens of petabytes of storage” in the Amazon cloud, Netflix VP of cloud and platform engineering Yury Izrailevsky told Ars in an interview.

Netflix had earlier planned to complete the shift by the end of last summer.

“Billing and payments was the last remaining piece. We wanted to make sure we do it right; obviously, there is a lot of privacy concerns around customer data,” Izrailevsky said. Previously, the applications and data related to billing and payments were in a cage Netflix rented at a colocation facility.

With this last piece finished, Netflix’s streaming business no longer operates any of its own data center space. But not everything is in Amazon.

Netflix operates its own content delivery network (CDN) called Open Connect. Netflix manages Open Connect from Amazon, but the storage boxes holding videos that stream to your house or mobile device are all located in data centers within Internet service providers' networks or at Internet exchange points, facilities where major network operators exchange traffic. Netflix distributes traffic directly to Comcast, Verizon, AT&T, and other big network operators at these exchange points.

Once a customer hits the “play” button, video is delivered from one of those sites. But all the applications and data needed to manage everything a customer does before clicking “play” (such as signing up for the service or searching for videos) are running in the Amazon cloud. All the customer-facing systems for the streaming business are thus in Amazon or the Open Connect storage boxes. “All the search, personalization, all the business logic, all the data processing that enables the streaming experience, the 100 different applications and services that make up the streaming application, they live in AWS,” Izrailevsky said.

Most of the technology needed to manage employees of the streaming business is also in Amazon, though the company uses some software-as-a-service applications such as Workday, Izrailevsky said.

Remember DVDs?

There’s one other exception to Netflix’s shift to the cloud. While the streaming business has gone all cloud, the old DVD mailing business has not. “Our DVD business is still relying on the data center [colocation facility] for all of their operations,” Izrailevsky said. The DVD and streaming businesses are run separately, with their own systems and processes. The DVD business is “stable” and well served by its current setup, Izrailevsky said.

In other words, the DVD business isn’t experiencing the massive growth that requires the ability to scale up as needed. Netflix streaming, meanwhile, just keeps growing and accounts for more than one-third of all North American fixed Internet traffic during peak viewing hours, according to the latest Sandvine Global Internet Phenomena Report. “Supporting such rapid growth would have been extremely difficult out of our own data centers; we simply could not have racked the servers fast enough,” Netflix’s blog post says. “Elasticity of the cloud allows us to add thousands of virtual servers and petabytes of storage within minutes, making such an expansion possible.”

Even though the DVD business remains in a traditional data center, it was actually an outage in the DVD operation that spurred Netflix’s shift to the cloud for its streaming service. For three days in August 2008, Netflix couldn’t ship DVDs to customers because of a major database corruption. As annoying as that was, Netflix knew it would be even worse if something like that happened to the streaming product. Customers could still watch the DVDs they had during that outage. But with streaming, a three-day outage would mean no video watching, period. Netflix had launched its streaming service in 2007 and knew there was potential for growth, “so we wanted to get ahead of that,” Izrailevsky said. Besides improved availability, Netflix says using Amazon allowed it to meet increasing demand at a lower price than it would have paid if it still operated its own data centers.

Netflix declined to say how much it pays Amazon, but says it expects to "spend over $800 million on technology and development in 2016," up from $651 million in 2015. Netflix spends less on technology than it does on marketing, according to its latest earnings report.

Netflix’s Simian Army

The big question on your mind might be this: What happens if the Amazon cloud fails?

Guarding against that possibility is one reason it took Netflix seven years to make the shift to Amazon. Instead of moving existing systems intact to the cloud, Netflix rebuilt nearly all of its software to take advantage of a cloud network that "allows one to build highly reliable services out of fundamentally unreliable but redundant components," the company says. To minimize the risk of disruption, Netflix has built a series of tools with names like “Chaos Monkey,” which randomly takes virtual machines offline to make sure Netflix can survive failures without harming customers. Netflix’s “Simian Army” also includes Chaos Gorilla (which disables an entire Amazon availability zone) and Chaos Kong (which simulates an outage affecting an entire Amazon region and shifts workloads to other regions).
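To make the idea concrete, here is a minimal Python sketch of that kind of failure injection. It is not Netflix's code: it works on an in-memory list standing in for a fleet of virtual machines, and every name in it (Instance, build_fleet, chaos_monkey, chaos_gorilla, can_serve) is hypothetical. One function takes a random instance offline, another takes out a whole zone, and a health check asks whether enough capacity survives.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """A stand-in for one virtual machine (hypothetical model, not an AWS type)."""
    instance_id: str
    zone: str
    running: bool = True


def build_fleet(zones, per_zone):
    """Create per_zone simulated instances in each zone."""
    return [Instance(f"{zone}-{i}", zone) for zone in zones for i in range(per_zone)]


def chaos_monkey(fleet, rng):
    """Take one randomly chosen running instance offline and return it."""
    candidates = [inst for inst in fleet if inst.running]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False
    return victim


def chaos_gorilla(fleet, zone):
    """Take every running instance in one zone offline and return them."""
    hit = [inst for inst in fleet if inst.zone == zone and inst.running]
    for inst in hit:
        inst.running = False
    return hit


def can_serve(fleet, min_running):
    """The service counts as healthy if enough instances are still running."""
    return sum(inst.running for inst in fleet) >= min_running


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the drill is repeatable
    fleet = build_fleet(["zone-a", "zone-b", "zone-c"], per_zone=4)

    victim = chaos_monkey(fleet, rng)
    print("terminated", victim.instance_id, "| healthy:", can_serve(fleet, min_running=6))

    chaos_gorilla(fleet, "zone-b")
    print("zone-b offline | healthy:", can_serve(fleet, min_running=6))
```

The point of running such a drill on purpose is that the health check, not luck, tells you whether the remaining capacity is enough. A real tool would terminate actual instances through the cloud provider's API rather than flipping a flag.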

Amazon’s cloud network is spread across 12 regions worldwide, each of which has availability zones consisting of one or more data centers. Netflix operates primarily in the Northern Virginia, Oregon, and Dublin regions, but if an entire region goes down, “we can instantaneously redirect the traffic to the other available ones,” Izrailevsky said. "It's not that uncommon for us to fail over across regions for various reasons."
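A rough sketch of what redirecting traffic away from a failed region could look like follows. It assumes traffic is described as a share per region and that a failed region's load is spread over the survivors in proportion to what they already carry. The function name shift_traffic and the share values are hypothetical. The region identifiers are AWS's names for Northern Virginia, Oregon, and Ireland (Dublin). Netflix's actual routing logic is not described in this article.

```python
def shift_traffic(shares, failed_regions):
    """Return new traffic shares after removing failed regions.

    The load of each failed region is spread over the surviving regions
    in proportion to the share each one was already carrying.
    """
    healthy = {region: share for region, share in shares.items()
               if region not in failed_regions}
    total = sum(healthy.values())
    if total <= 0:
        raise RuntimeError("no healthy region left to take traffic")
    return {region: share / total for region, share in healthy.items()}


if __name__ == "__main__":
    # Hypothetical steady-state split across the three regions named above.
    shares = {"us-east-1": 0.5, "us-west-2": 0.25, "eu-west-1": 0.25}
    print(shift_traffic(shares, failed_regions={"us-east-1"}))
```

With these made-up numbers, losing us-east-1 leaves the two surviving regions carrying half of the traffic each, which is why each region needs spare capacity before a failover is attempted.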

Years ago, Netflix wasn't able to do that, and the company suffered a streaming failure on Christmas Eve in 2012, when it was operating in just one Amazon region. “We've invested a lot of effort in disaster recovery and making sure no matter how big a failure that we're able to bring things back from backups,” he said.

Netflix has multiple backups of all data within Amazon.

“Customer data or production data of any sort, we put it in distributed databases such as Cassandra, where each data element is replicated multiple times in production, and then we generate primary backups of all the data into S3 [Amazon’s Simple Storage Service],” he said. “All the logical errors, operator errors, or software bugs, many kinds of corruptions—we would be able to deal with them just from those S3 backups.”
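The difference between replication and backups can be shown with a toy example. The sketch below is an in-memory illustration only. It does not use the Cassandra or S3 APIs, and the class ReplicatedStore and its methods are hypothetical. It shows that replicas cover the loss of a node, while a bad write is copied to every replica and can only be undone from a backup.

```python
import copy


class ReplicatedStore:
    """Toy key-value store: every write goes to all replicas (hypothetical model)."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.snapshots = []  # stand-in for an external backup location

    def write(self, key, value):
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Return the value from the first replica that still holds the key.
        for replica in self.replicas:
            if key in replica:
                return replica[key]
        raise KeyError(key)

    def lose_replica(self, index):
        """Simulate a node failure that wipes one replica."""
        self.replicas[index] = {}

    def backup(self):
        """Snapshot the merged contents of all replicas."""
        merged = {}
        for replica in self.replicas:
            merged.update(replica)
        self.snapshots.append(copy.deepcopy(merged))

    def restore_latest(self):
        """Rebuild every replica from the most recent snapshot."""
        snapshot = self.snapshots[-1]
        self.replicas = [copy.deepcopy(snapshot) for _ in self.replicas]


if __name__ == "__main__":
    store = ReplicatedStore(replica_count=3)
    store.write("plan", "premium")
    store.backup()

    store.lose_replica(0)        # hardware failure: other replicas still answer
    print(store.read("plan"))

    store.write("plan", None)    # software bug: the bad value reaches every replica
    store.restore_latest()       # only the backup can undo it
    print(store.read("plan"))
```

In this sketch the read after the node failure and the read after the restore both return the original value, while the buggy write in between would have been served faithfully by every replica.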

What if all of Netflix's systems in Amazon went down? Netflix keeps backups of everything in Google Cloud Storage in case of a natural disaster, a self-inflicted failure that somehow takes all of Netflix's systems down, or a “catastrophic security breach that might affect our entire AWS deployment,” Izrailevsky said. “We've never seen a situation like this and we hope we never will.”

But Netflix would be ready in part thanks to a system it calls “Armageddon Monkey,” which simulates failure of all of Netflix’s systems on Amazon. It could take hours or even a few days to recover from an Amazon-wide failure, but Netflix says it can do it. Netflix pointed out that Amazon isolates its regions from each other, making it difficult for all of them to go out simultaneously.

“So that's not the scenario we're planning for. Rather it's a catastrophic bug or data corruption that would cause us to wipe the slate clean and start fresh from the latest good back-up,” a Netflix spokesperson said. “We hope we will never need to rely on Armageddon Monkey in real life, but going through the drill helps us ensure we back up all of our production data, manage dependencies properly, and have a clean, modular architecture; all this puts us in a better position to deal with smaller outages as well.”

Netflix declined to say where it would operate its systems during an emergency that forced it to move off Amazon. "From a security perspective, it'd be better not to say," a spokesperson said.

Netflix has released a lot of its software as open source, saying it would rather collaborate with other companies than keep secret its methods for making cloud networks more reliable. “While of course cloud is important for us, we're not very protective of the technology and the best practices, we really hope to build the community,” Izrailevsky said.
