Monday, April 9, 2012

Scalability lessons from Google, YouTube, Twitter, Amazon, eBay, Facebook and Instagram

I've gathered together in one place a few lessons in scalability from seven of the most highly trafficked websites around. I've grabbed this primarily from various articles on the excellent High Scalability Blog and have summarized the main points from each company below.

Here are some common ideas I've noticed across all seven companies ...

  1. Keep it simple - complexity will come naturally over time.
  2. Automate everything, including failure recovery.
  3. Iterate your solutions - be prepared to throw away a working component when you want to scale it up to the next level.
  4. Use the right tool for the job, but don't be afraid to roll your own solution.
  5. Use caching, where appropriate.
  6. Know when to favor data consistency over data availability, and vice versa.


Reliable storage
Reliable scalable storage is a core need of any application. The Google File System (GFS) is Google's core storage platform - its a large distributed log structured file system into which they throw a lot of data. Why did they build it instead of using something off the shelf? Because they control everything about it, and it's the platform that distinguishes them from everyone else. From GFS, they gain: high reliability across data centers, scalability to thousands of network nodes, huge read/write bandwidth, support for large blocks of data which are gigabytes in size and an efficient distribution of operations across nodes to reduce bottlenecks.

Infrastructure as a competitive advantage
Google can roll out new internet services faster, cheaper, and at scale at few others can compete with. Many companies take a completely different approach by treating infrastructure as an expense. Each group will use completely different technologies and there will be little planning and commonality of how to build systems.

Build applications on top of a platform
An under appreciated advantage of a platform approach is that junior developers can quickly and confidently create robust applications. If every project needs to create the same distributed infrastructure you'll quickly run into difficulty because the people who know how to do this are relatively rare. Synergy isn't always crap. By making all parts of a system work together an improvement in one helps them all. Improve the file system and everyone benefits immediately and transparently. If every project uses a different file system then there's no continual incremental improvement across the entire stack.

Automation and recovery
Build self-managing systems that work without having to take the system down. This allows you to more easily rebalance resources across servers, add more capacity dynamically, bring machines off line, and gracefully handle upgrades.

Create a Darwinian infrastructure
Perform a time consuming (CPU bound) operation in parallel and take the winner. This is especially useful when you have spare CPU capacity but limited IO.

Don't ignore the Academy
Academia has a lot of good ideas that don't get translated into production environments. Most of what Google has done has prior art, just not prior large scale deployment.

Consider data compression
Compression is a good option when you have a lot of CPU to throw around and limited IO.


Keep it simple
Look for the simplest thing that will address the problem space. There are lots of complex problems, but the first solution doesn’t need to be complicated. The complexity will come naturally over time. Choose the simplest solution possible with the loosest guarantees that are practical. The reason you want all these things is you need flexibility to solve problems. The minute you over specify something you paint yourself into a corner. You aren’t going to make those guarantees. Your problem becomes automatically more complex when you try and make guarantees - you leave yourself no way out.

Cheat: know how to fake data
The fastest function call is the one that doesn’t happen. When you have a consistently increasing counter, like a view count, you would need to do a database call every update. Or you could do a call every once in awhile, and update by a random amount in between - people will believe it’s real. Know how to fake data.

Jitter: add entropy back into your system
If your system doesn’t jitter then you get thundering herds of people all requesting the same resource at the same time. For a popular video, they cache things as best they can. The most popular video they might cache for 24 hours. If everything expires at the same time, then every machine will calculate the expiration at the same time. This creates a thundering herd. By jittering you are saying randomly expire between 18-30 hours. That prevents things from happening at the same time and spreads requests out over a long period of time.

Approximate correctness
The state of the system is that which the user sees. If a user can’t tell a part of the system is skewing and inconsistent, then it’s not. If you write a comment and someone loads the page at the same time, they might not get it for half a second, the user who is just reading the page won’t care. The writer of the comment will care though, so you make sure the user who wrote the comment will see it immediately. This allows you to cheat a little bit. When it comes to comments, your system doesn’t have to have globally consistent transactions. That would be super expensive and overkill. Comments are not financial transaction - know when you can cheat.


Implement an API
Twitter's API Traffic is ten times that of Twitter’s Website alone. The API is the most important thing Twitter did to grow their user base. Keeping the service simple allowed developers to build on top of their infrastructure and come up with app ideas that are way better than Twitter could come up with. You can never do all the work your user's can do and you probably won't be as creative. So open up your application and make it as easy as possible for others to integrate your application with theirs.

Use what you know
Twitter uses messaging a lot. Producers produce messages, which are queued, and then are distributed to consumers. Twitter's main functionality is to act as a messaging bridge between different formats (SMS, web, IM, etc). Send a message to invalidate a friend's cache in the background instead of doing it all individually, synchronously. Twitter developers were most familiar with Ruby, so they moved from DRb to Starling, a distributed queue written in Ruby. Distributed queues were made to survive system crashes by writing them to disk. In Twitter's experience, most performance improvements come not from the choice of language, but from an application's design.

Know when and what to cache
For example, getting your friends status is complicated. There are security and other issues. So rather than doing a database query, a friend's status is updated in cache instead. It never touches the database. 90% of requests are API requests. So they don't do any page caching for the front-end. Twitter pages are so time sensitive it doesn't do any good. Twitter only caches API requests.

Defend yourself against abuse
Understand how people will try to bust your system. Put in reasonable limits and detection mechanisms to protect your system from being killed by bots. Build tools to detect abuse so you can pinpoint when and where they are happening. Be ruthless. Delete them as users.


Amazon's architecture is loosely coupled and built around services. A service-oriented architecture (SOA) gave them the isolation that would allow building many software components rapidly and independently of each other, allowing fast time to market. The application that renders the Web pages is one such application server. So are the applications that serve the Web-services interface, the customer service application, and the seller interface.

Open up your system with APIs and you'll create an ecosystem around your application. Organizing around services gives you agility - you can now do things in parallel is because the output of everything is a service. Prohibit direct database access by clients. This means you can make you service scale and be more reliable without involving your clients. This is much like Google's ability to independently distribute improvements in their stack to the benefit of all applications.

Know when to favour consistency over availability and vice versa
To scale you have to partition, so you are left with choosing either high consistency or high availability for a particular system. You must find the right overlap of availability and consistency. Choose a specific approach based on the needs of the service. For the checkout process you always want to honor requests to add items to a shopping cart because it's revenue producing. In this case you choose high availability. Errors are hidden from the customer and sorted out later. When a customer submits an order you favor consistency because several services (credit card processing, shipping and handling, reporting) are simultaneously accessing the data and each rely on their data to be consistent.

Embrace failure
Take it for granted stuff fails, that's reality, embrace it. For example, go more with a fast reboot and fast recover approach. With a decent spread of data and services you might get close to 100%. Create a self-healing, self-organizing, lights-out type operation.

Only use what you need
Keep things simple by making sure there are no hidden requirements and hidden dependencies in the design. Cut technology to the minimum you need to solve the problem you have. It doesn't help the company to create artificial and unneeded layers of complexity. Not stuck with one particular approach or technology stack. Some places they use jboss/java, but they use only servlets, not the rest of the J2EE stack. C++ is uses to process requests. Perl/Mason is used to build content.

Base decisions on customer feedback
Use measurement and objective debate to separate the good from the bad. Expose real customers to a choice and see which one works best and to make decisions based on those tests. This is done with techniques like A/B testing and Web Analytics. If you have a question about what you should do - code it up, let people use it, and see which alternative gives you the results you want.

Scalability as a competitive advantage
Infrastructure for Amazon, like for Google, is a huge competitive advantage. They can build very complex applications out of primitive services that are by themselves relatively simple. They can scale their operation independently, maintain unparalleled system availability, and introduce new services quickly without the need for massive reconfiguration.


Partition everything
If you can't split it, you can't scale it. Split everything into manageable chunks by function and data.

Asynchrony everywhere
Connect independent components through event-driven queues and pipelines.

Embrace failure
Monitor everything, provide service even when parts start failing. Minimize and control dependencies, use abstract interfaces and virtualization, components have an SLA, consumers responsible for recovering from SLA violations. Automate everything. Components should automatically adjust and the system should learn and improve itself.

Embrace inconsistency
Pick for each feature where you need to be on the CAP continuum, no distributed transactions, inconsistency can be minimized by careful operation ordering, become eventually consistent through async recovery and reconciliation.

Save all your data
Data drives finding optimization opportunities, predictions, recommendations, so save it all. Know which data is authoritative, which data isn't, and treat it accordingly.

Infrastructure: use the right tool for the right job
Need to maximize utilization of every resource: data (memory), processing (CPU), clock time (latency), power. One size rarely fits all, particularly at scale. Compose from orthogonal, commodity components.


Scaling takes multiple iterations
Solutions often work in the beginning, but you'll have to modify them as you go on - what works in year one may not work later. A good example is photos. Facebook currently serves 1.2 million photos a second. The first generation was built the easy way - don't worry about scaling that much - focus on getting the functionality right. The uploader stored the file in NFS and the meta-data was stored in MySQL. It only worked for the first 3 months but this didn't matter because time to market was the biggest competitive advantage they had. Having the feature was more important than making sure it was a fully thought out, scalable solution. The second generation was optimized - different access patterns were optimized for. Smaller images were accessed more frequently so those became cached. They also started using a CDN (Content Delivery Network). The third generation is an overlay system that creates a file that is a blob stored in the file system. Images are stored in a binary blob and you know the byte offset of the photo in the blob - so there's only one disk IO per photo.

Don't over design a solution - keep it simple
Just use what you need to use as you scale your system out. Figure out where you need to iterate on a solution, optimize something, or completely build a part of the stack yourself. Facebook spent a lot of time trying to optimize PHP, and ended up writing HipHop, a tool to convert PHP into C++. This generated a massive amount of memory and CPU savings. You don't have to do this on day one, but you may have to. Focus on the product first before you write an entire new language.

Choose the right tool for the job, but accept that your choice comes with overhead
If you really need to use Python then go ahead and do so, but realize with that choice there is overhead, usually across deployment, monitoring and operations. If you choose to use a service oriented architecture (SOA) you'll have to build most of the backend yourself and that often takes quite a bit of time. With the LAMP stack you get a lot for free. Once you move away for the LAMP stack, how you do things like service configuration and monitoring is up to you. As you go deeper into the services approach you have to reinvent the wheel.

Get the culture right
Build an environment internally which promotes building the right thing first and then fixing as needed, not worrying about innovating, not worrying about breaking things, thinking big, thinking what is the next thing you need to build after the building the first thing. You can get the code right, you can get the products right, but you need to get the culture right first. If you don't get the culture right then your company won't scale.


Make use of existing cloud infrastructure

Don't reinvent the wheel when you can use solid and proven technology. Instagram runs 100+ instances of Ubuntu 11.04 on Amazon's EC2 cloud computing infrastructure. They also use Amazon's ELB (Elastic Load Balancer), which comprises three NGINX instances, with automatic failure recovery. The photos themselves go straight to Amazon S3 storage and they use Amazon CloudFront as their CDN (Content Delivery Network), which helps with image load times from users around the world (like in Japan, their second most-popular country).

Asynchronous task queuing
When a user decides to share out an Instagram photo to Twitter or Facebook, or when they need to notify a Real-time subscriber of a new photo posted, they push the task into the open source Gearman task management framework. Doing it asynchronously through a task queue means that media uploads can finish quickly, while the ‘heavy lifting’ can run in the background. There's about 200 workers, written in Python, consuming the task queue at any given time, split between the services they share to.

Push notifications
They use an open source Apple Push Notification Service (APNS) provider called pyapns, which is based on Twisted. It has handled over a billion push notifications for Instagram, and they report that its been rock-solid.

Real-time system-wide monitoring
With 100+ EC2 instances, it’s important to keep on top of what’s going on across the board. They use Munin to graph metrics system-wide, which sends an alert if anything is outside of its normal operating range. They write custom Munin plugins, building on top of Python-Munin, to graph metrics that aren’t system-level (for example, signups per minute, photos posted per second, etc). They use Pingdom for external monitoring of the service, and PagerDuty for handling notifications and incidents. For Python error reporting, they use Sentry, an open-source Django app. At any given time, they can sign-on and see what errors are happening across their system, in real time.

Selective use of NoSQL technology (like Redis)
Redis powers the main feed, the activity feed, the sessions system (here’s their Django session backend), and other related systems. All of Redis’ data needs to fit in memory, so they run several Quadruple Extra-Large Memory instances on EC2 for Redis, and occasionally shard across a few Redis instances for any given subsystem.

Sidenote on CAP
Eric Brewer's CAP Theorem or "the three properties of systems"
  • There are three properties of a system: consistency, availability and tolerance to network partitions (partitionability).  
  • You can have at most two of these three properties for any shared-data system. 
    • Consistency: write a value and then when you read the value you get the same value back. In a partitioned system there are time windows where that's not true. 
    • Availability: you may not always be able to write or read. The system may prevent you from writing because it wants to keep the system consistent.
    • Partitionability: divide nodes into small groups that can see other groups, but they can't see everyone.

Follow @dodgy_coder

Subscribe to posts via RSS