On Designing and Deploying Internet-Scale Services

Created: May 21, 2021 / Updated: July 24, 2025 / Status: finished / 5 min read (~912 words)
software development
  • 3 tenets
    • Expect failures
    • Keep things simple
    • Automate everything
  • The entire service must be capable of surviving failure without human administrative interaction
  • The best way to test the failure path is never to shut the service down normally. Just hard-fail it
  • The acid test: is the operations team willing and able to bring down any server in the service at any time without draining the workload first?
    • If they are, then there is synchronous redundancy (no data loss), failure detection, and automatic take-over
  • Large clusters of commodity servers are much less expensive than the small number of large servers they replace
  • Server performance continues to increase much faster than I/O performance, making a small server a more balanced system for a given amount of disk
  • Power consumption scales linearly with the number of servers but roughly cubically with clock frequency (dynamic power grows with voltage squared times frequency, and voltage must rise with frequency), making higher-performance servers more expensive to operate
  • A small server affects a smaller proportion of the overall service workload when failing over
  • Two factors that make some services less expensive to develop and faster to evolve than most packaged products are
    • the software needs to only target a single internal deployment
    • previous versions don't have to be supported for a decade as is the case for enterprise-targeted products
  • Basic design tenets
    • Design for failure
    • Implement redundancy and fault recovery
    • Depend upon a commodity hardware slice
    • Support single-version software
    • Enable multi-tenancy
  • Each pod should be as close to 100% independent as possible, with no inter-pod correlated failures
  • What isn't tested in production won't work, so the operations team should periodically run a fire drill using these tools
  • If the service-availability risk of a drill is excessively high, then insufficient investment has been made in the design, development, and testing of the tools
  • Some form of throttling or admission control is common at the entry to the service, but there should also be admission control at all major component boundaries
  • The general rule is to attempt to gracefully degrade rather than hard failing and to block entry to the service before giving uniform poor service to all users
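
A minimal sketch of admission control at a component boundary, using a token bucket that rejects excess work at the door instead of degrading service for everyone (the class name, rates, and `process()` stub are illustrative, not from the paper):

```python
import time

class TokenBucket:
    """Token-bucket admission control for a service or component boundary."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec        # sustained admission rate
        self.capacity = burst           # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def try_admit(self) -> bool:
        """Admit one request if capacity allows; otherwise shed it early."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_sec=100, burst=20)

def process(request):
    return "200 OK"                     # stand-in for the real work

def handle(request):
    if not bucket.try_admit():
        return "503: overloaded, retry later"   # block at entry, degrade gracefully
    return process(request)
```
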
  • Partitions should be infinitely adjustable and fine-grained, and not be bounded by any real-world entity
    • We recommend using a look-up table at the mid-tier that maps fine-grained entities, typically users, to the system where their data is managed
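
A sketch of that mid-tier look-up table: because each user is mapped individually to a backend, single users can be moved to rebalance load, and there is no fixed partition boundary to outgrow (`PartitionDirectory` and the backend names are illustrative):

```python
class PartitionDirectory:
    """Mid-tier look-up table: fine-grained entity (user) -> owning backend."""

    def __init__(self, backends: list[str]):
        self.backends = backends
        self.assignment: dict[str, str] = {}   # user_id -> backend

    def lookup(self, user_id: str) -> str:
        # First sighting gets a default placement; any policy could go here.
        if user_id not in self.assignment:
            self.assignment[user_id] = self.backends[hash(user_id) % len(self.backends)]
        return self.assignment[user_id]

    def move(self, user_id: str, backend: str) -> None:
        """Rebalance one user; no other partition boundary changes."""
        self.assignment[user_id] = backend

directory = PartitionDirectory(["db-1", "db-2", "db-3"])
owner = directory.lookup("alice")      # e.g. "db-2"
directory.move("alice", "db-3")        # fine-grained, per-user rebalancing
```
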
  • Expect to run in a mixed-version environment. The goal is to run single-version software, but multiple versions will be live during rollout and production testing
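
One common way to survive the mixed-version window is to make every reader tolerant of both the old and the new wire format; a sketch with assumed field names (`amount_dollars` vs. `amount_cents`):

```python
def parse_order(msg: dict) -> dict:
    """Accept both the old (v1) and new (v2) message shapes during rollout."""
    version = msg.get("version", 1)    # v1 senders predate the version field
    if version >= 2:
        amount = msg["amount_cents"]
    else:
        amount = int(round(msg["amount_dollars"] * 100))  # up-convert old field
    # Unknown fields are ignored, so newer senders don't break older readers.
    return {"order_id": msg["order_id"], "amount_cents": amount}

# Both shapes are live at once while the rollout completes:
parse_order({"order_id": "a1", "amount_dollars": 12.50})
parse_order({"order_id": "a2", "amount_cents": 1250, "version": 2})
```
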
  • Best practices in designing for automation include
    • Be restartable and redundant
    • Support geo-distribution
    • Automatic provisioning and installation
    • Configuration and code as a unit
    • Manage server roles or personalities rather than servers
    • Multi-system failures are common
    • Recover at the service level
    • Never rely on local storage for non-recoverable information (see the sketch after this list)
    • Keep deployment simple
    • Fail services regularly
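
As a sketch of "be restartable" combined with "never rely on local storage for non-recoverable information": write all progress through to a redundant store so any replica can pick up after a crash. The `ReplicatedKV` class below is an illustrative stand-in for a real replicated database:

```python
class ReplicatedKV:
    """Stand-in for a redundant store (e.g. a replicated database)."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key, default=None):
        return self._data.get(key, default)

durable = ReplicatedKV()

def process_job(job_id: str, payload: str) -> None:
    # Progress lives in the redundant store, never only on local disk,
    # so the job survives the loss of the server running it.
    durable.put(f"job/{job_id}/state", "started")
    durable.put(f"job/{job_id}/result", payload.upper())  # the actual work
    durable.put(f"job/{job_id}/state", "done")

def resume_after_restart(job_id: str, payload: str) -> None:
    # Restartable: skip work the durable state already marks as done.
    if durable.get(f"job/{job_id}/state") != "done":
        process_job(job_id, payload)
```
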
  • Dependency management
    • Expect latency
    • Isolate failures (see the sketch after this list)
    • Use shipping and proven components
    • Implement inter-service monitoring and alerting
    • Dependent services require the same design point
    • Decouple components
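
A standard way to "expect latency" and "isolate failures" at a dependency boundary is a circuit breaker (not named in the paper, but a common realization of these tenets); a minimal sketch, with thresholds chosen for illustration:

```python
import time

class CircuitBreaker:
    """Trip after repeated dependency failures; fail fast while open."""

    def __init__(self, max_failures: int = 5, reset_after_sec: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after_sec
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None       # half-open: let one call probe recovery
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

breaker = CircuitBreaker()
# breaker.call(fetch_profile, user_id)  # fetch_profile is a hypothetical
# dependency call; once it fails max_failures times in a row, callers fail
# fast instead of queuing behind a dead or slow dependency.
```
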
  • Testing in production is a reality and needs to be part of the quality assurance approach used by all internet-scale services
  • The following rules must be followed
    • The production system has to have sufficient redundancy that, in the event of a catastrophic new-service failure, state can be quickly recovered
    • Data corruption or state-related failures have to be extremely unlikely (functional testing must first be passing)
    • Errors must be detected and the engineering team (rather than operations) must be monitoring system health of the code in test
    • It must be possible to quickly roll back all changes and this roll back must be tested before going into production
  • Big-bang deployments are very dangerous
  • We favor deploying mid-day rather than at night
  • Some best practices for release cycle and testing include
    • Ship often
    • Use production data to find problems
      • A few strategies
        • Measurable release criteria
        • Tune goals in real time
        • Always collect the actual numbers
        • Minimize false positives
        • Analyze trends
        • Make the system health highly visible
        • Monitor continuously
    • Invest in engineering
    • Support version roll-back
    • Maintain forward and backward compatibility
    • Single-server deployment
    • Stress test for load
    • Perform capacity and performance testing prior to new releases
    • Build and deploy shallowly and iteratively
    • Test with real data
    • Run system-level acceptance tests
    • Test and develop in full environments
  • Best practices for hardware selection include
    • Use only standard SKUs
    • Purchase full racks
    • Write to hardware abstraction
    • Abstract the network and naming
  • Make the development team responsible
  • Soft delete only
  • Track resource allocation
  • Make one change at a time
  • Make everything configurable
  • To be effective, each alert has to represent a problem
  • To get alerting levels correct, two metrics can help and are worth tracking
    • alerts-to-trouble-ticket ratio (with a goal of near one)
    • number of system health issues without corresponding alerts (with a goal of near zero)
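
A trivial computation of those two metrics, assuming the raw counts come out of the ticketing and incident systems:

```python
def alert_quality(alerts: int, tickets_from_alerts: int, unalerted_issues: int) -> dict:
    """Alerts-to-ticket ratio should approach 1 (every alert is actionable);
    health issues without a corresponding alert should approach 0."""
    ratio = alerts / tickets_from_alerts if tickets_from_alerts else float("inf")
    return {"alerts_per_ticket": ratio, "unalerted_issues": unalerted_issues}

print(alert_quality(alerts=120, tickets_from_alerts=100, unalerted_issues=3))
# {'alerts_per_ticket': 1.2, 'unalerted_issues': 3}
```
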
  • Best practices include
    • Instrument everything
    • Data is the most valuable asset
    • Have a customer view of service
    • Instrument for production testing
    • Latencies are the toughest problem
    • Have sufficient production data
      • The most important data we've relied upon includes
        • Use performance counters for all operations (see the sketch after this list)
        • Audit all operations
        • Track all fault tolerance mechanisms
        • Track operations against important entities
        • Asserts
        • Keep historical data
    • Configurable logging
    • Expose health information for monitoring
    • Make all reported errors actionable
    • Enable quick diagnosis of production problems
      • Give enough information to diagnose
      • Chain of evidence
      • Debugging in production
      • Record all significant actions
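
A sketch of per-operation performance counters, here as a decorator that records call counts, errors, and latency for every operation (the names and data structure are illustrative):

```python
import functools
import time
from collections import defaultdict

# Per-operation counters: attempts, failures, and latency samples.
counters = defaultdict(lambda: {"calls": 0, "errors": 0, "latency_ms": []})

def instrumented(op_name: str):
    """Decorator that counts and times every invocation of an operation."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            counters[op_name]["calls"] += 1
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            except Exception:
                counters[op_name]["errors"] += 1
                raise
            finally:
                counters[op_name]["latency_ms"].append(
                    (time.monotonic() - start) * 1000)
        return inner
    return wrap

@instrumented("lookup_user")
def lookup_user(user_id: str) -> dict:
    return {"id": user_id}

lookup_user("alice")
print(counters["lookup_user"])   # calls, errors, and latencies for the operation
```
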
  • Two best practices, a "big red switch" and admission control, need to be tailored to each service
  • Support a "big red switch"
    • The ability to shed non-critical load in an emergency
  • Control admission
  • Meter admission
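
A minimal sketch of a "big red switch": a process-wide flag an operator flips in an emergency so the service sheds non-critical work while the critical path stays up (the flag name and request shape are assumptions):

```python
import threading

# Flipped by an operator (or automation) in an emergency.
BIG_RED_SWITCH = threading.Event()

def do_work(request: dict) -> str:
    return "200 OK"                    # stand-in for the real work

def handle(request: dict) -> str:
    if BIG_RED_SWITCH.is_set() and not request.get("critical", False):
        return "503: non-critical work shed"   # deferrable work is dropped
    return do_work(request)

BIG_RED_SWITCH.set()                                        # emergency: flip it
print(handle({"op": "nightly-report", "critical": False}))  # shed
print(handle({"op": "checkout", "critical": True}))         # still served
```
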