Quora question: How did companies like Uber & Whatsapp expand their backend database structured storage rapidly when faced with a massive influx of users? How does one maintain service, and prepare?
Stan and everyone else have been giving great answers. But none of them fully covers the actual question - how did they scale out the databases to handle all that load?
No doubt using the cloud instead of running your own servers enables this, but how? In the same way that saying “the wind allows us to sail” does not explain how a ship can travel across an ocean, saying “use the cloud” does not explain how a service's backend database can scale quickly and cost-effectively to meet sudden surges in demand.
Part of the answer has been covered by others - microservices. Uber, Netflix, Whatsapp, Dropbox, and many others started with one big app and then slowly broke off bits to run independently as microservices.
Each one “does one thing and one thing well” (to reuse a Unix adage), and each has its own (relatively small) database that only it can use.
When the user performs an action, for example, the event passes through a bunch of queues from one microservice to another, each microservice doing something with it, until the whole task is complete.
They make sure that each is fault-tolerant enough that it can keep doing its thing if a dependency goes down. For example, they could use old data from a cache, or return limited results.
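The fallback idea above can be sketched in a few lines of Python. This is only an illustration under assumptions: `CachedClient`, `make_flaky_fetch`, and the category names are hypothetical and do not come from any of the companies mentioned. The wrapper returns fresh data when the dependency answers, the last good value when it does not, and an empty (limited) result when it has nothing cached.

```python
import time


class CachedClient:
    """Wraps a dependency call and falls back to the last good value on failure."""

    def __init__(self, fetch):
        self._fetch = fetch   # hypothetical callable that talks to the dependency
        self._cache = {}      # key -> (value, time it was stored)

    def get(self, key):
        try:
            value = self._fetch(key)
            self._cache[key] = (value, time.time())
            return value, "fresh"
        except ConnectionError:
            if key in self._cache:
                # Dependency is down: serve old data rather than fail.
                return self._cache[key][0], "stale"
            # Nothing cached: return a limited (empty) result.
            return [], "degraded"


def make_flaky_fetch():
    """Builds a fake dependency that can be switched off for the demo."""
    state = {"up": True}

    def fetch(key):
        if not state["up"]:
            raise ConnectionError("dependency down")
        return [f"{key}-title-1", f"{key}-title-2"]

    return fetch, state


fetch, state = make_flaky_fetch()
client = CachedClient(fetch)

first = client.get("comedy")    # dependency is up
state["up"] = False             # simulate an outage
second = client.get("comedy")   # served from the cache
third = client.get("drama")     # never cached, so limited result
```

The caller can tell from the second element of the tuple whether it got fresh, stale, or degraded data, and decide what to show the user.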
That way, they do not have to scale one huge database, but several smaller ones, along with the service backends themselves.
Microservices also allow teams to be more independent, which is ideal for when you have a rapidly changing architecture - each team can focus on a small piece of the pie, and does not have to be affected if another part goes through a sudden transformation.
Of course, these microservices cannot all work in sync - that would be way too hard - so you have to depend on an effect known as eventual consistency - the system will eventually be in sync, but not immediately.
Think about the times you clicked a category on Netflix, and the spinner took longer than usual to display titles in that category. That little microservice may have been overloaded at that exact moment, or a server running it might have gone down - regardless, the UI knew to expect failure, so it kept on retrying, knowing that it’d get a result eventually.
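A client that expects failure might look roughly like the sketch below. It is an assumption about the general pattern (retry with a growing delay), not a description of Netflix's actual UI code; `retry`, `FlakyCatalog`, and the delay values are made up for the example.

```python
import time


def retry(call, attempts=5, base_delay=0.01, sleep=time.sleep):
    """Call `call()` until it succeeds, doubling the wait after each failure."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts:
                raise          # out of attempts: let the caller handle it
            sleep(delay)
            delay *= 2


class FlakyCatalog:
    """Hypothetical service that fails a fixed number of times, then recovers."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def titles(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("overloaded")
        return ["Title A", "Title B"]


catalog = FlakyCatalog(failures=2)
waits = []  # record the delays instead of actually sleeping
result = retry(catalog.titles, sleep=waits.append)
```

Here the first two calls fail and the third succeeds, so the user only sees a slightly longer spinner. Doubling the delay keeps the retries from piling more load onto a service that is already struggling.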
Think about when you stored a file in Dropbox, but it did not appear in the search results when you searched for something. Or perhaps it took a long while to find it. But later when you searched for the same term, your file appeared immediately. Again, another example of eventual consistency.
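The file-then-search scenario can be modelled as a toy, again as a sketch under assumptions rather than a description of how Dropbox works. The storage "service" and the search "service" each keep their own data, and a queue (`index_queue`, a hypothetical name) carries the news of a new file from one to the other.

```python
from collections import deque

files = {}             # the storage service's own database
search_index = {}      # the search service's own database: word -> file names
index_queue = deque()  # events waiting to be picked up by the indexer


def upload(name, text):
    """Store the file immediately and announce it; do not wait for indexing."""
    files[name] = text
    index_queue.append(name)


def search(word):
    return sorted(search_index.get(word, set()))


def run_indexer():
    """The consumer: drains the queue and updates the search index."""
    while index_queue:
        name = index_queue.popleft()
        for word in files[name].split():
            search_index.setdefault(word, set()).add(name)


upload("notes.txt", "quarterly budget")
before = search("budget")   # the indexer has not run yet
run_indexer()
after = search("budget")    # now the two services agree
```

The upload succeeds straight away, the first search finds nothing, and the second search finds the file once the indexer has caught up. Nothing was lost in between; the two databases were simply out of sync for a moment.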
So how did they do it?
By making small services, each with its own local database that only it can talk to.
By building systems to complete operations in an eventually consistent manner, rather than as an immediate transaction.
By connecting everything together using queues that can grow as long as necessary to hold any backlog while their consumers get themselves into gear and scale up to start processing the extra load.
That is how ultra-scalable backend databases are built to handle any number of users piling onto them at any moment.
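To make the queue part concrete, here is a small tick-by-tick simulation. It is a sketch with made-up numbers: `simulate`, the scale-up threshold, and the doubling rule are all assumptions for illustration, not anyone's real autoscaling policy.

```python
from collections import deque


def simulate(arrivals, per_consumer=10, scale_up_at=50, max_consumers=8):
    """Returns (backlog, consumers) after each tick of a surge."""
    queue = deque()
    consumers = 1
    history = []
    for arriving in arrivals:
        queue.extend(range(arriving))        # new events join the backlog
        if len(queue) > scale_up_at and consumers < max_consumers:
            consumers *= 2                   # scale out when the backlog is long
        # Each consumer handles a fixed number of events per tick.
        for _ in range(min(len(queue), consumers * per_consumer)):
            queue.popleft()
        history.append((len(queue), consumers))
    return history


# A quiet tick, two ticks of sudden load, then quiet again.
history = simulate([5, 100, 100, 0, 0, 0])
for backlog, consumers in history:
    print(f"backlog={backlog:4d} consumers={consumers}")
```

In this run the backlog climbs to 140 events while the consumers double from 1 to 8, then drains to zero two ticks later. No request is rejected during the surge; the queue simply holds the work until enough consumers are running.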
But ultimately, the real way to prepare for all this is to make sure you hire well. Uber and Netflix are known to hire some of the brightest stars in monitoring, automation, and incident handling. They impose very little process and allow those engineers the freedom to define how they work and how they handle challenges.
The above techniques were born from giving skilled engineers the freedom to do what they think is best - which is ultimately the best way to overcome all challenges, not just this one.