A quick note: I was reading Techcrunch earlier today, when I realized that I could not reconcile their post “Twitter at scale: Will it work?” with my views on how to build a scalable application.
Nick Cubrilovic contends that “Every new Twitter user and every new connection results in an exponentially greater computational requirement.”
And yet, I fail to see the exponential quality of it all.
It looks like Nick is saying: “I have U users, posting P posts, read by F followers. Hence, if I were to draw this on paper, I would end up with an exponential slope.”
That’s odd because, as I understand Twitter’s architecture, we indeed have U people posting P posts – BTW, P is unknown, and as Twitter goes I’d wager that the f(P) curve would be logarithmic; but I digress. Let’s, for simplicity’s sake, consider the total number of posts and call it X.
Now, Nick would obviously be referring to a f(F) curve. If F followers have to monitor X posts, then yes, I expect the slope to be exponential.
But that’s not how it works: Twitter is a pull service; each Twitter client regularly asks the server(s): “Do you have anything for me?” The server replies: “No” or “Yes, these x posts.” “x”, not “X” because only relevant posts are retrieved.
Since “F” is bound to be much bigger than “x”, and the overheard of retrieving multiple posts is very small compared to the time elapsed between two polls, it seems to me that the formula we should use here is that of a linear approximation.
Just my 2 cents. My math is *very* rusty but it seems to me that Nick’s argument doesn’t hold water.
If you enjoyed this post, make sure you subscribe to my RSS feed!






The flaw in most of these arguments is that people assume that a new user on Twitter will get a lot of followers. If each new user is a Mike Arrington or Robert Scoble that both start following lots of people AND that a large percentage of their existing userbase will follow, then each user adds exponentially to their load.
The reason that scenario results in exponential load is the distribution issue: You really don’t want a single database – you’ll want to partition.
But not even Mike or Scoble have more than a tiny fraction of Twitters users following them, and most people will get only a small number of followers, just like in the real world most people don’t talk to everyone else.
The “cost” of a very well connected user will be high, but the average user cost will be fairly trivial.
Frankly, Twitter is like e-mail:
- Most users send and receive a small volume of mail from a small number of people.
- Some people send vast amounts of e-mail (mailing lists, spammers, etc.) to a large number of people, and a large number of people reply to them individually, and may even copy lots of people.
Scaling that is “easy” for the most part: Make copies. Treat each users in- and out-going as “mailboxes” and spread them out to multiple boxes. DNS, LDAP or any number of directory mechanisms can be used for the partitioning, or you can simply hash the userid’s.
Yes, that is more expensive in theory than building one massive database, because you’ll store a ton of copies of messages from popular people. But in reality you’re fighting the problem of disk IO, and making copies makes using multiple small server with inexpensive disk subsystems feasible, rather than investing in ridiculously overpriced fridge-sized storage systems.
There are a potential for a lot of expensive functions in Twitter though: Letting users “stitch” together streams of arbitrary results. However that can be solved simply by indexing each partition (or shard or whatever you want to call it), and merging results – that’s a pretty basic approach to scaling search engines.
Scaling a service like this is a solved problem, no matter how much people try to argue otherwise.
Vidar, thanks for your well-reasoned comment.