Tuesday, January 11, 2005

Content Networking by the Numbers

Google has indexed, as of the moment I write this, 8,058,044,651 pages. Google also indexes about 1 billion images and about 1 billion netnews postings. In a masterful piece of S-1 filing divination, Tristan Louis estimated the size of Google’s computer infrastructure: 719 racks; 63,272 machines; 126,544 CPUs; 253,088 GHz of processing power; 126,544 GB of RAM; 5,062 TB of hard drive space. You can find his research on the topic here: http://www.tnl.net/blog/entry/How_many_Google_machines.
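His arithmetic is easy to reproduce. The per-machine assumptions below – 88 machines per rack, two 2 GHz CPUs, 2 GB of RAM, and an 80 GB disk per machine – are my reading of how his totals multiply out, not figures I can confirm:

    # Back-of-envelope reconstruction of Tristan Louis's estimate.
    # Per-machine assumptions are inferred from his totals, not confirmed.
    racks = 719
    machines = racks * 88            # assumed rack density -> 63,272
    cpus = machines * 2              # dual-CPU boxes -> 126,544
    ghz = cpus * 2                   # 2 GHz per CPU -> 253,088 GHz
    ram_gb = machines * 2            # 2 GB per machine -> 126,544 GB
    disk_tb = machines * 80 / 1000   # 80 GB disk each -> ~5,062 TB
    print(f"{machines:,} machines, {cpus:,} CPUs, {ghz:,} GHz")
    print(f"{ram_gb:,} GB RAM, {disk_tb:,.0f} TB disk")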

Pretty impressive. And Google’s users evidently feel it finds what they are looking for. So Google is not only large; it successfully surfaces what most users want. Google has made meta-search an anachronism, and few now have the resources to catch it.

Any flaws? Some people complain that Google can be gamed. Others complain that Google censors. But the real issue is that Google indexes only a tiny fraction of the Web. This analysis - http://www.brightplanet.com/technology/deepweb.asp - claims the fraction is somewhere between 1/120th and 1/620th.
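To see what that fraction implies in absolute terms, scale the index figure above by BrightPlanet’s multipliers (a rough extrapolation, nothing more):

    # What 1/120th to 1/620th implies about total Web size.
    indexed = 8_058_044_651          # Google's page count, per the figure above
    low, high = indexed * 120, indexed * 620
    print(f"{low:,} to {high:,} pages")   # ~967 billion to ~5 trillion

That puts the full Web somewhere between roughly a trillion and five trillion pages.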

An interesting aside in BrightPlanet’s analysis is that original deep Web content now outstrips printed content. Yes, the Web is now bigger than the printed word.

BrightPlanet’s analysis excludes images from its size measurement of the un-spidered “deep Web.” So where do we look to get a handle on the size of multimedia content on the Internet? In this case, we can look to measures of Internet traffic – specifically P2P traffic.

The numbers are stunning: CacheLogic, a maker of traffic management and network intelligence (deep packet inspection) gear, found that P2P traffic accounts for 55% to 80% of the bits traversing the Internet (http://www.cachelogic.com/research/slide12.php). The Web – 120 to 620 times larger than what Google has indexed – accounts for only 5% to 20% of Internet traffic. And a single movie or TV show can significantly drive traffic levels in an ISP’s network.
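That last claim is easy to sanity-check. All of the figures below – file size, subscriber base, and uptake rate – are illustrative assumptions of mine, not CacheLogic data:

    # Illustrative only: one popular 700 MB file on a hypothetical ISP.
    file_bytes = 700e6               # a typical compressed movie circa 2005
    subscribers = 1_000_000
    uptake = 0.02                    # assume 2% of subscribers grab it in a day
    moved = file_bytes * subscribers * uptake      # -> 14 TB in one day
    gbps = moved * 8 / 86_400 / 1e9                # averaged over 24 hours
    print(f"{moved / 1e12:.0f} TB moved, ~{gbps:.1f} Gbps sustained")

One file, two percent of a million subscribers, and the ISP carries an extra 1.3 Gbps all day long – serious load by 2005 standards. Boggling, but what does it all mean? Here we can turn to the Eight Fallacies.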

As a bit of history, the Eight Fallacies were formulated by Bill Joy, Tom Lyon, Peter Deutsch, and James Gosling – some of the brightest of Sun’s luminaries – as Sun was formulating its approach to networked computing. If you don’t pay attention to the numbers, you can fall into one or more of the Eight Fallacies (a code sketch after the list shows what heeding the first two looks like):

  • The network is reliable
  • Latency is zero
  • Bandwidth is infinite
  • The network is secure
  • There's a single administrator
  • The topology won't change
  • Transport cost is zero
  • The network is homogeneous
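Here is a minimal sketch of a content fetch that takes the first two fallacies seriously – it assumes the network is neither reliable nor fast. The URL, retry count, and backoff policy are all placeholders; a real content network would add jitter, partial-transfer resume, and mirror selection:

    import time
    import urllib.request

    def fetch_with_retries(url: str, attempts: int = 3, timeout_s: float = 5.0) -> bytes:
        last_error = None
        for attempt in range(attempts):
            try:
                # "Latency is zero" is false: never block forever.
                with urllib.request.urlopen(url, timeout=timeout_s) as response:
                    return response.read()
            except OSError as error:
                # "The network is reliable" is false: retry with backoff.
                last_error = error
                time.sleep(2 ** attempt)
        raise last_error

    # Hypothetical usage; the URL is a placeholder.
    # data = fetch_with_retries("http://example.com/content")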

It could take a book chapter to fully elucidate each of the Eight Fallacies in the context of content networking. But the main point is this: content is now so big that the Internet will be designed around moving it. The Web is just the literate scum floating on top of an Internet that is rapidly evolving toward the post-literate masses. Thought nobody would ever need those monster petabit routers? Think again. Asymmetrical last mile OK? Maybe not. Many other apple carts will be overturned.

What category of application will drive the next big shift in Internet traffic? Content networking will probably be a big part of that next killer app.