Well, Google just released some astounding figures about the size of the Web. According to a post on the Official Google Blog, the Google index contained 26 million pages in 1998. By 2000, the index reached the one billion mark, and just recently, Google hit a new milestone: 1 trillion unique URLs on the web at once!
Yes, 1 trillion unique URLs. That is 1,000,000,000,000 URLs. That is a lot of URLs.
What I find even more fascinating is that Google's servers, or at least the capacity of those servers, are large enough to hold 1 trillion unique URL records and more. And it goes without saying that they not only index most of those URLs but also the content found at the URLs being indexed – the page content, metadata, images, files, etc.
Another interesting fact is that not even Google knows the exact size of the Web (gasp)! The 1 trillion unique URLs are just the records that Google has discovered and is aware of. There's more out there.
In the post, Google software engineers Jesse Alpert & Nissan Hajaj mention that Google does not index every one of those trillion pages, as many of them are similar to each other or represent auto-generated content, but that they are extremely proud to have the most comprehensive index of any search engine.
Google obviously wants to keep up with the growing volume of information, and thus downloads the web continuously, collecting updated page information and re-processing the entire web-link graph several times per day.
In doing so, Google's distributed infrastructure allows applications to efficiently traverse what the engineers call a "link graph" – one with many trillions of connections – or quickly sort petabytes of data, all so that you get the answer to the most important question: your next Google search.
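To get a feel for what a "link graph" is and why the trillion count is of *unique* URLs, here's a toy sketch in Python. The page names and links are entirely made up for illustration, and a breadth-first traversal over a small dictionary is obviously nothing like Google's distributed pipeline – it just shows the basic idea of discovering and de-duplicating URLs by following links.

```python
from collections import deque

# A toy link graph: each URL maps to the URLs it links to.
# These addresses are invented purely for illustration.
LINKS = {
    "a.example/":      ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/":      ["c.example/", "a.example/"],
    "c.example/":      [],
}

def discover_urls(start):
    """Breadth-first traversal that collects every unique URL
    reachable from `start`, skipping links already seen."""
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for target in LINKS.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

print(len(discover_urls("a.example/")))  # 4 unique URLs in this toy graph
```

The `seen` set is what makes the count a count of *unique* URLs: "a.example/" is linked to from two different pages here but is only recorded once.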