The Deep Web

As we continue the tour, we move into what is known as the deep web, also called the invisible web. This portion is significantly larger than the surface web (estimated to be 400-550 times its size), making up roughly 90-96% of the web. There is a common misconception that content found on the deep web is likely to be illegal or used for unlawful purposes; however, this is mostly not the case.

The concept of the deep web simply refers to any part of the World Wide Web that is not indexed by traditional or common search engines (such as Google or Bing). The documents and sites of the deep web therefore cannot be crawled by those search engines. Here are some reasons why a webpage or site wouldn't be crawled (Hawkins, 2016):

  • The webpage is password protected
  • It can only be accessed a certain number of times, and that threshold is met before a crawler reaches the page
  • It is hidden, or not linked to from any other page
  • The site's robots.txt file (a plain-text file at the site's root) explicitly tells crawlers not to index it
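That last rule can be seen in action with Python's standard library: `urllib.robotparser` is what a well-behaved crawler can use to check a site's robots.txt rules before fetching a page. The domain and paths below are illustrative only, not taken from any real site.

```python
from urllib import robotparser

# A robots.txt like the following, served at the site root,
# tells all crawlers (User-agent: *) to stay out of /private/:
#
#   User-agent: *
#   Disallow: /private/

rp = robotparser.RobotFileParser()
# Normally a crawler would do: rp.set_url("https://example.com/robots.txt"); rp.read()
# Here we parse the rules directly to keep the sketch self-contained.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/public/page.html"))   # True
print(rp.can_fetch("*", "https://example.com/private/page.html"))  # False
```

A crawler that honors this file never sees anything under /private/, so those pages stay off the surface web even though they are publicly served.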

According to a study based on data collected by BrightPlanet in 2000, the deep web contained roughly 7,500 terabytes (TB) of information, compared to 19 TB on the surface web (Bergman, 2001).
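As a quick sanity check, those two figures can be divided to recover the size ratio cited earlier; the result lands near the low end of the 400-550× range from the same study.

```python
# Rough sanity check of the BrightPlanet figures (Bergman, 2001)
deep_web_tb = 7_500   # estimated deep web size, in terabytes
surface_web_tb = 19   # estimated surface web size, in terabytes

ratio = deep_web_tb / surface_web_tb
print(round(ratio))  # 395 -- near the low end of the 400-550x estimate
```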

Even with statistics like these, there are limitations to determining the full scale of the deep web. It is practically impossible to estimate the exact size of the deep web due to the anonymity many of its parts (such as the dark web) afford, among other factors (Finklea, 2017).

While a small portion of the deep web known as the dark web is home to illegal activities (more on this later), the majority of it is actually safe, necessary, and can be useful for different purposes.

The deep web includes content such as academic databases, private internal networks of corporations, universities, or governmental agencies (intranets), financial records and accounts of people using online banking services, medical records, and so much more (Sui, Caverlee, and Rudesill, 2015).

The deep web is home to many databases, ranging from free and open to the public, to commercial pay-to-use services, to entirely private systems.

Examples of public online databases using the deep web:

  • U.S. Library of Congress (https://www.loc.gov/)
  • PubMed (https://pubmed.ncbi.nlm.nih.gov/) - the National Library of Medicine's database of biomedical literature
  • JSTOR (https://www.jstor.org/) - scholarly journals, primary sources, and books
  • Wayback Machine (https://archive.org/web/) - an archive of past versions of webpages

The deep web might seem separate from the surface, but many people have likely used it without realizing it, today more than ever. Especially with the rise of Web 2.0 (the shift toward user-generated content and online participatory culture through platforms such as social media) and smartphones, a great deal of new information is stored on the web within these various networks and is not accessible through standard search engines (Sui, Caverlee, and Rudesill, 2015). The deep web has expanded as a result of these new technologies, platforms, and services. This includes services such as PayPal and even online cloud storage like Dropbox, where the user's information and data are private and therefore reside on the deep web.

Even though some of these websites can be found through a Google search, parts of them rely on the deep web to store information securely. Any password-protected website where information is only available to users with access (through something like an account, membership, or subscription) is technically part of the deep web.

Some deep web content, such as a database of academic journals, can be reached by navigating from a standard search engine to a more specialized one. However, some of the deepest hidden regions and content can only be accessed by using a different browser, such as Tor.

As we keep moving deeper through the web, we now find a small but infamously known portion of the deep web…
