• Internet - the physical network and underlying communication infrastructure that connects computers/networks
  • World Wide Web - an information system that enables content sharing over the Internet
  • Internet of Things - the network of physical devices that collect and exchange data over the Internet, enabling remote monitoring, control, and automation in daily life and industry

Search Engines

  • Search Engines - make information on the web easier to find
  • Search engines work in three stages: crawling, indexing, and ranking
  • When a term is searched, the search engine doesn’t search the entire Internet
  • Instead, it searches through its index of sites (a toy lookup is sketched after this list)
  • Searching the web in real time would be impossible: contacting every server for every query is beyond any realistic computing and networking capacity, and websites are constantly being created and deleted
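
A toy illustration of the difference (the terms and URLs are hypothetical): with an index, each query becomes a dictionary lookup rather than a live scan of every page on the web.

```python
# Toy sketch: a prebuilt index makes search a dictionary lookup instead of
# a live scan of every page. The terms and URLs here are hypothetical.
index = {
    "crawler": {"https://example.com/a"},
    "indexing": {"https://example.com/a", "https://example.com/b"},
}

def search(term):
    return index.get(term, set())  # one lookup; no page is fetched

print(search("indexing"))  # both pages that mention the term
```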

Crawling

  • Search engines use programs known as crawlers to traverse the World Wide Web, indexing the pages, content, and metadata they find and mapping links between pages by following internal and external links
  • Crawlers begin at seed URLs (pages with many outgoing links) and follow links outward to reach every page they can, gradually building a map of the web (a minimal crawler sketch follows this list)
  • The crawler extracts important information such as keywords, page titles, headings, links, and metadata
  • For each word in a document, the crawler adds an entry for the page under that word in the index, along with the word’s position on the page
  • In doing so, they continuously add to and update the index
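
A minimal sketch of such a crawler, using only Python’s standard library. The seed URL is a placeholder; real crawlers add politeness delays, robots.txt checks (see Indexing below), deduplication at scale, and far more robust parsing.

```python
# A minimal breadth-first crawler sketch (standard library only).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    frontier = deque([seed])   # URLs waiting to be visited
    visited = set()            # URLs already fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except (OSError, ValueError):
            continue           # unreachable page or malformed URL: skip it
        visited.add(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            frontier.append(urljoin(url, link))  # resolve relative links
    return visited

# Hypothetical seed; any page with outgoing links would do.
print(crawl("https://example.com"))
```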

Indexing

  • Indexing - the process of a search engine collecting, sorting, and storing data in its index
  • The index is where all the data gathered by the crawler is stored
  • Each word in a document gets an entry in the index recording the page it appears on and the word’s position on the page (an index sketch follows this list)
  • Crawlers follow rules and guidelines established by website owners through mechanisms such as the robots.txt file (a robots.txt check is sketched after this list)
    • These guidelines tell crawlers which areas of a website to explore or avoid, respecting the owner’s preferences and privacy
  • The index allows for quick retrieval and ranking of relevant web pages in response to user queries
  • The results shown for any search are drawn from the search engine’s index
  • Searching the index is very fast, but the index must be constantly updated to ensure that:
    • New sites are added
    • Old sites are removed
    • Broken links are updated
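
A minimal sketch of the positional inverted index described above; the documents are hypothetical stand-ins for crawled pages.

```python
# A positional inverted index: each word maps to the pages it appears on
# and its positions within each page. Documents here are hypothetical.
from collections import defaultdict

documents = {
    "https://example.com/a": "search engines index the web",
    "https://example.com/b": "crawlers index pages for search",
}

# term -> {url: [positions of the term on that page]}
index = defaultdict(lambda: defaultdict(list))
for url, text in documents.items():
    for position, word in enumerate(text.lower().split()):
        index[word][url].append(position)

# Retrieval is a dictionary lookup, not a scan of every document.
print(dict(index["index"]))
# {'https://example.com/a': [2], 'https://example.com/b': [1]}
```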

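The robots.txt check itself can be done with Python’s standard library; the crawler name and URLs below are placeholders.

```python
# Checking robots.txt before fetching a page (standard library only).
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()  # fetch and parse the site's robots.txt

if robots.can_fetch("ExampleCrawler", "https://example.com/private/page"):
    print("allowed to crawl this page")
else:
    print("disallowed by robots.txt: skip it")
```
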
Search Engine Process

  1. The crawler crawls web pages by following internal and external links to discover new content and updates to existing sites
  2. The content of the sites is analysed to understand the context and relevance of the information
    • The HTML structure is examined to extract keywords, metadata, headings, and links to build a comprehensive map of the page
  3. Entries for sites are added to the index, creating an organised database that maps specific terms to their locations and frequency within documents
  4. When a user searches, the query is passed to the ranking algorithm to determine the best matches from the index by assessing factors such as relevance and site authority
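
A toy sketch of step 4, assuming a positional index like the one in the Indexing section: pages are scored by how often the query terms appear. Real ranking algorithms combine many more signals, such as site authority and freshness.

```python
# Toy ranking: score pages by total occurrences of the query terms.
# Hypothetical index: term -> {url: [positions of the term on that page]}
index = {
    "search": {"https://example.com/a": [0, 14], "https://example.com/b": [3]},
    "engine": {"https://example.com/a": [1]},
}

def rank(query):
    scores = {}
    for term in query.lower().split():
        for url, positions in index.get(term, {}).items():
            scores[url] = scores.get(url, 0) + len(positions)  # crude relevance
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank("search engine"))
# [('https://example.com/a', 3), ('https://example.com/b', 1)]
```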

Benefits Of Search Engine Crawling And Indexing

  • Improved search results - indexing provides users with relevant matches, increasing the chances of finding valuable information on the first page
  • Efficient retrieval - search engines produce results quickly by searching indexed data rather than scanning the entire web for every query
  • Ranking and relevance - indexing allows algorithms to assess the quality and relevance of pages using factors like keywords and engagement
  • Freshness and updates - periodic re-crawling ensures that search results reflect the latest content available on the Internet