• Internet - the physical network and underlying communication infrastructure that connects computers/networks
  • World Wide Web - an information system that enables content sharing over the Internet
  • Internet of Things - the network of physical devices that collect and exchange data over the Internet, enabling remote monitoring, control, and automation in daily life and industry

Search Engines

  • Search Engines - make information on the web easier to find
  • Search engines work in three stages: crawling, indexing, and ranking
  • When a term is searched, the search engine doesn’t search the entire Internet
  • Instead, it searches through its index of sites (a toy lookup is sketched after this list)
  • Searching the web in real time would be impossible: contacting every server for every query is beyond any realistic computing and networking capacity, and websites are constantly being created and deleted
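
A toy illustration of the difference (the terms and URLs are hypothetical): with an index, each query becomes a dictionary lookup rather than a live scan of every page on the web.

```python
# Toy sketch: a prebuilt index makes search a dictionary lookup instead of
# a live scan of every page. The terms and URLs here are hypothetical.
index = {
    "crawler": {"https://example.com/a"},
    "indexing": {"https://example.com/a", "https://example.com/b"},
}

def search(term):
    return index.get(term, set())  # one lookup; no page is fetched

print(search("indexing"))  # both pages that mention the term
```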

Crawling

  • Search engines use programs known as crawlers to traverse the World Wide Web, indexing the pages, content, and metadata they find and mapping links between pages by following internal and external links
  • Crawlers begin at seed URLs (pages with many outgoing links) and follow links outward to reach every page they can, gradually building a map of the web (a minimal crawler sketch follows this list)
  • The crawler extracts important information such as keywords, page titles, headings, links, and metadata
  • For each word in a document, the crawler adds an entry for the page under that word in the index, along with the word’s position on the page
  • In doing so, they continuously add to and update the index
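
A minimal sketch of such a crawler, using only Python’s standard library. The seed URL is a placeholder; real crawlers add politeness delays, robots.txt checks (see Indexing below), deduplication at scale, and far more robust parsing.

```python
# A minimal breadth-first crawler sketch (standard library only).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    frontier = deque([seed])   # URLs waiting to be visited
    visited = set()            # URLs already fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except (OSError, ValueError):
            continue           # unreachable page or malformed URL: skip it
        visited.add(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            frontier.append(urljoin(url, link))  # resolve relative links
    return visited

# Hypothetical seed; any page with outgoing links would do.
print(crawl("https://example.com"))
```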

Indexing

  • Indexing - the process of a search engine collecting, sorting, and storing data in its index
  • The index is where all the data gathered by the crawler is stored
  • Each word in a document gets an entry in the index recording the page it appears on and the word’s position on the page (an index sketch follows this list)
  • Crawlers follow rules and guidelines established by website owners through mechanisms such as the robots.txt file (a robots.txt check is sketched after this list)
    • These guidelines tell crawlers which areas of a website to explore or avoid, respecting the owner’s preferences and privacy
  • The index allows for quick retrieval and ranking of relevant web pages in response to user queries
  • The results shown for any search are drawn from the search engine’s index
  • Searching the index is very fast, but the index must be constantly updated to ensure that:
    • New sites are added
    • Old sites are removed
    • Broken links are updated
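
A minimal sketch of the positional inverted index described above; the documents are hypothetical stand-ins for crawled pages.

```python
# A positional inverted index: each word maps to the pages it appears on
# and its positions within each page. Documents here are hypothetical.
from collections import defaultdict

documents = {
    "https://example.com/a": "search engines index the web",
    "https://example.com/b": "crawlers index pages for search",
}

# term -> {url: [positions of the term on that page]}
index = defaultdict(lambda: defaultdict(list))
for url, text in documents.items():
    for position, word in enumerate(text.lower().split()):
        index[word][url].append(position)

# Retrieval is a dictionary lookup, not a scan of every document.
print(dict(index["index"]))
# {'https://example.com/a': [2], 'https://example.com/b': [1]}
```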

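The robots.txt check itself can be done with Python’s standard library; the crawler name and URLs below are placeholders.

```python
# Checking robots.txt before fetching a page (standard library only).
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()  # fetch and parse the site's robots.txt

if robots.can_fetch("ExampleCrawler", "https://example.com/private/page"):
    print("allowed to crawl this page")
else:
    print("disallowed by robots.txt: skip it")
```
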
Search Engine Process

  1. The crawler crawls web pages by following internal and external links to discover new content and updates to existing sites
  2. The content of the sites is analysed to understand the context and relevance of the information
    • The HTML structure is examined to extract keywords, metadata, headings, and links to build a comprehensive map of the page
  3. Entries for sites are added to the index, creating an organised database that maps specific terms to their locations and frequency within documents
  4. When a user searches, the query is passed to the ranking algorithm to determine the best matches from the index by assessing factors such as relevance and site authority
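
A toy sketch of step 4, assuming a positional index like the one in the Indexing section: pages are scored by how often the query terms appear. Real ranking algorithms combine many more signals, such as site authority and freshness.

```python
# Toy ranking: score pages by total occurrences of the query terms.
# Hypothetical index: term -> {url: [positions of the term on that page]}
index = {
    "search": {"https://example.com/a": [0, 14], "https://example.com/b": [3]},
    "engine": {"https://example.com/a": [1]},
}

def rank(query):
    scores = {}
    for term in query.lower().split():
        for url, positions in index.get(term, {}).items():
            scores[url] = scores.get(url, 0) + len(positions)  # crude relevance
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank("search engine"))
# [('https://example.com/a', 3), ('https://example.com/b', 1)]
```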

Benefits Of Search Engine Crawling And Indexing

  • Improved search results - indexing provides users with relevant matches, increasing the chances of finding valuable information on the first page
  • Efficient retrieval - search engines produce results quickly by searching indexed data rather than scanning the entire web for every query
  • Ranking and relevance - indexing allows algorithms to assess the quality and relevance of pages using factors like keywords and engagement
  • Freshness and updates - periodic re-crawling ensures that search results reflect the latest content available on the Internet