Real Search Engines – Silicon Brighton Style

 

Mojeek is one of seven real web search engines and the only one outside of USA, Russia, and China. It’s a company founded in Sussex by Marc Smith and, whilst fully remote, has a predominantly Sussex based team. As the only search engine that does not track, Mojeek offers unique and important benefits to its global users.

Search with surveillance was born in Silicon Valley. Search without surveillance was born locally. We asked CEO, Colin Hayhurst to tell us about Mojeek and how search really works. 


Search engines are our main navigators to useful information on the Web. We use them to conveniently hop to a known destination. We use them to discover and learn. We check our assumptions and pose queries. And yet, very few people understand the work that goes into building a large-scale general search engine. One of those few people is lifelong Sussex resident Marc Smith. Marc a developer and the Mojeek founder, has built from scratch a full search technology stack, since starting it as a hobby project in 2004.

Numerous search engines have come and gone since 2004. Today, the only general and international web search engines are Baidu, Bing, Gigablast, Google, Mojeek, Sogou and Yandex. Yes, only seven. Seven what we call real search engines.

Later we review the main components that make up a real search engine. This will help your understanding of what goes on when you use Google or something else. First, some important context.


Image 1 Countries.png


Search Realities

The biggest challenge for a large-scale search engine is building a large, and efficiently readable, index of the Web; one that covers billions of pages. Because of this challenge a lot more companies (at least 40) opt to provide search or metasearch services. Examples of these such as DuckDuckGo, Ecosia, and Qwant get their search results and ads from Bing (Microsoft). A few others, such as Startpage, get the same from Google.

Without an index of their own, search companies are dependent on real search engines. So to understand the way in which different search services guide your journeys across the Web, you really should understand the index they are using. And importantly the companies they support, depend upon and are constrained by. At Mojeek, we created searchenginemap.com so you can see these dependencies.

At Mojeek our own technology stack means we have no such dependencies. We only use the services of an independent data centre. As we scale up, we are keen to minimise our carbon footprint so we use the Custodian data centre, in Kent, which is the UK’s greenest provider.


Image 2 SearchEngineMap.png

So what makes up a real search engine? In practice are three main parts explained in the following sections; a crawler, an index, and ranking methodologies.


Crawling

Crawling is the process by which Internet bots (known in search as a spiders) systematically browse the Web and discover the content needed to build a searchable index. As spiders crawl webpages, they build up lists of URLs (Uniform Resource Locators) found on visited pages. These lists of URLs serve in turn to direct the spiders to further indexable pages.

On each page, spiders take only relevant pieces of data, such as the page content and links, putting the resulting bundle into a queue for detailed indexing. Crawling is, in practice, a far from a simple task. Even so, indexing requires around 100 times the computing resources of that for crawling.


Image 3 Crawl-Index-Rank.jpg

 

Indexing

Indexing takes data gathered by the crawler and puts it into compact data structures. These processes, complex in practice, involve things such as parsing, sorting and compressing the contents of each page. Once a page is in the search index, it becomes a candidate to be displayed in response to relevant search queries. The indexing process and structure has to be designed well so that a search engine can efficiently serve results. Whilst a search index of one million pages is relatively straightforward, and will suffice for most internal site search functions, an index of billions of Web pages is a completely different scale of challenge.

To make matters more difficult, whilst the index for a book is static, a Web search index needs to be constantly updated as websites change. Crawlers need to keep returning to pages previously crawled, in order to update the index.

When someone enters and submits a search phrase, the search engine will find pages that include that search phrase or something similar. In order for this process to result in a set of useful results, the search engine must also undertake another process to sort these results; this process is known as ranking.


joshua-golde-qIu77BsFdds-unsplash.jpg

Ranking

A search engine uses a wide variety of ranking factors when deciding which results to display. Each result will typically show a title, a short content highlight (“snippet”), sometimes an image, and a hyperlink to the page.

An important distinction should be made here between non-tracking search engines/services and those which undertake large levels of data collection. Mojeek is the only real search engine which does not engage in tracking. DuckDuckGo (and others) also offer some privacy and limited tracking, but all of them send your search queries to, and get results from, Microsoft (or Google and sometimes Yandex). Importantly when search engines engage in tracking, ranking will depend upon a large number of different “personalisation” factors and algorithms.

Pages found from the index, that match with a search query, will appear on the search page in order of relevance. Relevance is determined by the search engine’s ranking factors and algorithms; an explanation of which would require a much longer article. SEO (Search Engine Optimization) is about influencing the way in which search engines rank webpages. SEO is, of course, almost entirely focussed on Google ranking, since it dominates the search market.

When tracking is not used, the ranking of results will be obtained using objective factors. Mojeek is arguably unique in this respect. Google and Bing rank results based on data that it has collected or inferred about you. These inferences can come from cookies, aggregated data and machine learning (AI) algorithms.  Tracking, data collection and ranking practices by Big Tech has led to the prevalence of what is now recognised as surveillance capitalism.


How did we get here?

Few realise that many of the emerging and so-called search engines are actually search services. Sadly real search engines have become rarer as the Web has grown bigger and older. Still, it takes significant resources, innovation and resourcefulness to build and sustain a search engine business. At Mojeek we have existed as a business since 2009, slowly grown and now are focussed on developing our business model. We are committed to our no tracking principles.

Would we tolerate a society where only two TV News Channels existed, say CNN and Fox News? Obviously not. And yet, we have effectively just two English-language search indexes; Bing and Google, owned by two US behemoths.


Who doesn’t know that Google started in a garage in Silicon Valley? Closer to home we have Mojeek, started in a Sussex bedroom. Who needs search Silicon Valley style when you can have it Silicon Brighton style?