🔍 I. Fundamentals of Search Engines
Topic | Description |
Information Retrieval Basics | TF-IDF, BM25, cosine similarity, boolean retrieval |
Inverted Index | Core data structure mapping terms to document lists |
Ranking Models | Lexical models (TF-IDF, BM25), Learning-to-Rank |
Tokenization & Text Normalization | Stop-word removal, stemming, lemmatization, case-folding |
Relevance Evaluation | Precision, recall, F1-score, MAP, NDCG |
Document Scoring | How documents are ranked given a query |
Query Types | Keyword, phrase, wildcard, fuzzy, semantic |
🛠️ II. Core Components of a Search System
Component | Subtopics |
1. Crawler (Web Crawler / Spider) | URL frontier, politeness, deduplication, distributed crawling |
2. Document Processor | Parsing, content extraction, metadata tagging |
3. Indexer | Building and updating inverted index, token position info |
4. Query Processor | Parsing query, type detection (e.g., navigational vs. informational), query rewriting |
5. Ranker | Heuristics + ML models, freshness, location, personalization |
6. Storage Layer | Sharded document stores, Lucene segments, compressed indexes |
⚙️ III. System Design Concepts
Topic | Description |
Index Sharding | How to split index by term or document |
Index Replication | Redundancy, fault tolerance, read scalability |
Index Merging | Segment merging strategies (Lucene) |
Near-Real-Time Search | Soft commit vs. hard commit |
Write Path | Bulk ingestion, streaming ingestion, delta updates |
Read Path | Query routing, fan-out query, top-K merging |
Search Result Caching | Query result cache, term-level cache, doc cache |
Search-as-you-type | Prefix indexing, edge n-grams, trie-based search |
🌐 IV. Scalability & Distributed Architecture
Topic | Description |
Distributed Indexing | Partitioning, leader election, document routing |
Fault Tolerance | Failover nodes, quorum reads/writes |
Eventual Consistency | Especially with asynchronous index updates |
Geo-Distributed Search | Latency-aware routing, region-based sharding |
CAP Theorem in Search | Trade-offs: availability vs. consistency |
🤖 V. Advanced Search Features
Topic | Description |
Learning to Rank (LTR) | ML models to re-rank results (LambdaMART, XGBoost) |
Semantic Search | Embeddings, vector-based search (ANN, HNSW) |
Personalized Search | User behavior profiling, collaborative filtering |
Federated Search | Querying multiple data sources, merging results |
Faceted Search | Filtering and aggregating by categories |
Spell Correction / Query Suggestions | Edit distance, n-gram models, click-based learning |
Autocomplete / Typeahead | Trie structures, popularity-based ranking |
📊 VI. Monitoring, Metrics & Observability
Area | Metrics |
Query Performance | Latency percentiles, QPS, cache hit ratio |
Index Health | Merge frequency, segment size, index bloat |
Cluster Metrics | Disk usage, memory usage, node availability |
Search Quality | CTR (Click-through rate), bounce rate, dwell time |
☁️ VII. Tools & Frameworks
Tool | Usage |
Lucene | Core search library |
Solr | Enterprise search built on Lucene |
Elasticsearch | Scalable RESTful search engine |
Vespa.ai | ML-powered large-scale serving and search |
Whoosh | Pure Python search engine for small projects |
🧪 Interview-Specific System Design Topics
Design Challenge | Areas to Explore |
Design a Google-like search engine | Crawl -> Index -> Rank -> Serve |
Design internal product search (e.g., Amazon) | Faceted search, freshness, business ranking |
Design autocomplete / suggestions | Trie, caching, popularity scoring |
Design semantic search system | Vector index, embedding generation, ANN |
Design scalable log search (e.g., Splunk) | Time-partitioned indexes, compression, query filtering |