Information Retrieval Basics

 TF-IDF, BM25, cosine similarity, boolean retrieval

🔍 1. TF-IDF (Term Frequency–Inverse Document Frequency)

🧠 Think of it like this:

If you're looking for the most important words in a document, TF-IDF helps you find them.

  • TF (Term Frequency): Measures how often a word appears in a document.

  • IDF (Inverse Document Frequency): Tells you if that word is common or rare across all documents.

🧴 Real-life Example:

If you're reading 1,000 recipes and the word “salt” appears in every one, it's not a useful word for search. But if one recipe has the word “truffle” and it's rare elsewhere, that word is important!

✅ Why use it?

Because not all words are equally useful. TF-IDF multiplies the two numbers, so a word scores highest when it appears often in one document but rarely in the rest of the collection.
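Here is a minimal sketch of the recipe example in Python. It assumes one common variant of the formula (raw count for TF, log(N / df) for IDF) and a simple whitespace tokenizer; the function name `tf_idf` and the sample recipes are made up for illustration, and real libraries use slightly different smoothing.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumes raw-count TF and idf = log(N / df); other variants exist.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical mini recipe collection.
recipes = [
    "pasta salt truffle butter",
    "soup salt carrot",
    "bread salt flour",
]
scores = tf_idf(recipes)
print(scores[0])  # "salt" gets weight 0 because it is in every recipe
```

With this variant, a word that appears in every document gets an IDF of log(1) = 0, which is exactly the "salt is not useful" intuition.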

🏆 2. BM25 (Best Matching 25)

🧠 Think of it like this:

BM25 is a smarter version of TF-IDF that says:

“Okay, just because a word appears 100 times in a document doesn’t mean it’s 100 times more relevant.”

💬 Analogy:

Imagine you’re reviewing resumes. If someone writes “Python” once, that’s good. If someone writes it 50 times, you won’t think they’re 50 times better—you’ll think they’re trying too hard. BM25 handles that.

⚖️ Also considers:

  • Length of document: A short doc saying "Python" twice may be more relevant than a long article saying it 10 times.
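Both ideas (repeated words count less and less, and long documents get penalized) can be sketched in a few lines. This assumes the commonly used Okapi BM25 formula with typical default values k1 = 1.5 and b = 0.75 and a smoothed IDF; the function name `bm25_scores` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25.

    k1 controls how fast repeated terms stop adding score;
    b controls how strongly long documents are penalized.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF, always positive.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length-normalized, saturating term-frequency component.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "python python developer",                    # short, "python" twice
    " ".join(["python"] * 10 + ["filler"] * 90),  # long, "python" 10 times
    "java developer",                             # no "python" at all
]
scores = bm25_scores("python", docs)
print(scores)
```

In this toy collection the short document outranks the long one even though the long one repeats "python" five times as often.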

📐 3. Cosine Similarity

🧠 Think of it like this:

Cosine Similarity is a way to compare two things (like a query and a document) by measuring the angle between them in space.

📊 Analogy:

Imagine two arrows drawn on a graph:

  • If both arrows point in the same direction, they’re very similar (cosine = 1).

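  • If they point in opposite directions, they’re completely different (cosine = -1). This can't happen with word counts or TF-IDF weights, which are never negative, so text scores stay between 0 and 1.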

  • If they’re at a 90° angle, they’re unrelated (cosine = 0).

✅ Used For:

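Once the query and the document are both turned into vectors (lists of numbers, one per word, such as counts or TF-IDF weights), cosine similarity checks how "aligned" they are. Only direction matters, not length, so a long document isn't favored just for having more words.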
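A small sketch makes the "direction, not length" point concrete. It assumes the vectors are plain word counts stored as dictionaries; the function name `cosine_similarity` and the sample texts are just for illustration.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: value} vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)

query = Counter("cheese pizza".split())
doc_same = Counter("cheese pizza cheese pizza".split())  # same words, doubled
doc_other = Counter("truffle pasta".split())             # no shared words

print(cosine_similarity(query, doc_same))
print(cosine_similarity(query, doc_other))
```

The doubled document points in the same direction as the query, so its score is (up to floating-point rounding) 1, while the document with no shared words scores 0.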

✅ 4. Boolean Retrieval

🧠 Think of it like this:

This is the simplest form of search. You use logic like AND, OR, NOT to find what you want.

🕵️‍♂️ Example:

  • Searching for “pizza AND cheese” → returns only things that have both.

  • “pizza OR burger” → shows anything with either.

  • “pizza AND NOT pineapple” → finds pizza without pineapple!

🧰 How it works:

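It builds a checklist of which documents contain which words (often called an inverted index). Then it applies your logic (AND, OR, NOT) to filter the list. Results are not ranked: a document either matches or it doesn't.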
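The checklist idea maps neatly onto Python sets. Below is a minimal sketch, assuming whitespace tokenization and a tiny made-up menu; `build_index` is a hypothetical helper, not a library function.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical documents keyed by ID.
docs = {
    1: "pizza with cheese",
    2: "pizza with pineapple and cheese",
    3: "burger with cheese",
    4: "pizza margherita",
}
index = build_index(docs)

pizza_and_cheese = index["pizza"] & index["cheese"]            # AND = intersection
pizza_or_burger = index["pizza"] | index["burger"]             # OR = union
pizza_not_pineapple = index["pizza"] - index["pineapple"]      # AND NOT = difference

print(pizza_and_cheese, pizza_or_burger, pizza_not_pineapple)
```

AND, OR, and AND NOT become set intersection, union, and difference, which is why boolean search is fast but gives no sense of which match is "best".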


🎯 Summary with Real-Life Analogies

| Concept     | Simple Analogy                             | Key Idea                               |
|-------------|--------------------------------------------|----------------------------------------|
| TF-IDF      | Find rare ingredients in a recipe book     | Rare = important                       |
| BM25        | Resume filtering: don't overuse keywords   | Smarter keyword scoring                |
| Cosine Sim. | Are two arrows pointing the same way?      | Similarity in direction = related text |
| Boolean     | Using filters like "must have" or "not in" | Logical filtering of results           |


