Word Embeddings Explained: The Math Behind AI, LLMs, and Chatbots

    NLP Explainer · AI Series 2026
  

How Machines Understand Language

A guide to word embeddings — where meaning becomes mathematics, and vectors do the talking.

When a search engine retrieves a document about automobiles in response to a query about cars, it is not matching text character by character. Somewhere beneath the interface, the system understands that these two words are semantically related. The mechanism behind that understanding is the word embedding — and once you see the geometry, you cannot unsee it.

This article walks through the key mathematical operations that make embeddings work: distance, similarity, arithmetic, scaling, and the dot product. Each concept is illustrated with concrete numerical vectors so the math is visible, not just described. Real embeddings typically use hundreds of dimensions; the hand-picked 3- and 4-dimensional examples here keep the same geometric structure while staying readable on a page.

1 · What is a Word Embedding?

A word embedding is a representation of a word as a vector — an ordered list of numbers — in a high-dimensional space. A typical embedding model might use 300 dimensions, so the word cat becomes a point with 300 coordinates. That sounds abstract, but the key insight is this: the position of that point encodes meaning.

This is what researchers call a semantic space. Words with related meanings end up positioned close to each other. King and Queen live near each other. Paris and London live near each other. Bicycle and democracy live far apart. The model learns these positions not from human-curated rules, but from the statistical patterns of how words appear together in enormous text corpora.

EXAMPLE: 4-DIMENSIONAL VECTORS (simplified from real 300-dim embeddings)
vec(“King”)   = [ 0.9,  0.7,  0.4,  +0.6 ]
vec(“Queen”)  = [ 0.9,  0.7,  0.4,  -0.6 ]
vec(“Man”)    = [ 0.5,  0.3,  0.1,  +0.8 ]
vec(“Woman”)  = [ 0.5,  0.3,  0.1,  -0.8 ]

The first three dimensions encode royalty, authority, and age.
The fourth dimension encodes gender: positive = masculine, negative = feminine.

Think of it as a map where the geography is meaning. Every word is a pin, and the distances between pins reflect semantic relationships rather than physical ones.

2 · The Geometry of Meaning: Distance and Similarity

Once words are points in space, we need a way to measure how close they are. Two approaches dominate: Euclidean distance and cosine similarity. For the examples below, we use a 3-dimensional temperature embedding:

TEMPERATURE VECTORS (3 dimensions)
vec(“Hot”)  = [  1.0,  0.8,  0.6 ]
vec(“Warm”) = [  0.8,  0.6,  0.4 ]
vec(“Cold”) = [ -0.6,  0.4, -0.8 ]
      

2.1 Euclidean (Cartesian) Distance

The most intuitive measure — the straight-line gap between the tips of two arrows drawn from the origin. For vectors a and b in n dimensions:

        d(a, b)  =  √( Σ_i (a_i − b_i)² )
      

WORKED EXAMPLE: EUCLIDEAN DISTANCE
// Hot vs Warm (similar words)
d(Hot, Warm) = √[ (1.0-0.8)² + (0.8-0.6)² + (0.6-0.4)² ]

                      = √[ 0.04 + 0.04 + 0.04 ] = √0.12  ≈  0.346  ← small: close together

// Hot vs Cold (opposite words)
d(Hot, Cold) = √[ (1.0-(-0.6))² + (0.8-0.4)² + (0.6-(-0.8))² ]

                      = √[ 2.56 + 0.16 + 1.96 ] = √4.68  ≈  2.163  ← large: far apart
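The worked example above can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the function name `euclidean` is chosen here for illustration, and the vectors are the toy temperature vectors from this section.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy temperature vectors from this section
hot = [1.0, 0.8, 0.6]
warm = [0.8, 0.6, 0.4]
cold = [-0.6, 0.4, -0.8]

d_hot_warm = euclidean(hot, warm)
d_hot_cold = euclidean(hot, cold)
print(round(d_hot_warm, 3), round(d_hot_cold, 3))  # 0.346 2.163
```

On Python 3.8 or later, `math.dist(a, b)` computes the same quantity directly.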
      

2.2 Cosine Similarity — The Industry Standard

In practice, NLP systems commonly prefer cosine similarity over Euclidean distance. It ignores the length of vectors entirely and focuses only on the angle between them: two vectors pointing in the same direction score 1.0 regardless of their magnitude.

COSINE SIMILARITY

        cos(θ)  =  (a · b) / (‖a‖ × ‖b‖)

        Range: −1  (opposite)  →  0  (orthogonal)  →  +1  (identical direction)

WORKED EXAMPLE: COSINE SIMILARITY
// First compute magnitudes

        ‖Hot‖  = √(1.0² + 0.8² + 0.6²) = √2.00 ≈ 1.414

        ‖Warm‖ = √(0.8² + 0.6² + 0.4²) = √1.16 ≈ 1.077

        ‖Cold‖ = √((−0.6)² + 0.4² + (−0.8)²) = √1.16 ≈ 1.077

// Hot vs Warm (small angle)

        dot(Hot, Warm) = (1.0)(0.8) + (0.8)(0.6) + (0.6)(0.4) = 0.80 + 0.48 + 0.24 = 1.52

        cos(Hot, Warm) = 1.52 / (1.414 × 1.077) = 1.52 / 1.523 ≈ +0.998

// Hot vs Cold (large angle)

        dot(Hot, Cold) = (1.0)(-0.6) + (0.8)(0.4) + (0.6)(-0.8) = -0.60 + 0.32 - 0.48 = -0.76

        cos(Hot, Cold) = -0.76 / (1.414 × 1.077) = -0.76 / 1.523 ≈ -0.499

Word Pair	Euclidean d	cos(θ)	Interpretation
Hot vs Warm	0.346	+0.998	Nearly identical direction — closely related
Hot vs Cold	2.163	−0.499	Angle of about 120°, pointing largely away from each other (antonyms in this toy example)
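The cosine values in the table can be checked with a short sketch. The helper names `dot` and `cosine` are illustrative, not taken from any particular library, and the angle is recovered with `math.acos` to show what a similarity of about −0.5 means geometrically.

```python
import math

def dot(a, b):
    """Sum of component-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine of the angle between a and b (dot product over the product of lengths)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

hot = [1.0, 0.8, 0.6]
warm = [0.8, 0.6, 0.4]
cold = [-0.6, 0.4, -0.8]

cos_hw = cosine(hot, warm)
cos_hc = cosine(hot, cold)
angle_hc = math.degrees(math.acos(cos_hc))
print(round(cos_hw, 3), round(cos_hc, 3), round(angle_hc))  # 0.998 -0.499 120
```

A similarity of −1 would mean exactly opposite directions (180°); the Hot/Cold pair here sits at roughly 120°.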

3 · Vector Arithmetic: Meaning You Can Add and Subtract

Because words are vectors, you can perform arithmetic on them — and the results are semantically meaningful. The most famous example uses the 4-dimensional royalty vectors introduced in Section 1:

THE CLASSIC ANALOGY
vec(“King”) − vec(“Man”) + vec(“Woman”)  ≈  vec(“Queen”)
      

WORKED EXAMPLE: KING – MAN + WOMAN
King   = [ 0.9,  0.7,  0.4,  +0.6 ]
Man    = [ 0.5,  0.3,  0.1,  +0.8 ]
Woman  = [ 0.5,  0.3,  0.1,  -0.8 ]

// Subtract component by component, then add

        King – Man = [ 0.9-0.5,  0.7-0.3,  0.4-0.1,  0.6-0.8 ] = [  0.4,   0.4,   0.3,  -0.2 ]

        + Woman   = [ 0.4+0.5,  0.4+0.3,  0.3+0.1,  -0.2+(-0.8) ] = [  0.9,   0.7,   0.4,  -1.0 ]

// Find nearest word by Euclidean distance
result = [ 0.9, 0.7, 0.4, -1.0 ]

d(result, Queen)  = √[ 0 + 0 + 0 + (-1.0-(-0.6))² ] = √0.16 = 0.400 ← nearest
d(result, Woman)  ≈ 0.671    d(result, King) = 1.600    d(result, Man) ≈ 1.910


        cos(result, Queen) ≈ 0.974   ← highest cosine similarity also points to Queen
      

What happened geometrically? Man and Woman share their first three components, so subtracting Man and adding Woman leaves King's royalty, authority, and age values untouched and shifts only the gender coordinate, by −1.6 (from +0.6 to −1.0). The result overshoots Queen's gender value of −0.6 by 0.4, yet Queen is still the nearest word in this vocabulary.
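The analogy can be sketched as a tiny nearest-neighbour search over the four-word vocabulary. The names `vocab`, `analogy`, and `nearest` are illustrative. One assumption to flag: practical analogy tools usually exclude the three input words from the candidates, whereas this sketch searches all four words, as the worked example does.

```python
import math

# Toy 4-dimensional vectors from Section 1
vocab = {
    "King":  [0.9, 0.7, 0.4,  0.6],
    "Queen": [0.9, 0.7, 0.4, -0.6],
    "Man":   [0.5, 0.3, 0.1,  0.8],
    "Woman": [0.5, 0.3, 0.1, -0.8],
}

def analogy(a, b, c):
    """Return vec(a) - vec(b) + vec(c), component by component."""
    return [x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c])]

def nearest(vec):
    """Vocabulary word with the smallest Euclidean distance to vec."""
    return min(vocab, key=lambda word: math.dist(vec, vocab[word]))

result = analogy("King", "Man", "Woman")
print([round(x, 2) for x in result])  # [0.9, 0.7, 0.4, -1.0]
print(nearest(result))                # Queen
```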

4 · Scalar Multiplication and Division: Changing Intensity

Multiplying or dividing a vector by a positive scalar (a plain number) changes its magnitude without changing its direction. This maps neatly onto the idea of degree in language: Tiny, Large, and Gigantic all point in roughly the same semantic direction, but at different intensities.

SIZE VECTORS (3 dimensions)
vec(“Tiny”)     = [ 0.10, 0.20, 0.10 ]
vec(“Large”)    = [ 0.50, 0.70, 0.40 ]
vec(“Gigantic”) = [ 1.10, 1.50, 0.90 ]
      

WORKED EXAMPLE: SCALING ALONG THE SIZE AXIS
// Multiplying Large by 2 moves it toward Gigantic
Large × 2 = [ 0.5×2,  0.7×2,  0.4×2 ] = [ 1.00,  1.40,  0.80 ]
vec(“Gigantic”) = [ 1.10,  1.50,  0.90 ]    d(Large × 2, Gigantic) ≈ 0.173 ← very close

// Multiplying Large by 0.2 moves it toward Tiny
Large × 0.2 = [ 0.10,  0.14,  0.08 ]
vec(“Tiny”) = [ 0.10,  0.20,  0.10 ]    d(Large × 0.2, Tiny) ≈ 0.063 ← very close
      

Division works the same way along an intensity axis. Halving a “Loud” vector lands near “Soft”:

WORKED EXAMPLE: DIVIDING ALONG THE LOUDNESS AXIS
vec(“Loud”) = [ 0.90, 1.20, 0.60 ]    vec(“Soft”) = [ 0.30, 0.40, 0.20 ]
Loud ÷ 2 = [ 0.45,  0.60,  0.30 ]
d(Loud ÷ 2, Soft) ≈ 0.269  ← direction unchanged, intensity halved
      

Key intuition: Scalar operations change how much of something a vector represents, without changing what kind of thing it represents. Direction is preserved; intensity is tuned.
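To make "direction is preserved" concrete, the sketch below doubles the Large vector and checks two things: its length doubles, and its cosine similarity with the original stays at 1. The helper names `scale`, `norm`, and `cosine` are illustrative.

```python
import math

def scale(vec, k):
    """Multiply every component of vec by the scalar k."""
    return [k * x for x in vec]

def norm(vec):
    """Euclidean length of vec."""
    return math.sqrt(sum(x * x for x in vec))

def cosine(a, b):
    """Cosine of the angle between a and b."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

large = [0.50, 0.70, 0.40]
gigantic = [1.10, 1.50, 0.90]
doubled = scale(large, 2)

print(round(norm(doubled) / norm(large), 3))   # 2.0  (length doubles)
print(round(cosine(large, doubled), 3))        # 1.0  (direction unchanged)
print(round(math.dist(doubled, gigantic), 3))  # 0.173
```

Note that this holds for positive scalars; multiplying by a negative number would flip the direction.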

5 · The Dot Product: Agreement and Magnitude Together

The dot product of two vectors is computed by multiplying their corresponding components and summing the results:

DOT PRODUCT
a · b  =  Σ_i (a_i × b_i)  =  a_1 b_1 + a_2 b_2 + … + a_n b_n

The dot product is cosine similarity before normalising away the vector lengths. It captures two things simultaneously: the direction of agreement and the combined magnitude. Cosine similarity captures only the first.

We reuse the loudness vectors from Section 4 under new labels: Very Loud is the “Loud” vector and A Little Loud is the “Soft” vector. They point in exactly the same direction but have very different lengths:

WORKED EXAMPLE: VERY LOUD vs A LITTLE LOUD
vec(“A Little Loud”) = [ 0.30, 0.40, 0.20 ]  |magnitude| = 0.539
vec(“Very Loud”)     = [ 0.90, 1.20, 0.60 ]  |magnitude| = 1.616

// Cosine similarity: measures direction only

        dot(AL, VL) = (0.3)(0.9) + (0.4)(1.2) + (0.2)(0.6) = 0.27 + 0.48 + 0.12 = 0.87

        cos(AL, VL) = 0.87 / (0.539 × 1.616) ≈ 1.000  (exactly 1 before rounding, because VL = 3 × AL)

// Dot product: measures direction AND magnitude
AL · AL = (0.3)² + (0.4)² + (0.2)² = 0.09 + 0.16 + 0.04 = 0.29
VL · VL = (0.9)² + (1.2)² + (0.6)² = 0.81 + 1.44 + 0.36 = 2.61

Comparison	Magnitude	cos(θ)	v · v
A Little Loud	0.539	1.000 (same dir.)	0.29
Very Loud	1.616	1.000 (same dir.)	2.61

Both words are perfectly collinear — cosine similarity is 1.0 in both cases. But the dot products are 0.29 vs 2.61, a 9× difference. This is why recommendation systems and attention mechanisms in transformer models often prefer raw dot products: when you want to know not just whether a document is relevant but also how prominently it discusses a topic, the dot product gives you both dimensions at once.
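The contrast between the two measures is easy to see side by side. In this short sketch (helper names are illustrative), the cosine cannot tell the two loudness vectors apart, while the self dot products differ by a factor of nine.

```python
import math

def dot(a, b):
    """Sum of component-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product with both vector lengths divided out."""
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

a_little_loud = [0.30, 0.40, 0.20]
very_loud = [0.90, 1.20, 0.60]

cos_value = cosine(a_little_loud, very_loud)
self_dot_small = dot(a_little_loud, a_little_loud)
self_dot_large = dot(very_loud, very_loud)

print(round(cos_value, 3))                        # 1.0
print(round(self_dot_small, 2))                   # 0.29
print(round(self_dot_large, 2))                   # 2.61
print(round(self_dot_large / self_dot_small, 2))  # 9.0
```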

6 · Practical Applications

Search engines convert your query into a vector and retrieve documents whose vectors are nearest to it in the semantic space — using cosine similarity to rank by relevance regardless of exact word match. When you search for car insurance and the engine returns results about vehicle coverage, it is doing nearest-neighbour lookup in embedding space, exactly as the Hot/Warm/Cold example in Section 2 demonstrates.

Recommendation systems represent your interests as a vector computed from your history, then find products whose vectors are closest to yours. The dot product is particularly useful here: a highly-relevant item with a large magnitude — analogous to Very Loud — will score higher than a mildly-relevant item even if they point in the same direction.

Large language models use the scaled dot product directly inside the attention mechanism. For every token, a query vector and a set of key vectors are compared via dot product to determine which parts of the context deserve attention — a direct descendant of the arithmetic explored in Section 5.
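As a rough illustration of that last point, here is a minimal sketch of scaled dot-product attention weights in plain Python. Several things are assumptions made for the example: real models derive queries and keys from learned projections rather than using raw embeddings, and the Hot/Warm/Cold vectors from Section 2 stand in for a query and two keys purely as toy data. The function names are illustrative.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot product of the query with each key, followed by softmax."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    return softmax(scores)

# Toy stand-ins (assumption): Hot as the query, Warm and Cold as keys
query = [1.0, 0.8, 0.6]
keys = [[0.8, 0.6, 0.4], [-0.6, 0.4, -0.8]]

weights = attention_weights(query, keys)
print(weights[0] > weights[1])  # True: the Warm key gets more attention than Cold
```

The division by the square root of the dimension is the "scaled" part; it keeps the scores from growing with vector size before the softmax is applied.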

Quick Reference: Embedding Operations

Operation	Formula	Result (Sections 2–5)
Euclidean Distance	√( Σ (a_i − b_i)² )	d(Hot,Warm) = 0.346; d(Hot,Cold) = 2.163
Cosine Similarity	(a·b) / (‖a‖×‖b‖)	cos(Hot,Warm) = +0.998; cos(Hot,Cold) = −0.499
Vector Arithmetic	a ± b	King − Man + Woman → nearest Queen (d = 0.400)
Scalar Multiplication	λ · a	Large × 2 → near Gigantic; Loud ÷ 2 → near Soft
Dot Product	a·b = Σ a_i b_i	cos = 1.000 for both; v·v = 0.29 (A Little Loud) vs 2.61 (Very Loud)

✦ This article was generated with the assistance of Claude by Anthropic ✦

Quantum Computing: The Walsh-Hadamard Matrix — Backbone of Grover’s Diffusion Operator

Part of the Quantum Computing: A Complete Learning Path series.

QUANTUM SERIES 2026
The mathematical foundation behind Grover’s diffusion operator — derived from first principles.

In the Grover’s Algorithm — Inversion About the Mean walkthrough, the diffusion operator applies H⊗³ twice per iteration. Every single step is governed by a sign table called the Hadamard Reference. That table is not a lookup shortcut — it is the 8×8 Walsh-Hadamard Transform matrix written out in full. This post derives it from scratch: one qubit, then two, then all three, arriving at the complete matrix and the rule behind every sign in it.

1 · The Circuit: Three Qubits, Three Hadamard Gates

We initialise all three qubits in the ground state |0⟩ and route each through its own independent Hadamard gate. There are no two-qubit (entangling) gates here — the circuit is entirely parallel.

Qubit	Input	Gate	Output ket
q₀	|0⟩	H	(1/√2)( |0⟩ + |1⟩ )
q₁	|0⟩	H	(1/√2)( |0⟩ + |1⟩ )
q₂	|0⟩	H	(1/√2)( |0⟩ + |1⟩ )

All three outputs are identical because all three inputs are identical. The structure we need emerges when we take their tensor product.

2 · Single-Qubit Hadamard Action

The Hadamard gate H maps the two computational basis states as follows:

Input	H |input⟩	Short notation
|0⟩	(1/√2)( |0⟩ + |1⟩ )	|+⟩
|1⟩	(1/√2)( |0⟩ − |1⟩ )	|−⟩

In matrix form:

      H  =  (1/√2) [ +1   +1 ]
                   [ +1   −1 ]

Key property: H is its own inverse — H² = I. Every element has magnitude 1/√2, so tensoring three copies multiplies the magnitudes to 1/√8 while the signs follow a precise bitwise pattern.
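The self-inverse property is quick to confirm numerically. The sketch below multiplies the 2×2 Hadamard matrix by itself with a small hand-written `matmul` helper (an illustrative name) and compares the product with the identity, allowing for floating-point rounding.

```python
import math

S = 1 / math.sqrt(2)
H = [[S, S],
     [S, -S]]

def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

HH = matmul(H, H)
is_identity = all(
    abs(HH[i][j] - (1.0 if i == j else 0.0)) < 1e-12
    for i in range(2) for j in range(2)
)
print(is_identity)  # True
```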

3 · Two-Qubit Tensor Product: q₀ ⊗ q₁

Expanding the tensor product of the first two post-H qubits:

      |+⟩ ⊗ |+⟩

        = (1/√2)(|0⟩ + |1⟩) ⊗ (1/√2)(|0⟩ + |1⟩)

        = (1/2)( |0⟩⊗|0⟩ + |0⟩⊗|1⟩ + |1⟩⊗|0⟩ + |1⟩⊗|1⟩ )

        = (1/2)( |00⟩ + |01⟩ + |10⟩ + |11⟩ )

All four two-qubit basis states appear with equal amplitude 1/2. Measurement probability per state: (1/2)² = 25%.

4 · Three-Qubit Tensor Product: q₀ ⊗ q₁ ⊗ q₂

Adding the third qubit expands the superposition to all 8 three-bit strings:

      |+⟩ ⊗ |+⟩ ⊗ |+⟩

        = (1/√2)³ (|0⟩+|1⟩) ⊗ (|0⟩+|1⟩) ⊗ (|0⟩+|1⟩)

        = (1/√8)( |000⟩ + |001⟩ + |010⟩ + |011⟩

                  + |100⟩ + |101⟩ + |110⟩ + |111⟩ )

This is |ψ_init⟩, the uniform superposition over all 8 basis states that opens Grover’s algorithm (Phase 0 in the walkthrough). Each state carries amplitude +1/√8 ≈ 0.3536 and measurement probability 1/8 = 12.5%. All signs are positive because we only applied H to |0⟩ inputs; the sign variation appears when H⊗³ acts on states other than |000⟩.
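The tensor-product expansion can be mirrored in code by repeatedly taking the Kronecker product of the single-qubit |+⟩ amplitudes. This is a plain-Python sketch; `kron_vec` is an illustrative helper name, not a Qiskit function.

```python
import math

def kron_vec(u, v):
    """Tensor (Kronecker) product of two amplitude vectors."""
    return [a * b for a in u for b in v]

plus = [1 / math.sqrt(2), 1 / math.sqrt(2)]  # amplitudes of |+> over |0>, |1>

two_qubits = kron_vec(plus, plus)            # amplitudes over |00> .. |11>
three_qubits = kron_vec(two_qubits, plus)    # amplitudes over |000> .. |111>

print(len(three_qubits))                         # 8
print(round(three_qubits[0], 4))                 # 0.3536
print(round(sum(a * a for a in three_qubits), 6))  # 1.0
```

Every entry equals 1/√8, and the squared amplitudes sum to 1, as a valid quantum state requires.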

5 · The 8×8 Walsh-Hadamard Sign Matrix

When H⊗³ is applied to an arbitrary basis state |j⟩, the result is:

      H⊗³ |j⟩  =  (1/√8)   Σᵢ   (−1)^{popcount(i AND j)}   |i⟩
    

The entry at row i, column j carries sign (−1)^{popcount(i AND j)} divided by √8. Because i AND j equals j AND i, the matrix is symmetric, so labelling rows by the input state and columns by the output state (as the table does) gives the same signs. The table below shows all 64 signs: + for +1/√8 and − for −1/√8.

Input H⊗³|j⟩ (rows) / output |i⟩ (columns)	|000⟩	|001⟩	|010⟩	|011⟩	|100⟩	|101⟩	|110⟩	|111⟩
H|000⟩	+	+	+	+	+	+	+	+
H|001⟩	+	−	+	−	+	−	+	−
H|010⟩	+	+	−	−	+	+	−	−
H|011⟩	+	−	−	+	+	−	−	+
H|100⟩	+	+	+	+	−	−	−	−
H|101⟩	+	−	+	−	−	+	−	+
H|110⟩	+	+	−	−	−	−	+	+
H|111⟩	+	−	−	+	−	+	+	−

      +  =  amplitude +1/√8 ≈ +0.3536
      −  =  amplitude −1/√8 ≈ −0.3536
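The whole sign table can be generated from the popcount rule in a few lines. In this sketch, `sign` is an illustrative helper; `bin(i & j).count("1")` is one common way to count set bits in Python.

```python
def sign(i, j):
    """+1 if i AND j has an even number of 1-bits, otherwise -1."""
    return -1 if bin(i & j).count("1") % 2 else 1

SIZE = 8  # 2**3 basis states for three qubits
signs = [[sign(i, j) for j in range(SIZE)] for i in range(SIZE)]

for i, row in enumerate(signs):
    label = format(i, "03b")
    print(label, " ".join("+" if s > 0 else "−" for s in row))
```

The row printed for 101 reads `+ − + − − + − +`, matching the table.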
    

6 · Why the Sign is (−1)^{popcount(i AND j)}

Because H acts independently on each qubit, H⊗³ is the tensor product of three 2×2 matrices. The entry at row i, column j is simply the product of the three corresponding single-qubit entries:

      H⊗³[i, j]  =  H[i₀, j₀]  ×  H[i₁, j₁]  ×  H[i₂, j₂]
    

Each single-qubit factor equals +1 unless both the k-th bit of i and the k-th bit of j are 1, in which case it equals −1. So the k-th factor contributes a sign of (−1)^{iₖ·jₖ}. Multiplying all three:

      sign(i, j)  =  (−1)^{i₀j₀ + i₁j₁ + i₂j₂}  =  (−1)^{popcount(i AND j)}
    

The rule in plain terms: bitwise AND the row index and the column index, count the 1-bits, check parity. Even count → positive. Odd count → negative.

Quick verification: row H|101⟩, column |011⟩

i (row)	j (col)	i AND j	popcount	Sign
101 (= 5)	011 (= 3)	001	1 (odd)	− ✓

Matches the matrix in Section 5: row H|101⟩, column |011⟩ is indeed −.
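The derivation can also be checked exhaustively: build H⊗³ as an explicit Kronecker product of three 2×2 sign matrices and compare all 64 entries against the popcount rule. This is a sketch in plain Python with illustrative helper names; the 1/√2 factors are left out because only the signs are being compared.

```python
def kron(A, B):
    """Kronecker (tensor) product of two matrices given as nested lists."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

H_SIGNS = [[1, 1],
           [1, -1]]  # Hadamard matrix without the 1/sqrt(2) factor

H3_SIGNS = kron(kron(H_SIGNS, H_SIGNS), H_SIGNS)

def popcount_sign(i, j):
    """Sign predicted by the rule (-1)^popcount(i AND j)."""
    return (-1) ** bin(i & j).count("1")

matches = all(H3_SIGNS[i][j] == popcount_sign(i, j)
              for i in range(8) for j in range(8))
print(matches)         # True
print(H3_SIGNS[5][3])  # -1  (row 101, column 011)
```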

7 · Connection to Grover’s Diffusion Operator

This matrix is the Hadamard Reference table used throughout the Grover’s Algorithm — Inversion About the Mean post. The diffusion operator D = H⊗³ (2|0⟩⟨0| − I) H⊗³ works in three sub-steps, each directly using this matrix:

Sub-step	Operation	Grover walkthrough steps
First H⊗³	Maps computational basis → Hadamard basis. Each amplitude spreads across all 8 columns via the sign table.	4.1 · 6.1 · 8.1
Phase flip	2|0⟩⟨0|−I: keeps |000⟩ unchanged, negates all other states. This is the inversion-about-the-mean mechanism.	4.2 · 6.2 · 8.2
Second H⊗³	Maps back to computational basis using the same sign table (H is self-inverse). Routes constructive interference into the target state.	4.3 · 6.3 · 8.3

The bottom line: without the sign structure of the Walsh-Hadamard matrix, neither the uniform superposition (Phase 0) nor the diffusion step (every iteration) would work. The matrix is the silent engine behind Grover’s quadratic speedup.
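The three sub-steps can be sketched numerically to see why the operator is called inversion about the mean. The example state is an assumption made for illustration: a uniform superposition in which a hypothetical oracle has already negated the amplitude of |101⟩. All function names are illustrative, and this is plain Python rather than Qiskit code.

```python
import math

N = 8  # three qubits

def popcount_sign(i, j):
    return (-1) ** bin(i & j).count("1")

# Full Walsh-Hadamard matrix, including the 1/sqrt(8) normalisation
H3 = [[popcount_sign(i, j) / math.sqrt(N) for j in range(N)] for i in range(N)]

def apply(matrix, vec):
    """Matrix-vector product."""
    return [sum(matrix[i][j] * vec[j] for j in range(N)) for i in range(N)]

def diffusion(vec):
    """H3, then (2|0><0| - I), then H3 again."""
    w = apply(H3, vec)
    w = [w[0]] + [-x for x in w[1:]]  # keep |000>, negate everything else
    return apply(H3, w)

amp = 1 / math.sqrt(N)
state = [amp] * N
state[5] = -amp  # hypothetical oracle marked |101>

out = diffusion(state)
mean = sum(state) / N
reflected = [2 * mean - a for a in state]  # inversion about the mean

print(all(abs(o - r) < 1e-12 for o, r in zip(out, reflected)))  # True
print(round(out[5] ** 2, 3))  # 0.781
```

After one diffusion step the marked state's probability rises from 12.5% to about 78%, while each unmarked state drops to about 3%.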

Quantum Series 2026 · Built with Qiskit 1.x

✦ This article was generated with the assistance of Claude by Anthropic ✦

Mathematical Patterns: The Curious Case of 1/998001

Mathematics reveals elegant patterns in unexpected places. Consider 1/998001, which equals 0.000001002003004… and lists every three-digit block from 000 to 999 in order, with the single exception of 998.

This occurs because 998001 = 999². Similar patterns emerge in related fractions:

1/9 = 0.111111…
1/99 = 0.010101…
1/999 = 0.001001001…

These numerical sequences demonstrate that mathematics is not merely computational but reveals fundamental structures underlying our universe. Such patterns have practical applications in algorithm development, cryptography, and data analysis.

The ordered nature of these mathematical curiosities reminds us that even within apparent complexity, we can discover remarkable simplicity and structure.
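The pattern follows from the series 1/(1 − x)² = 1 + 2x + 3x² + …: with x = 1/1000, each successive term places the next integer in its own three-digit slot, until a carry near the end of the cycle causes 998 to be skipped. Python's arbitrary-precision integers make this easy to check. The sketch below extracts one full period of 2997 digits using integer division only.

```python
PERIOD = 2997  # one repeating cycle: 999 blocks of three digits

# First PERIOD decimal digits of 1/998001, via exact integer division
digits = str(10 ** PERIOD // 998001).zfill(PERIOD)
blocks = [digits[i:i + 3] for i in range(0, PERIOD, 3)]

expected = [f"{k:03d}" for k in range(1000) if k != 998]

print(blocks[:5])          # ['000', '001', '002', '003', '004']
print(blocks[-3:])         # ['996', '997', '999']
print(blocks == expected)  # True
```

The same structure appears one scale down: 1/81 = 0.012345679…, where 81 = 9² and the digit 8 is skipped.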

Australian Mathematicians Debunk ‘Infinite Monkey Theorem’ – Slashdot

https://science.slashdot.org/story/24/11/01/0448202/australian-mathematicians-debunk-infinite-monkey-theorem

Quantum Fourier Transform (QFT) of a Single Qubit is Hadamard Transform

Part of the Quantum Computing: A Complete Learning Path series.

Below is the definition of QFT as illustrated in the YouTube lecture by Abraham Asfaw.

The LaTeX code for the equation is as follows.

Latex
| \tilde{x} \rangle \equiv ~ QFT ~ |x \rangle ~ \equiv \frac{1}{\sqrt{N}}\sum_{y=0}^{N-1}{e^{\frac{2\pi ix y}{N}}} ~| y \rangle

For the one qubit case, N = 2¹ = 2:

Latex
| \tilde{x} \rangle \equiv ~ QFT ~ |x \rangle ~ \equiv \frac{1}{\sqrt{N}}\sum_{y=0}^{N-1}{e^{\frac{2\pi ix y}{N}}} ~| y \rangle

Latex
\frac{1}{\sqrt{2}}\sum_{y=0}^{1}{e^{\pi ix y}} ~| y \rangle = \frac{1}{\sqrt{2}}[~e^{i \pi x 0}~ | 0 \rangle ~ + ~ e^{i \pi x 1}~| 1 \rangle] = \frac{1}{\sqrt{2}}[~|0\rangle ~+~e^{i \pi x}~|1 \rangle~]

When x = 0:

Latex
QFT~| 0 \rangle = \frac{1}{\sqrt{2}}[~|0\rangle ~+~e^{i \pi 0}~|1 \rangle~] = \frac{1}{\sqrt{2}}[~| 0 \rangle + |1 \rangle~] = |+\rangle

When x = 1:

Latex
QFT~| 1 \rangle = \frac{1}{\sqrt{2}}[~|0\rangle ~+~e^{i \pi 1}~|1 \rangle~] = \frac{1}{\sqrt{2}}[~| 0 \rangle - |1 \rangle~] = |-\rangle

Hence the QFT of a single qubit is essentially the Hadamard transform.
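The same conclusion can be checked numerically by building the QFT matrix straight from the definition above and comparing it with the Hadamard matrix for N = 2. This is a minimal sketch using Python's `cmath`; the function name `qft_matrix` is illustrative.

```python
import cmath
import math

def qft_matrix(n_states):
    """Entry [y][x] is exp(2*pi*i*x*y / N) / sqrt(N), following the definition above."""
    return [[cmath.exp(2j * cmath.pi * x * y / n_states) / math.sqrt(n_states)
             for x in range(n_states)]
            for y in range(n_states)]

S = 1 / math.sqrt(2)
hadamard = [[S, S],
            [S, -S]]

qft2 = qft_matrix(2)
same = all(abs(qft2[y][x] - hadamard[y][x]) < 1e-12
           for y in range(2) for x in range(2))
print(same)  # True
```

The comparison uses a small tolerance because `cmath.exp(1j * pi)` returns −1 plus a tiny imaginary rounding error.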

Monty Hall Problem

https://www.instagram.com/reel/C1hOMIUv7Cs/?igsh=dmwza3JxMGN4MGVk

Chladni Figure (2D Standing Wave)

https://www.facebook.com/share/v/e3xtU31kHAjd84nh/?mibextid=oEMz7o

The answer to life, the universe, and everything | MIT News | Massachusetts Institute of Technology

https://news.mit.edu/2019/answer-life-universe-and-everything-sum-three-cubes-mathematics-0910

Reductions among number theoretic problems – ScienceDirect

https://www.sciencedirect.com/science/article/pii/0890540187900307
