Text (or other content, like images, with the right model) can be embedded, producing a vector
Most LLM providers offer embedding models/endpoints that take input and return a vector
Vectors are represented as arrays of numbers (generally 32-bit floats)
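For a concrete example, here is a minimal sketch of calling an embeddings endpoint; it assumes the OpenAI Python SDK and the text-embedding-3-small model, but any provider's embeddings endpoint returns a similar list of floats
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    resp = client.embeddings.create(model="text-embedding-3-small", input="vector databases")
    vec = resp.data[0].embedding  # a plain Python list of floats
    print(len(vec))               # the dimension, e.g. 1536 for this model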
They have a magnitude
This is the length of the vector. Length is NOT the number of components in the vector; that's the dimension, which is usually fixed per model (512, 1536, etc.)
They also have a direction
This is the actual numerical encoding of the meaning of the input
Normalizing a vector scales it to a magnitude of 1, which removes the impact of magnitude on similarity
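A short NumPy sketch of the distinction between dimension and magnitude, and of normalization (the values here are just illustrative, not real embeddings)
    import numpy as np

    v = np.array([2.0, 3.0, 4.0], dtype=np.float32)
    print(v.shape[0])             # dimension: 3 components
    print(np.linalg.norm(v))      # magnitude (length): sqrt(2^2 + 3^2 + 4^2) ~= 5.385
    unit = v / np.linalg.norm(v)  # normalization: same direction, magnitude 1
    print(np.linalg.norm(unit))   # ~= 1.0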
Similarity between vectors can be calculated in multiple ways
Two very common ways are cosine similarity and inner product similarity (aka Dot Product)
For normalized vectors, cosine and inner product similarity are the same. This is because normalization yields a denominator of 1 in the cosine calculation.
The numerator of the cosine similarity formula is the inner product
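A quick numeric check of that claim, using NumPy and two arbitrary example vectors: normalize first, and a plain inner product gives the same number as cosine similarity of the originals
    import numpy as np

    a = np.array([1.0, 2.0, 2.0])
    b = np.array([3.0, 0.0, 4.0])

    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    dot_of_normalized = np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b))
    print(cosine, dot_of_normalized)  # both ~= 0.733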
Inner product calculation: A · B
A: [2,3,4]
B: [5,4,3]
Inner product: (2*5) + (3*4) + (4*3) = 10 + 12 + 12 = 34
Cosine similarity: (A · B) / (||A|| * ||B||)
Double bars (||A||) denote the length, or norm, of a vector
This is calculated by taking the square root of the sum of the squares of the vector's components
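Putting those pieces together for the A and B vectors above, a plain-Python worked version
    import math

    A = [2, 3, 4]
    B = [5, 4, 3]

    inner = sum(x * y for x, y in zip(A, B))   # (2*5) + (3*4) + (4*3) = 34
    norm_A = math.sqrt(sum(x * x for x in A))  # sqrt(4 + 9 + 16)  ~= 5.385
    norm_B = math.sqrt(sum(x * x for x in B))  # sqrt(25 + 16 + 9) ~= 7.071
    cosine = inner / (norm_A * norm_B)         # 34 / 38.079 ~= 0.893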
In a vector database, a vector can be quantized and stored as 16-bit values using the halfvec data type
The halfvec data type is specific to pgvector (the Postgres extension)
So 0.01610957272350788 becomes something like 0.01611328125
Halfvec values are almost exactly the same, but consume half the storage space (2 bytes per component instead of 4)
This reduces disk requirements and also increases computation speed with minimal impact on the result
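The quantization example above can be reproduced with NumPy's float16, which is a reasonable stand-in for a half-precision halfvec component
    import numpy as np

    full = np.float32(0.01610957272350788)
    half = np.float16(full)
    print(float(half))               # 0.01611328125
    print(full.nbytes, half.nbytes)  # 4 bytes vs. 2 bytes per component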
Pair halfvec with inner product on normalized vectors for increased performance when comparing vectors (cosine similarity on normalized vectors gives the exact same result but requires extra computation)
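A minimal end-to-end sketch of that pairing, assuming pgvector >= 0.7 (which introduced halfvec), psycopg 3, and a hypothetical items table; the 3-dimensional vectors and the halfvec_ip_ops operator class name are for illustration, so check them against your installed pgvector version
    import psycopg

    with psycopg.connect("dbname=example") as conn:
        conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding halfvec(3))"
        )
        # HNSW index using pgvector's inner product operator class for halfvec
        conn.execute(
            "CREATE INDEX IF NOT EXISTS items_embedding_idx "
            "ON items USING hnsw (embedding halfvec_ip_ops)"
        )
        # <#> is pgvector's negative inner product, so ascending order returns the
        # most similar rows first; assumes stored and query vectors are normalized
        rows = conn.execute(
            "SELECT id FROM items ORDER BY embedding <#> %s::halfvec LIMIT 5",
            ("[0.267, 0.534, 0.802]",),
        ).fetchall()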