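Introduction to Text Embedding

This article gives a broad overview of text embedding: basic concepts, main algorithms, implementation libraries, model evaluation, vector retrieval methods, and applications.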

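Contents

Basic concepts
Algorithms
Implementation libraries and pretrained models
Model evaluation
Vector retrieval
1. Vector search libraries
2. Vector databases
Applications
1. Search
2. RAG
Commercial companies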

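Basic concepts

In NLP, the traditional encoding is one-hot encoding. Each word or feature is mapped to a unique integer index and then converted into a high-dimensional sparse vector in which the position at that index is 1 and every other position is 0. Take a dataset with only 5 characters (我爱你中华) as an example: “我” might be encoded as [1, 0, 0, 0, 0], “爱” as [0, 1, 0, 0, 0], “你” as [0, 0, 1, 0, 0], and so on. The standard Information Technology: Chinese Coded Character Set (GB 18030-2022), in effect since August 2023, contains 87,887 Chinese characters, so a plain one-hot encoding of characters would need vectors with more than 80,000 dimensions. One-hot encoding is simple and intuitive, but its vectors are very high-dimensional and sparse. Its other drawback is that it cannot capture semantic relationships between words: for example, 东, 南, 西, and 北 (east, south, west, north) all denote directions, yet their one-hot codes are unrelated to one another.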
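The one-hot scheme described above can be sketched in a few lines of Python. This is a minimal illustration for the five-character toy vocabulary only; the helper names `one_hot` and `dot` are made up for this example.

```python
# Toy vocabulary: the five characters from the example above.
vocab = ["我", "爱", "你", "中", "华"]
char_to_index = {ch: i for i, ch in enumerate(vocab)}


def one_hot(ch):
    """Return the one-hot vector of a character in the toy vocabulary."""
    vec = [0] * len(vocab)
    vec[char_to_index[ch]] = 1
    return vec


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


print(one_hot("我"))  # [1, 0, 0, 0, 0]
print(one_hot("爱"))  # [0, 1, 0, 0, 0]
# Two different one-hot vectors are always orthogonal,
# so the encoding carries no information about similarity.
print(dot(one_hot("我"), one_hot("爱")))  # 0
```

The vector length equals the vocabulary size, which is why a character set with tens of thousands of entries leads to equally long, almost entirely zero vectors.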

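“Continuous vector representations of words”, also called word vectors or word embeddings, are a way of representing vocabulary in natural language processing (NLP). They map words from discrete symbols to points in a continuous vector space. With this representation, the similarity between words can be computed, and machine learning models can use the vectors effectively. The main characteristics are as follows.

Continuity (Continuous): unlike traditional one-hot encoding, word vectors are continuous; their components are floating-point numbers rather than only 0s and 1s. This makes it possible to compute a distance or similarity between word vectors (such as Euclidean distance or cosine similarity) and thereby capture semantic and syntactic relationships between words.
Dimensionality: word vectors usually have a fixed dimension, for example 128, 256, or 1024.
Capturing semantic relationships: through training, word vectors can capture semantic relationships between words. For example, vector(“king”) - vector(“man”) + vector(“woman”) may land close to vector(“queen”).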
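To make the cosine similarity and the king/queen analogy concrete, here is a small sketch. The three-dimensional vectors are hand-made for illustration (they are not taken from any trained model), and the axis meanings in the comment are an assumption of this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings; axes loosely mean (royalty, maleness, fruit).
embeddings = {
    "man": [0.1, 0.9, 0.0],
    "woman": [0.1, 0.1, 0.0],
    "king": [0.9, 0.9, 0.0],
    "queen": [0.9, 0.1, 0.0],
    "apple": [0.0, 0.0, 1.0],
}

# vector("king") - vector("man") + vector("woman")
target = [
    k - m + w
    for k, m, w in zip(embeddings["king"], embeddings["man"], embeddings["woman"])
]

# Rank the remaining words by cosine similarity to the target vector.
candidates = [w for w in embeddings if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine_similarity(target, embeddings[w]))
print(best)  # queen
```

With real embeddings the match is only approximate, which is why the result is described as landing close to “queen” rather than exactly on it.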

Algorithms

Transformers family (mainstream algorithms)

BERT family (Encoder)

Vanilla BERT
Sentence-BERT (SBERT)
1. Paper: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019.08, Technische Universität Darmstadt, Germany)
SimCSE: Simple Contrastive Learning of Sentence Embeddings (2021.04, Princeton University, Tsinghua University)
1. Highlights: unsupervised contrastive learning; dropout as data augmentation
ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer (2021.05, Beijing University of Posts and Telecommunications, Meituan)
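The contrastive learning idea behind SimCSE-style training can be sketched as an in-batch loss: two views of the same sentence (in unsupervised SimCSE, two encoder passes with different dropout masks) should be more similar to each other than to views of other sentences in the batch. The code below is a simplified illustration with hand-written vectors standing in for encoder outputs; the function name `info_nce_loss` and the temperature value are assumptions of this sketch, not the papers' reference code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce_loss(view_a, view_b, temperature=0.05):
    """Mean in-batch contrastive loss.

    view_a[i] and view_b[i] are two views of sentence i (the positive pair);
    every other view_b[j] in the batch serves as a negative for view_a[i].
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # log-sum-exp, shifted by the maximum for numerical stability
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Hypothetical embeddings: two slightly different views of three sentences.
view_a = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
view_b = [[0.9, 0.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

aligned = info_nce_loss(view_a, view_b)
# Rotating view_b pairs every sentence with the wrong partner.
misaligned = info_nce_loss(view_a, view_b[1:] + view_b[:1])
print(aligned < misaligned)  # True
```

Training lowers this loss, which pulls the two views of each sentence together and pushes different sentences apart.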

GPT family (Decoder)

Paper: SGPT: GPT Sentence Embeddings for Semantic Search (2022.02, Peking University)
Code: Muennighoff/sgpt: SGPT
1. Highlights: uses GPT models to generate sentence embeddings.
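One way to turn a decoder's per-token hidden states into a single sentence embedding is position-weighted mean pooling, which the SGPT paper describes: because of causal attention, later tokens have seen more of the sentence, so they receive larger weights. The sketch below assumes that weighting scheme (weight of token i proportional to i) and uses hand-written token vectors in place of real model outputs.

```python
def position_weighted_mean(token_embeddings):
    """Pool token vectors into one vector, weighting token i (1-based) by i / (1 + 2 + ... + n)."""
    n = len(token_embeddings)
    total = n * (n + 1) / 2
    dim = len(token_embeddings[0])
    return [
        sum((i + 1) / total * token[d] for i, token in enumerate(token_embeddings))
        for d in range(dim)
    ]


# Hypothetical hidden states for a two-token sentence.
tokens = [[1.0, 0.0], [0.0, 1.0]]
sentence_embedding = position_weighted_mean(tokens)
# The weights are 1/3 for the first token and 2/3 for the second.
print(sentence_embedding)
```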

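Word2Vec (fast)

The following two papers together form the foundation of the word2vec method and were pioneering work on word embeddings in natural language processing.

Paper: Efficient Estimation of Word Representations in Vector Space (2013.01, Google)
1. Main contributions: proposes the Continuous Bag-of-Words (CBOW) and Skip-gram models, which use a neural network to turn words into vector representations and significantly improve performance on semantic and syntactic similarity tasks.
  1. CBOW: predicts the center word from its context.
  2. Skip-gram: predicts the context from a single word.
Paper: Distributed Representations of Words and Phrases and their Compositionality (2013.10, Google)
1. Main contributions: improvements to the training method.
  1. Subsampling of the frequent words: randomly discards occurrences of high-frequency words, which lowers their weight in training.
  2. Negative sampling: randomly samples negative examples from words outside the context and trains on them together with the positive examples.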
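The ideas in the two papers can be illustrated without any neural network code. The sketch below shows how CBOW and Skip-gram training samples are cut from a sentence, the discard probability used for subsampling frequent words, and the smoothed unigram distribution from which negative samples are drawn. The function names and the toy sentence are made up for this example; the threshold and the 3/4 exponent follow the values commonly quoted from the second paper and should be read as illustrative defaults.

```python
import math
from collections import Counter


def training_samples(tokens, window=2):
    """Build Skip-gram (center, context) pairs and CBOW (context, center) samples."""
    skipgram_pairs = []
    cbow_samples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        cbow_samples.append((context, center))  # context -> center word
        skipgram_pairs.extend((center, c) for c in context)  # center -> each context word
    return skipgram_pairs, cbow_samples


def discard_probability(word_frequency, threshold=1e-5):
    """Subsampling: chance of dropping one occurrence of a word with this relative frequency."""
    return max(0.0, 1.0 - math.sqrt(threshold / word_frequency))


def noise_distribution(counts, power=0.75):
    """Negative sampling: unigram counts raised to the 3/4 power, then normalized."""
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


tokens = ["the", "quick", "brown", "fox", "jumps"]
pairs, cbow = training_samples(tokens, window=1)
print(pairs[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
print(cbow[1])    # (['the', 'brown'], 'quick')

# A word that makes up 1% of the corpus is dropped most of the time.
print(discard_probability(0.01) > 0.9)  # True

# The 3/4 power flattens the distribution: "the" gets 8/9 instead of 16/17.
noise = noise_distribution(Counter({"the": 16, "fox": 1}))
```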

Some Word2Vec references: On word embeddings - Part 1, 2, 3; The Illustrated Word2vec; Word2Vec Tutorial - The Skip-Gram Model and Negative Sampling.

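GloVe

GloVe (Global Vectors for Word Representation; code repository: GloVe) is a word embedding method based on global statistics and matrix factorization. It factorizes the word co-occurrence matrix of the entire corpus to obtain vector representations of the words.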
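As a rough illustration of the global statistics GloVe starts from, the sketch below builds a distance-weighted co-occurrence table from a tiny corpus and evaluates the weighted squared-error term that GloVe minimizes for one word pair. It is a simplified reading of the method, not the reference implementation; the corpus, the function names, and the default values of `x_max` and `alpha` are assumptions of this example.

```python
import math
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Symmetric co-occurrence counts; a pair d positions apart contributes 1/d."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): damps rare pairs and caps frequent pairs at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between the model score and log co-occurrence for one pair."""
    score = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j
    return glove_weight(x_ij) * (score - math.log(x_ij)) ** 2


corpus = [["ice", "is", "cold"], ["steam", "is", "hot"]]
counts = cooccurrence_counts(corpus)
print(counts[("ice", "is")])    # 1.0
print(counts[("ice", "cold")])  # 0.5
```

Training sums `pair_loss` over all pairs with a non-zero count and adjusts the word vectors and biases to reduce it, which amounts to a weighted factorization of the log co-occurrence matrix.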

Other

Introduction to BM25: The Probabilistic Relevance Framework: BM25 and Beyond (2009.04, -)
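For comparison with embedding-based similarity, a BM25 scorer fits in a few lines. The sketch below uses one common variant of the formula (a smoothed, non-negative IDF and the usual `k1` and `b` length-normalization parameters); the toy documents and parameter defaults are assumptions of this example.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against a tokenized query with BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            tf = term_freq[term]
            # Term frequency saturates via k1; long documents are penalized via b.
            length_norm = k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    ["text", "embedding", "models"],
    ["vector", "search", "with", "faiss"],
    ["bm25", "is", "a", "ranking", "function", "for", "text", "search"],
]
scores = bm25_scores(["text", "search"], docs)
print(scores.index(max(scores)))  # 2: the only document containing both query terms
```

Unlike embeddings, BM25 matches exact terms only, so a query that shares no tokens with a document scores zero regardless of meaning.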

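Implementation libraries and pretrained models

FlagEmbedding

FlagEmbedding is an open-source retrieval-augmentation library for LLMs from the Beijing Academy of Artificial Intelligence (BAAI). It includes pretrained embedding models, pretrained reranker models, evaluation tools, and more.

Commonly used pretrained models for Chinese include bge-m3 and bge-large-zh-v1.5.

According to one evaluation, bge-m3 currently performs best.

The following example uses bge-large-zh-v1.5.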

import os
from FlagEmbedding import FlagModel

# First download the model files from https://huggingface.co/BAAI/bge-large-zh-v1.5 into a local directory.
model_path = os.path.expanduser('~/Downloads')

sentences_1 = [
    "我喜欢看电影，特别是科幻电影。",
    "今天的天气真晴朗，适合外出散步。",
    ]
sentences_2 = [
    "阳光明媚的今天，很适合去户外走走。",
    "看电影是我的爱好，尤其是科幻类的。",
    "天气对运动和心情都有显著的影响。",
    "这幅画是由一位著名画家创作的，展现了他对大自然的热爱。"
    ]
model = FlagModel(model_path + '/bge-large-zh-v1.5',
                query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章：",
                use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
# Output
#  [[0.3157 0.9507 0.3777 0.3667]
#  [0.8413 0.307  0.568  0.385 ]]

# for s2p(short query to long passage) retrieval task, suggest to use encode_queries() which will automatically add the instruction to each query
# corpus in retrieval task can still use encode() or encode_corpus(), since they don't need instruction
queries = [
    '青莲居士是谁', 
    'query_2'
    ]
passages = [
    "李白字太白，号青莲居士，著有《李太白集》，代表作有《望庐山瀑布》《行路难》等。", 
    "样例文档-2"
    ]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)
scores = q_embeddings @ p_embeddings.T
print(scores)
# Output
#  [[0.501  0.0868]
#  [0.1581 0.385 ]]
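A similarity matrix like the ones printed above is usually turned into a ranking. The sketch below does this in plain Python for the first matrix, with the values copied from the example output (rows are `sentences_1`, columns are `sentences_2`); the helper name `rank_columns` is made up for this illustration.

```python
# Similarity matrix copied from the first printed output above.
similarity = [
    [0.3157, 0.9507, 0.3777, 0.3667],
    [0.8413, 0.307, 0.568, 0.385],
]


def rank_columns(score_row):
    """Return column indices sorted from the highest score to the lowest."""
    return sorted(range(len(score_row)), key=lambda j: score_row[j], reverse=True)


rankings = [rank_columns(row) for row in similarity]
for i, ranking in enumerate(rankings):
    print(i, ranking)
# 0 [1, 2, 3, 0]
# 1 [0, 2, 3, 1]
```

The top-ranked column for each row is the paraphrase of that sentence: the movie sentence matches `sentences_2[1]` and the weather sentence matches `sentences_2[0]`.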

Sentence Transformers

Code repository: Sentence Transformers: Multilingual Sentence & Image Embeddings with BERT

Gensim

Code repository: gensim: Topic Modelling for Humans

Other

text2vec: a text vector representation toolkit that converts text into vector matrices; it implements text representation and text similarity models such as Word2Vec, RankBM25, Sentence-BERT, and CoSENT.
fastText: no longer updated.
BCEmbedding: Youdao's open-source embedding and reranker models for RAG products.

Model evaluation

Paper: MTEB: Massive Text Embedding Benchmark (2022.10, Hugging Face)
Code: embeddings-benchmark/mteb: large-scale text embedding evaluation
Chinese text embedding evaluation: CMTEB

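Vector retrieval

Vector search libraries

Approximate Nearest Neighbor (ANN) search is a family of algorithms for finding nearest neighbors in large datasets. The goal is to find data points close to a given query point in as little time as possible, without guaranteeing that the exact nearest neighbor is returned. To achieve this, ANN methods rely on heuristics such as pruning and approximate search to speed up nearest-neighbor search.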
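To see what ANN methods approximate, it helps to look at the exact baseline: a brute-force scan that compares the query with every vector. The sketch below implements that scan and a recall measure for judging an approximate result against it. The random data and the simulated "approximate" result are placeholders invented for this example.

```python
import heapq
import random


def squared_l2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def exact_knn(query, vectors, k):
    """Brute force: scan every vector and return the ids of the k nearest ones."""
    return heapq.nsmallest(
        k, range(len(vectors)), key=lambda i: squared_l2(query, vectors[i])
    )


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true nearest neighbors that the approximate result contains."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


random.seed(0)
vectors = [[random.random() for _ in range(5)] for _ in range(200)]
query = [random.random() for _ in range(5)]

exact = exact_knn(query, vectors, k=4)

# Simulated ANN result: three true neighbors plus one id that is not a true neighbor.
wrong_id = next(i for i in range(len(vectors)) if i not in exact)
approx = exact[:3] + [wrong_id]
print(recall_at_k(approx, exact))  # 0.75
```

The brute-force scan costs time proportional to the dataset size for every query; ANN libraries trade a little recall for much lower query time.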

ANN search is the core functionality of a vector database. Below are several open-source ANN libraries (see Semantic Search):

FAISS: A library for efficient similarity search and clustering of dense vectors.
1. Paper: The Faiss library (2024.01, Meta)
  1. The Faiss library is dedicated to vector similarity search, a core functionality of vector databases. Faiss is a toolkit of indexing methods and related primitives used to search, cluster, compress and transform vectors.
2. Paper: Billion-scale similarity search with GPUs (2017.02, Meta)
3. The faiss home page lists the research papers that its implementation is based on.
spotify/annoy: Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk
hnswlib: Header-only C++/python library for fast approximate nearest neighbors (reported to perform well in the sbert evaluation).
Microsoft SPTAG: A library for fast approximate nearest neighbor search

To give an intuitive feel, here is a simple example that uses FAISS.

# pip3 install faiss-cpu 
import faiss
import numpy as np

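# Prepare the dataset and the query vectors
d = 5  # dimensionality of the embedding vectors
nb = 200  # number of vectors in the dataset
nq = 3  # number of query vectors
np.random.seed(42)
# Dataset
xb = np.random.random((nb, d)).astype('float32')
# Made-up ids; in a real production environment these would usually be a primary key.
# Faiss stores ids as int64, so build an explicit int64 array.
ids = np.arange(6789, 6789 + nb, dtype='int64')
# Query set
xq = np.random.random((nq, d)).astype('float32')

nlist = 5  # Number of clusters
quantizer = faiss.IndexFlatL2(d)
index_ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
# The index partitions the vectors into clusters, so it must be trained before vectors are added
index_ivf.train(xb)
# Add the dataset to the index
index_ivf.add_with_ids(xb, ids)  # to use auto-incrementing ids instead, call add()
print(f"Index total dataset vectors: {index_ivf.ntotal}")
# Output: Index total dataset vectors: 200

# Number of clusters to visit per query (the default is 1)
index_ivf.nprobe = 2
# Number of nearest neighbors returned per query
k = 4
# Run the search
D, I = index_ivf.search(xq, k)
print(f"Distances: {D}")
print(f"Indices: {I}")
# Example output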
# Distances: [[0.0742211  0.07672412 0.07856286 0.08607544]
#  [0.07868006 0.0847766  0.10127702 0.12898237]
#  [0.11713254 0.17307967 0.19850121 0.28180796]]
# Indices: [[6832 6807 6881 6828]
#  [6842 6967 6831 6814]
#  [6814 6820 6911 6942]]
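The roles of `nlist` and `nprobe` in the example above can be illustrated with a pure-Python toy version of an inverted-file (IVF) index. This is a conceptual sketch only, not how Faiss is implemented: the centroids here are a random sample of the data rather than the result of k-means training, and the function names are invented for this example.

```python
import heapq
import random


def squared_l2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def build_ivf(vectors, nlist, seed=0):
    """Assign every vector to the inverted list of its nearest centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, nlist)  # simplification: no k-means training
    lists = [[] for _ in range(nlist)]
    for idx, vec in enumerate(vectors):
        nearest = min(range(nlist), key=lambda c: squared_l2(vec, centroids[c]))
        lists[nearest].append(idx)
    return centroids, lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    probed = heapq.nsmallest(
        nprobe, range(len(centroids)), key=lambda c: squared_l2(query, centroids[c])
    )
    candidates = [idx for c in probed for idx in lists[c]]
    return heapq.nsmallest(k, candidates, key=lambda i: squared_l2(query, vectors[i]))


rng = random.Random(42)
vectors = [[rng.random() for _ in range(5)] for _ in range(200)]
query = [rng.random() for _ in range(5)]
centroids, lists = build_ivf(vectors, nlist=5)

# Probing every list is equivalent to a brute-force scan.
exact = ivf_search(query, vectors, centroids, lists, k=4, nprobe=5)
# Probing fewer lists scans less data but may miss true neighbors.
approx = ivf_search(query, vectors, centroids, lists, k=4, nprobe=2)
print(len(set(exact) & set(approx)) / len(exact))  # recall of the approximate search
```

Raising `nprobe` moves the result toward the exact answer at the cost of scanning more vectors, which is the speed and accuracy trade-off that the `nprobe` setting controls.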