Daily Source Code Reading --- Day1_similarity_search_with_score_by_vector

# Find the documents in the collection that are most similar to a given embedding vector
def similarity_search_with_score_by_vector(
        self,
        embedding: List[float],
        k: int = 4,
        filter: Optional[Union[Callable, Dict[str, Any]]] = None,
        fetch_k: int = 20,
        **kwargs: Any,
    ) -> List[Tuple[Document, float]]:
        """Return docs most similar to query.

        Args:
            embedding: Embedding vector to look up documents similar to.
            k: Number of Documents to return. Defaults to 4.
            filter (Optional[Union[Callable, Dict[str, Any]]]): Filter by metadata.
                Defaults to None. If a callable, it must take as input the
                metadata dict of Document and return a bool.
            fetch_k: (Optional[int]) Number of Documents to fetch before filtering.
                      Defaults to 20.
            **kwargs: kwargs to be passed to similarity search. Can include:
                score_threshold: Optional, a floating point value between 0 to 1 to
                    filter the resulting set of retrieved docs

        Returns:
            List of documents most similar to the query text and L2 distance
            in float for each. Lower score represents more similarity.
        """
        # Import faiss at the start of the function; faiss is a library for
        # efficient similarity search and clustering of dense vectors
        faiss = dependable_faiss_import()
        # Wrap the embedding in a 2-D float32 array of shape (1, dim), the format faiss expects
        vector = np.array([embedding], dtype=np.float32)
        # If _normalize_L2 is set, L2-normalize the vector in place
        if self._normalize_L2:
            faiss.normalize_L2(vector)
        # Search the faiss index with the query vector: fetch k results, or
        # fetch_k results when a filter will be applied afterwards
        scores, indices = self.index.search(vector, k if filter is None else fetch_k)
        docs = []
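        # If a filter was provided, build the filter function. The loop below then
        # pairs each retrieved document with the score faiss already returned
        # (no score is recomputed here) and keeps only documents that pass the filter.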
        if filter is not None:
            filter_func = self._create_filter_func(filter)

        for j, i in enumerate(indices[0]):
            if i == -1:
                # This happens when not enough docs are returned.
                continue
            _id = self.index_to_docstore_id[i]
            doc = self.docstore.search(_id)
            if not isinstance(doc, Document):
                raise ValueError(f"Could not find document for id {_id}, got {doc}")
            if filter is not None:
                if filter_func(doc.metadata):
                    docs.append((doc, scores[0][j]))
            else:
                docs.append((doc, scores[0][j]))
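        # If score_threshold is provided, keep only documents whose score is
        # >= the threshold (MAX_INNER_PRODUCT, JACCARD) or <= the threshold
        # (all other distance strategies)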
        score_threshold = kwargs.get("score_threshold")
        if score_threshold is not None:
            cmp = (
                operator.ge
                if self.distance_strategy
                in (DistanceStrategy.MAX_INNER_PRODUCT, DistanceStrategy.JACCARD)
                else operator.le
            )
            docs = [
                (doc, similarity)
                for doc, similarity in docs
                if cmp(similarity, score_threshold)
            ]
        # Return at most k results, each a (document, score) pair
        return docs[:k]
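
The interplay between k, fetch_k, filter and score_threshold is easier to see in a toy version. The following is a minimal sketch in plain Python, not the library's implementation: `toy_search`, `make_filter` and `squared_l2` are hypothetical names, the brute-force ranking stands in for `self.index.search`, and the dict filter is assumed to mean exact equality on every key, which may be simpler than what `_create_filter_func` supports.

```python
import operator
from typing import Any, Callable, Dict, List, Optional, Tuple, Union

Metadata = Dict[str, Any]
Filter = Union[Callable[[Metadata], bool], Metadata]


def make_filter(flt: Filter) -> Callable[[Metadata], bool]:
    """Hypothetical stand-in for _create_filter_func (exact-match dicts only)."""
    if callable(flt):
        return flt
    return lambda meta: all(meta.get(key) == value for key, value in flt.items())


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance; lower means more similar."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def toy_search(
    store: List[Tuple[List[float], Metadata]],
    query: List[float],
    k: int = 4,
    flt: Optional[Filter] = None,
    fetch_k: int = 20,
    score_threshold: Optional[float] = None,
) -> List[Tuple[Metadata, float]]:
    # Step 1: rank by distance and keep k candidates, or fetch_k if filtering.
    n = k if flt is None else fetch_k
    ranked = sorted(
        ((meta, squared_l2(vec, query)) for vec, meta in store),
        key=lambda pair: pair[1],
    )[:n]
    # Step 2: apply the metadata filter to the fetched candidates only.
    if flt is not None:
        keep = make_filter(flt)
        ranked = [(meta, score) for meta, score in ranked if keep(meta)]
    # Step 3: for a distance metric, keep scores at or below the threshold.
    if score_threshold is not None:
        ranked = [(m, s) for m, s in ranked if operator.le(s, score_threshold)]
    # Step 4: never return more than k results.
    return ranked[:k]


store = [
    ([0.0, 0.0], {"lang": "en", "id": 0}),
    ([0.1, 0.0], {"lang": "zh", "id": 1}),
    ([0.2, 0.0], {"lang": "zh", "id": 2}),
    ([1.0, 0.0], {"lang": "en", "id": 3}),
    ([2.0, 0.0], {"lang": "en", "id": 4}),
]
query = [0.0, 0.0]

wide = toy_search(store, query, k=2, flt={"lang": "en"}, fetch_k=5)
narrow = toy_search(store, query, k=2, flt={"lang": "en"}, fetch_k=2)
capped = toy_search(store, query, k=3, score_threshold=0.02)

print([m["id"] for m, _ in wide])    # [0, 3]
print([m["id"] for m, _ in narrow])  # [0]
print([m["id"] for m, _ in capped])  # [0, 1]
```

The `narrow` call shows the practical consequence of the design: because the filter runs after the index search, a fetch_k that is too small can leave fewer than k results even though more matching documents exist in the store.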

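embedding (List[float]): The embedding vector to find similar documents for.
k (int, default 4): The number of documents to return.
filter (Optional[Union[Callable, Dict[str, Any]]], default None): Metadata filter applied to the results. If provided, it can be a callable or a dict. A callable must take a document's metadata dict and return a bool.
fetch_k (int, default 20): The number of documents to retrieve before the filter is applied.
kwargs (Any): Additional arguments passed to the similarity search. For example, score_threshold (Optional[float]): a floating point value between 0 and 1 used to filter the set of retrieved documents.
Returns a list containing the documents most similar to the query, together with the L2 distance (a float) for each document. A lower score means higher similarity.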
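
When picking a score_threshold it helps to know what range the scores fall in once _normalize_L2 is enabled. The sketch below is a plain-Python check under one assumption: that the index reports squared L2 distances, as I understand FAISS's flat L2 index to do. For unit vectors the squared distance equals 2 - 2·cos(a, b), so it lies between 0 and 4. The helper names (`l2_normalize`, `squared_l2`, `cosine`) are hypothetical and not part of the function above.

```python
import math
from typing import List


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = [3.0, 4.0]
b = [4.0, 3.0]

# Distance between the normalized vectors versus the cosine identity.
d2 = squared_l2(l2_normalize(a), l2_normalize(b))
from_cosine = 2 - 2 * cosine(a, b)
print(d2, from_cosine)

# Opposite directions give the largest possible value for unit vectors.
d2_opposite = squared_l2(l2_normalize(a), l2_normalize([-3.0, -4.0]))
print(d2_opposite)
```

Under that assumption, a threshold chosen as a cosine similarity c would correspond to a distance threshold of 2 - 2c, with the comparison direction still "lower is better".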

Article link: https://freeymw.com/article/25738.html
