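Chroma Vector Database Integration: Technical Documentation

1. Overview

This document describes how the Chroma vector database is integrated into the system, covering the design rationale for Collections and Documents and the relationship between them.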

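2. Core Concepts

2.1 Collection

A Collection is a logical container in Chroma that stores and organizes related vector documents.
Each Collection corresponds to a specific business domain or document type.
In this system, a Collection maps to a row in the collections table of the local database.

2.2 Document

A Document is the text content stored in a Collection together with its vector representation.
Each Document contains the original text and its metadata.
In this system, a Document maps to a row in the documents table of the local database.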

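3. Design Rationale

3.1 Why Collections and Documents

3.1.1 Data Organization

Collection A (e.g., technical documentation)
├── Document 1 (installation guide)
├── Document 2 (usage instructions)
└── Document 3 (troubleshooting)

Collection B (e.g., user manual)
├── Document 4 (registration flow)
├── Document 5 (account settings)
└── Document 6 (payment instructions)

3.1.2 Business Isolation

Different Collections can serve different business scenarios.
This separation makes access control and data isolation easier.
Each Collection can be configured independently.

3.1.3 Performance

Searching within a single Collection is more efficient than searching the entire vector space, because only that Collection's vectors are candidates.
Different Collections can use different index strategies and parameters.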
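
To make the scoping argument concrete, here is a minimal brute-force sketch. It is not how Chroma indexes vectors internally; `ScopedVectorIndex` and its methods are hypothetical names used only to show that a query scoped to one collection compares against that collection's vectors and nothing else.

```typescript
type Vector = number[];

interface StoredVector {
  id: string;
  vector: Vector;
}

// Cosine similarity; assumes both vectors have the same length.
function cosineSimilarity(a: Vector, b: Vector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Hypothetical in-memory index that keeps one vector list per collection.
class ScopedVectorIndex {
  private readonly byCollection = new Map<string, StoredVector[]>();

  add(collectionName: string, item: StoredVector): void {
    const items = this.byCollection.get(collectionName) ?? [];
    items.push(item);
    this.byCollection.set(collectionName, items);
  }

  // Scans only the named collection and reports how many vectors were compared.
  search(collectionName: string, query: Vector, topK: number): { ids: string[]; compared: number } {
    const items = this.byCollection.get(collectionName) ?? [];
    const ranked = items
      .map((item) => ({ id: item.id, score: cosineSimilarity(query, item.vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, topK);
    return { ids: ranked.map((r) => r.id), compared: items.length };
  }
}
```

With two vectors in one collection and one in another, a search against the first collection compares two vectors, not three. A real approximate-nearest-neighbour index behaves differently in detail, but the candidate set is still bounded by the collection.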

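3.2 Relationship Between Collection and Document

3.2.1 One-to-Many Relationship

A Collection can contain many Documents.
Every Document must belong to exactly one Collection.
The association is established through the collectionId field.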
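
The one-to-many relationship can be sketched with two record types and a grouping helper. Only `collectionId` comes from the text above; the other field names and the `groupByCollection` helper are assumptions made for illustration, not the system's actual schema.

```typescript
interface CollectionRecord {
  id: number;
  name: string;
}

interface DocumentRecord {
  id: number;
  collectionId: number; // foreign key to CollectionRecord.id
  title: string;
}

// Groups documents under their owning collection and rejects orphans,
// which enforces "every Document belongs to exactly one Collection".
function groupByCollection(
  collections: CollectionRecord[],
  documents: DocumentRecord[],
): Map<number, DocumentRecord[]> {
  const grouped = new Map<number, DocumentRecord[]>();
  for (const collection of collections) {
    grouped.set(collection.id, []);
  }
  for (const doc of documents) {
    const bucket = grouped.get(doc.collectionId);
    if (bucket === undefined) {
      throw new Error(`Document ${doc.id} references unknown collection ${doc.collectionId}`);
    }
    bucket.push(doc);
  }
  return grouped;
}

const sampleCollections: CollectionRecord[] = [
  { id: 1, name: 'technical-docs' },
  { id: 2, name: 'user-manual' },
];
const sampleDocuments: DocumentRecord[] = [
  { id: 1, collectionId: 1, title: 'Installation guide' },
  { id: 2, collectionId: 1, title: 'Usage instructions' },
  { id: 4, collectionId: 2, title: 'Registration flow' },
];
const grouped = groupByCollection(sampleCollections, sampleDocuments);
console.log(grouped.get(1)?.map((d) => d.title));
```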

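3.2.2 Two-Way Synchronization

Local database records are kept in sync with the Chroma vector database.
Create, update, and delete operations are applied to both the local database and Chroma.
The chromaCollectionId and chromaDocumentId fields maintain the mapping between the two.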
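
As an illustration of how the mapping field drives a dual write, the sketch below deletes a document from both stores. `VectorGateway` and `deleteVectors` are hypothetical names standing in for whatever wraps the Chroma client, and the local table is modelled as a `Map`. Deleting the vector first is a design choice of this sketch: if the remote call fails, the local row survives and the delete can be retried.

```typescript
interface DocumentRow {
  id: number;
  collectionName: string;
  chromaDocumentId: string | null; // null until the vector has been written
}

// Hypothetical gateway that hides the actual Chroma client calls.
interface VectorGateway {
  deleteVectors(collectionName: string, ids: string[]): Promise<void>;
}

// Removes a document from both stores; returns false if the row does not exist.
async function deleteDocument(
  rows: Map<number, DocumentRow>,
  gateway: VectorGateway,
  id: number,
): Promise<boolean> {
  const row = rows.get(id);
  if (row === undefined) {
    return false;
  }
  // Only call the vector store when a mapping exists.
  if (row.chromaDocumentId !== null) {
    await gateway.deleteVectors(row.collectionName, [row.chromaDocumentId]);
  }
  rows.delete(id);
  return true;
}
```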

4. Implementation

4.1 Architecture

┌─────────────────┐        ┌──────────────────┐        ┌─────────────────┐
│   Frontend UI   │───────▶│  Service layer   │───────▶│ Vector database │
│ (upload/query)  │        │ (Service)        │        │ (Chroma)        │
└─────────────────┘        └──────────────────┘        └─────────────────┘
                                    │                            │
                                    ▼                            ▼
                           ┌──────────────────┐        ┌─────────────────┐
                           │  Local database  │        │ Vectors         │
                           │ (MySQL)          │        │ (Embeddings)    │
                           └──────────────────┘        └─────────────────┘

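4.2 Data Synchronization Flow

4.2.1 Collection Creation Flow

Create the Collection record in the local database.
Create the corresponding collection in Chroma, passing the embedding function configuration.
Write the collection ID returned by Chroma back to the local record.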
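
The three steps above can be sketched as follows. `CollectionTable` and `CollectionGateway` are hypothetical stand-ins for the local repository and for a wrapper around the Chroma client; the gateway is assumed to have been constructed with the embedding function already configured.

```typescript
interface CollectionRow {
  id: number;
  name: string;
  chromaCollectionId: string | null; // filled in once Chroma has created the collection
}

// Hypothetical wrapper around the Chroma client.
interface CollectionGateway {
  createCollection(name: string): Promise<{ id: string }>;
}

// Minimal in-memory replacement for the local collections table.
class CollectionTable {
  private nextId = 1;
  readonly rows: CollectionRow[] = [];

  insert(name: string): CollectionRow {
    const row: CollectionRow = { id: this.nextId++, name, chromaCollectionId: null };
    this.rows.push(row);
    return row;
  }
}

async function createCollection(
  table: CollectionTable,
  gateway: CollectionGateway,
  name: string,
): Promise<CollectionRow> {
  // Step 1: create the local record.
  const row = table.insert(name);
  // Step 2: create the collection in the vector database.
  const remote = await gateway.createCollection(name);
  // Step 3: store the returned ID on the local record.
  row.chromaCollectionId = remote.id;
  return row;
}
```

If step 2 throws, the local row remains with `chromaCollectionId` set to `null`, which marks it as not yet synchronized.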

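4.2.2 Document Creation Flow

Create the Document record in the local database.
Look up the associated Collection.
Add the document vector to the corresponding collection in Chroma.
Write the document ID returned by Chroma back to the local record.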
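
A minimal sketch of this flow is shown below. `DocumentGateway.addDocument` is a hypothetical wrapper that returns the ID of the stored vector; with Chroma's own client the caller supplies IDs when adding records, so such a wrapper would typically generate the ID itself and return it. In this sketch the collection lookup is moved ahead of the local insert so that an unknown `collectionId` fails before anything is written; that reordering is a choice made here, not something the flow above prescribes.

```typescript
interface CollectionRef {
  id: number;
  name: string;
}

interface DocumentRow {
  id: number;
  collectionId: number;
  content: string;
  chromaDocumentId: string | null;
}

// Hypothetical wrapper around the vector store; resolves to the stored vector's ID.
interface DocumentGateway {
  addDocument(collectionName: string, content: string): Promise<string>;
}

async function createDocument(
  collections: Map<number, CollectionRef>,
  documents: DocumentRow[],
  gateway: DocumentGateway,
  input: { collectionId: number; content: string },
): Promise<DocumentRow> {
  // Look up the owning collection first so bad input writes nothing.
  const collection = collections.get(input.collectionId);
  if (collection === undefined) {
    throw new Error(`Unknown collection ${input.collectionId}`);
  }
  // Create the local record.
  const row: DocumentRow = {
    id: documents.length + 1,
    collectionId: input.collectionId,
    content: input.content,
    chromaDocumentId: null,
  };
  documents.push(row);
  // Add the document to the matching collection in the vector store.
  const chromaId = await gateway.addDocument(collection.name, input.content);
  // Store the returned ID on the local record.
  row.chromaDocumentId = chromaId;
  return row;
}
```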

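4.3 Fault Tolerance

4.3.1 Handling Connection Failures

When the Chroma service is unavailable, the system can still create and manage local records.
The system records the sync status of each record and retries automatically once the service recovers.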
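
One way to sketch the retry behaviour is a status field on each record plus a function that pushes every pending record again. The `syncStatus` and `attempts` fields, the status values, and the `maxAttempts` limit are all assumptions of this sketch; the text above does not specify them.

```typescript
type SyncStatus = 'pending' | 'synced' | 'failed';

interface SyncableRecord {
  id: number;
  syncStatus: SyncStatus;
  attempts: number; // number of failed pushes so far
}

// Pushes every pending record to the vector store and updates its status.
// Returns how many records were synchronized in this pass.
async function retryUnsynced(
  records: SyncableRecord[],
  push: (record: SyncableRecord) => Promise<void>,
  maxAttempts = 3,
): Promise<number> {
  let synced = 0;
  for (const record of records) {
    if (record.syncStatus !== 'pending') {
      continue;
    }
    try {
      await push(record);
      record.syncStatus = 'synced';
      synced++;
    } catch {
      record.attempts++;
      // Give up after too many failures so one bad record cannot retry forever.
      if (record.attempts >= maxAttempts) {
        record.syncStatus = 'failed';
      }
    }
  }
  return synced;
}
```

A scheduler or a "service recovered" hook could call `retryUnsynced` periodically; records that end up as `failed` need manual attention.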

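4.3.2 Embedding Function Configuration

AlibabaTongyiEmbeddings is used as the single embedding function across the system.
The embedding function configuration must be passed correctly whenever a collection is created.
This avoids the dependency problems caused by falling back to the default embedding function.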
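
Passing the embedding function usually needs a small adapter, because LangChain-style embeddings expose `embedDocuments(texts)` while the chromadb client, as far as I know, expects an object with a `generate(texts)` method. The sketch below defines both shapes locally instead of importing the libraries, so the interface names are illustrative; check the exact types against the installed chromadb version.

```typescript
// Shape of a LangChain-style embeddings object (only the method used here).
interface LangChainLikeEmbeddings {
  embedDocuments(texts: string[]): Promise<number[][]>;
}

// Shape assumed for the embedding function accepted by the Chroma client.
interface ChromaLikeEmbeddingFunction {
  generate(texts: string[]): Promise<number[][]>;
}

// Adapts one shape to the other without changing the vectors.
function toChromaEmbeddingFunction(
  embeddings: LangChainLikeEmbeddings,
): ChromaLikeEmbeddingFunction {
  return {
    generate: (texts: string[]) => embeddings.embedDocuments(texts),
  };
}

// Stand-in embeddings for the example: one dimension holding the text length.
const fakeEmbeddings: LangChainLikeEmbeddings = {
  embedDocuments: async (texts) => texts.map((text) => [text.length]),
};

const embeddingFunction = toChromaEmbeddingFunction(fakeEmbeddings);
```

In the real system the adapter would wrap the shared AlibabaTongyiEmbeddings instance, and the result would be passed as the embedding function when creating or fetching a collection.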

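5. Unifying Chroma Initialization

5.1 Analysis of the Current Problem

Chroma initialization code currently lives in several places, spread across different services:

RAGService
CollectionService
DocumentService

This scattered initialization causes the following problems:

Duplicated code: the same initialization logic is repeated in several places.
Inconsistent configuration: different services may use different configuration parameters.
Hard to maintain: a configuration change has to be applied in several places at once.
Wasted resources: multiple identical client instances may be created.

5.2 Unified Initialization Design

5.2.1 Approach

Create a single Chroma configuration and initialization service that every module needing the vector database depends on: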

[ChromaService] (singleton)
        ↑ (dependency injection)
[ RAGService ] [ CollectionService ] [ DocumentService ]

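5.2.2 Key Implementation Points

Singleton: the whole application shares one Chroma client instance.
Lazy initialization: connect on first use so that startup is not blocked.
Connection state management: track the connection state and provide a reconnect mechanism.
Centralized configuration: all Chroma-related configuration is managed in one place.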
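
The lazy-initialization and reconnect points can be sketched with a small generic holder that memoizes the connection promise. `LazyConnection` is a hypothetical helper, not part of Chroma or NestJS; the `connect` callback is where a real client would be constructed and health-checked.

```typescript
// Connects on first use, shares the in-flight attempt between concurrent
// callers, and forgets a failed attempt so the next call reconnects.
class LazyConnection<T> {
  private readonly connect: () => Promise<T>;
  private pending: Promise<T> | null = null;

  constructor(connect: () => Promise<T>) {
    this.connect = connect;
  }

  get(): Promise<T> {
    if (this.pending === null) {
      this.pending = this.connect().catch((error: unknown) => {
        // Drop the failed attempt so a later call can try again.
        this.pending = null;
        throw error;
      });
    }
    return this.pending;
  }

  // Forces the next get() to open a fresh connection.
  reset(): void {
    this.pending = null;
  }
}
```

Registered as a singleton provider, a holder like this gives every service the same client without connecting during application startup.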

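5.2.3 Recommended Implementation

typescript

import { Injectable, OnModuleInit, Scope } from '@nestjs/common';
import { ChromaClient } from 'chromadb';
import { Chroma } from '@langchain/community/vectorstores/chroma';
import { AlibabaTongyiEmbeddings } from '@langchain/community/embeddings/alibaba_tongyi';
import { Embeddings } from '@langchain/core/embeddings';
// GetConfig is the project's own configuration helper; import it from wherever it is defined.

// Scope.DEFAULT makes the provider a singleton shared across the application.
@Injectable({ scope: Scope.DEFAULT })
export class ChromaService implements OnModuleInit {
  private chromaClient: ChromaClient | null = null;
  private embeddings: Embeddings | null = null;
  private isChromaAvailable = false;

  async onModuleInit() {
    try {
      // Initialize the Chroma client
      this.chromaClient = new ChromaClient({
        host: process.env.CHROMA_HOST || 'localhost',
        port: Number.parseInt(process.env.CHROMA_PORT ?? '', 10) || 8000,
        ssl: process.env.CHROMA_SSL === 'true'
      });

      // Initialize the embedding function
      const { MODELS_KEY } = GetConfig();
      this.embeddings = new AlibabaTongyiEmbeddings({
        verbose: true,
        apiKey: MODELS_KEY
      });

      // Test the connection
      await this.chromaClient.listCollections();
      this.isChromaAvailable = true;
    } catch (error) {
      // Record the failure instead of throwing so the application can still start
      const message = error instanceof Error ? error.message : String(error);
      console.error('Failed to initialize Chroma:', message);
      this.isChromaAvailable = false;
    }
  }

  // Returns the Chroma client instance, or null if initialization has not run
  getClient(): ChromaClient | null {
    return this.chromaClient;
  }

  // Returns the embedding function instance, or null if initialization has not run
  getEmbeddings(): Embeddings | null {
    return this.embeddings;
  }

  // Reports whether the connection test succeeded
  isAvailable(): boolean {
    return this.isChromaAvailable;
  }

  // Creates a vector store bound to the shared client and embeddings
  createVectorStore(collectionName: string): Chroma {
    if (!this.chromaClient || !this.embeddings) {
      throw new Error('Chroma not initialized');
    }

    return new Chroma(this.embeddings, {
      index: this.chromaClient,
      collectionName
    });
  }
}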

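5.3 Refactoring Existing Services

5.3.1 Refactoring RAGService

typescript

import { Injectable, OnModuleInit } from '@nestjs/common';
import { Chroma } from '@langchain/community/vectorstores/chroma';
// ChromaService is the shared service from section 5.2.3.

@Injectable()
export class RAGService implements OnModuleInit {
  // Stays null when Chroma was unavailable at startup
  private vectorStore: Chroma | null = null;

  constructor(private readonly chromaService: ChromaService) {}

  async onModuleInit() {
    if (this.chromaService.isAvailable()) {
      this.vectorStore = this.chromaService.createVectorStore('blog-content');
    }
  }
}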

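5.3.2 Refactoring CollectionService

typescript

import { Injectable } from '@nestjs/common';
// ChromaService is the shared service from section 5.2.3; Collection is the local entity.

@Injectable()
export class CollectionService {
  constructor(
    private readonly chromaService: ChromaService,
    // other dependencies...
  ) {}

  async create(createCollectionDto: Partial<Collection>): Promise<Collection> {
    // Local operations... (these produce the `collection` record used below)

    // Vector database operations
    if (this.chromaService.isAvailable()) {
      try {
        const chromaClient = this.chromaService.getClient();
        const embeddings = this.chromaService.getEmbeddings();
        if (!chromaClient || !embeddings) {
          throw new Error('Chroma not initialized');
        }
        const chromaCollection = await chromaClient.createCollection({
          name: collection.name,
          metadata: collection.metadata || undefined,
          // Assumption: chromadb expects an object with generate(texts), while
          // LangChain embeddings expose embedDocuments(texts), so adapt between them.
          embeddingFunction: {
            generate: (texts: string[]) => embeddings.embedDocuments(texts)
          }
        });
        // Follow-up operations...
      } catch (error) {
        // Error handling...
      }
    }

    return collection;
  }
}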

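5.4 Benefits

Unified configuration: all Chroma-related configuration is managed in one place.
Better resource usage: client instances are no longer created repeatedly.
Easier maintenance: a configuration change is made in a single place.
Consistent state: the connection state is the same throughout the application.
Easier to extend: adding new configuration options or features is simpler.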

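6. FileService Module Design

6.1 Responsibilities

FileService is a service module dedicated to file processing. Its main responsibilities are:

Receiving and processing files uploaded by users in various formats (PDF, Markdown, plain text, and so on).
Parsing file content and extracting the text.
Splitting large texts into chunks to fit the limits of the vector database.
Returning a structured parse result to the caller.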
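
The chunking responsibility can be illustrated with a fixed-size splitter that overlaps neighbouring chunks. This is a minimal character-based sketch; `chunkText`, `chunkSize`, and `overlap` are names chosen here, and a production splitter would more likely cut on sentence or token boundaries.

```typescript
// Splits text into chunks of at most `chunkSize` characters, where each chunk
// repeats the last `overlap` characters of the previous one.
function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError('chunkSize must be a positive integer');
  }
  if (!Number.isInteger(overlap) || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError('overlap must be an integer in [0, chunkSize)');
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current chunk already reaches the end of the text.
    if (start + chunkSize >= text.length) {
      break;
    }
  }
  return chunks;
}

console.log(chunkText('abcdefghij', 4, 1)); // [ 'abcd', 'defg', 'ghij' ]
```

The overlap keeps a little shared context between adjacent chunks, which tends to help retrieval when a relevant passage straddles a chunk boundary.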

6.2 Relationship to Other Modules

[Frontend upload] 
    ↓
[DocumentController] → [DocumentService]
                            ↓ (calls)
                        [FileService] → parses the file and returns the result
                            ↓
                    [DocumentService] → creates the document record and syncs it to Chroma

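6.3 Design Principles

Stateless: FileService keeps no intermediate state from file processing.
Single responsibility: it only parses files and never persists data.
Reusable: other modules can also call FileService to process files.
Extensible: parsers for new file formats can be added.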
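
The extensibility principle can be sketched as a registry that maps file extensions to parser functions, so supporting a new format means registering one more function. `ParserRegistry`, `FileParser`, and `ParsedFile` are hypothetical names; content is passed as a string to keep the example short, whereas a binary format such as PDF would need a buffer-based parser.

```typescript
interface ParsedFile {
  format: string;
  text: string;
}

type FileParser = (content: string) => ParsedFile;

// Holds only the parser table; no per-file state is kept between calls.
class ParserRegistry {
  private readonly parsers = new Map<string, FileParser>();

  register(extension: string, parser: FileParser): void {
    this.parsers.set(extension.toLowerCase(), parser);
  }

  parse(filename: string, content: string): ParsedFile {
    const dot = filename.lastIndexOf('.');
    const extension = dot < 0 ? '' : filename.slice(dot + 1).toLowerCase();
    const parser = this.parsers.get(extension);
    if (parser === undefined) {
      throw new Error(`Unsupported file format: "${extension}"`);
    }
    return parser(content);
  }
}

const registry = new ParserRegistry();
registry.register('txt', (content) => ({ format: 'text', text: content.trim() }));
// Very rough Markdown handling: strip heading markers only.
registry.register('md', (content) => ({
  format: 'markdown',
  text: content.replace(/^#{1,6}\s+/gm, '').trim(),
}));
```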

7. API Design

7.1 Collection Endpoints

POST /collections/create - Create a collection
GET /collections/list - List collections
GET /collections/:id - Get collection details
PUT /collections/:id - Update a collection
DELETE /collections/:id - Delete a collection

7.2 Document Endpoints

POST /documents/upload - Upload a document
GET /documents/list - List documents
GET /documents/:id - Get document details
PUT /documents/:id - Update a document
DELETE /documents/:id - Delete a document
POST /documents/search - Search documents
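
The request bodies are not specified in this document. As a purely hypothetical sketch, the body of POST /documents/search could be validated as below; the field names `query`, `collectionId`, and `topK`, the default of 5, and the upper bound of 50 are all assumptions made for illustration.

```typescript
interface SearchRequest {
  query: string;
  collectionId: number;
  topK: number;
}

// Validates an untrusted JSON body and returns a normalized request.
function parseSearchRequest(body: unknown): SearchRequest {
  if (typeof body !== 'object' || body === null) {
    throw new TypeError('Request body must be a JSON object');
  }
  const { query, collectionId, topK = 5 } = body as Record<string, unknown>;
  if (typeof query !== 'string' || query.trim() === '') {
    throw new TypeError('query must be a non-empty string');
  }
  if (typeof collectionId !== 'number' || !Number.isInteger(collectionId)) {
    throw new TypeError('collectionId must be an integer');
  }
  if (typeof topK !== 'number' || !Number.isInteger(topK) || topK < 1 || topK > 50) {
    throw new TypeError('topK must be an integer between 1 and 50');
  }
  return { query: query.trim(), collectionId, topK };
}
```

Requiring `collectionId` keeps every search scoped to a single Collection, in line with the isolation goals in section 3.1.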

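8. Best Practices

8.1 Naming Conventions

Collection names should be meaningful to the business and globally unique.
Document metadata should carry enough context information.

8.2 Performance

Plan the granularity of Collections sensibly.
Split large documents into appropriately sized chunks.
Use a suitable metadata indexing strategy.

8.3 Error Handling

Always check the Chroma connection state.
Provide clear error messages and a retry mechanism.
Keep local operations intact even when the vector database operation fails.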

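9. Common Problems and Solutions

9.1 Embedding Function Configuration

Problem: No embedding function configuration found
Solution: Make sure the embedding function is passed when the collection is created.

9.2 Connection Failures

Problem: Failed to connect to chromadb
Solution: Check network connectivity and the status of the Chroma service.

9.3 Data Synchronization

Problem: Local records and the vector database are out of sync.
Solution: Check the sync status field and implement an automatic retry mechanism.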
