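OutputParser: From "Post-Processing Strings" to "Structured Output Generators"

Converting a model's text output into structured data is a common and important requirement in LLM applications. With its OutputParser mechanism, LangChain V3 replaces ad hoc string post-processing with a structured output generator. The change makes data handling more accurate and reliable, and it adds type safety and explicit error handling. This chapter examines the design and implementation of OutputParser.

Problems with traditional post-processing

In early LLM applications, developers usually extracted the information they needed with simple string processing: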

typescript

// Example of the traditional approach
// (assumes `llm` is a text-completion model whose invoke() resolves to a string)
async function getTranslation() {
  const response = await llm.invoke("Translate 'Hello World' into Chinese");
  // Simple string processing
  const translation = response.trim();
  return translation;
}

async function getPersonInfo() {
  const response = await llm.invoke("Extract Zhang San's details: age 30, profession engineer");
  // Hand-written regular expressions tied to one exact phrasing
  const ageMatch = response.match(/age (\d+)/);
  const professionMatch = response.match(/profession (\w+)/);
  
  return {
    name: "Zhang San",
    age: ageMatch ? parseInt(ageMatch[1], 10) : null,
    profession: professionMatch ? professionMatch[1] : null
  };
}

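This approach has the following problems:

Fragility - sensitive to small changes in the model's output format
Hard to maintain - the regular expressions are complex and difficult to keep up to date
Weak error handling - there is no structured way to report or recover from failures
No type safety - nothing is checked at compile time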
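
To make the fragility point concrete, here is a minimal sketch of a regex-based extractor meeting two phrasings of the same facts. `extractAge` is a hypothetical helper written for this illustration, not part of LangChain.

```typescript
// Hypothetical helper illustrating regex-based post-processing.
function extractAge(response: string): number | null {
  // Expects the exact phrasing "age 30"
  const match = response.match(/age (\d+)/);
  return match ? parseInt(match[1], 10) : null;
}

// Works when the model follows the expected phrasing
const expected = extractAge("Zhang San, age 30, profession engineer");

// The same facts, phrased slightly differently, are silently lost
const rephrased = extractAge("Zhang San is 30 years old and works as an engineer");

console.log(expected, rephrased); // 30 null
```

The second call does not throw; it just returns `null`, which is why this kind of failure tends to surface far from where it happened.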

The modern design of OutputParser

LangChain V3 addresses these problems with OutputParser. The following is a simplified sketch of the base class:

typescript

abstract class BaseOutputParser<T> implements Runnable<string, T> {
  abstract parse(text: string): Promise<T> | T;
  
  async invoke(input: string): Promise<T> {
    return await this.parse(input);
  }
  
  async batch(inputs: string[]): Promise<T[]> {
    return await Promise.all(inputs.map(input => this.parse(input)));
  }
  
  // Optional streaming method; this default simply yields the fully parsed result once
  async *stream(input: string): AsyncGenerator<T> {
    const result = await this.parse(input);
    yield result;
  }
}
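
As an illustration of how little a concrete parser has to implement, the sketch below defines one on top of a cut-down stand-in for the base class. Both `MiniOutputParser` and `CommaListParser` are hypothetical names introduced here, and `parse` is kept synchronous for brevity.

```typescript
// Simplified stand-in for the base class sketched above (assumption:
// only the parts needed for this example are reproduced).
abstract class MiniOutputParser<T> {
  abstract parse(text: string): T;

  async invoke(input: string): Promise<T> {
    return this.parse(input);
  }
}

// Hypothetical parser: turns "a, b, c" into ["a", "b", "c"]
class CommaListParser extends MiniOutputParser<string[]> {
  parse(text: string): string[] {
    return text
      .split(",")
      .map((item) => item.trim())
      .filter((item) => item.length > 0);
  }
}

const listParser = new CommaListParser();
console.log(listParser.parse("red, green , blue,")); // [ 'red', 'green', 'blue' ]
```

Only `parse` is specific to the format; `invoke` comes from the base class, which is what lets every parser be used the same way in a chain.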

JsonOutputParser implementation

JsonOutputParser is one of the most commonly used OutputParsers. It parses JSON-formatted output into structured data:

typescript

class JsonOutputParser<T extends Record<string, any> = Record<string, any>> 
  extends BaseOutputParser<T> {
  
  private schema?: any; // Optional schema; can come from Zod or another validation library
  
  constructor(schema?: any) {
    super();
    this.schema = schema;
  }
  
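  async parse(text: string): Promise<T> {
    // Clean the text and extract the JSON portion
    const jsonText = this.extractJson(text);
    
    let parsed: unknown;
    try {
      parsed = JSON.parse(jsonText);
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      throw new OutputParserException(
        `Failed to parse JSON: ${message}`,
        text,
        error
      );
    }
    
    // Validate outside the try block so that a validation failure is not
    // re-wrapped and reported as a JSON syntax error
    return this.schema ? this.validate(parsed) : (parsed as T);
  }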
  
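  private extractJson(text: string): string {
    // Look for a JSON object or array. When both appear, take whichever
    // starts first so a top-level array of objects is not truncated.
    const objectMatch = text.match(/\{[\s\S]*\}/);
    const arrayMatch = text.match(/\[[\s\S]*\]/);
    
    if (objectMatch && arrayMatch) {
      return (objectMatch.index ?? 0) <= (arrayMatch.index ?? 0)
        ? objectMatch[0]
        : arrayMatch[0];
    }
    
    if (objectMatch) {
      return objectMatch[0];
    }
    
    if (arrayMatch) {
      return arrayMatch[0];
    }
    
    // No obvious JSON structure found; return the trimmed text
    return text.trim();
  }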
  
  private validate(data: any): T {
    // Duck-typed check: works with Zod or any schema object exposing parse()
    if (this.schema && typeof this.schema.parse === 'function') {
      try {
        return this.schema.parse(data);
      } catch (error) {
        const message = error instanceof Error ? error.message : String(error);
        throw new OutputParserException(
          `JSON validation failed: ${message}`,
          JSON.stringify(data),
          error
        );
      }
    }
    
    return data;
  }
}

// Usage example
interface PersonInfo {
  name: string;
  age: number;
  profession: string;
}

const jsonParser = new JsonOutputParser<PersonInfo>();

const llmOutput = `
{
  "name": "Zhang San",
  "age": 30,
  "profession": "engineer"
}
`;

const personInfo = await jsonParser.parse(llmOutput);
// personInfo is typed as PersonInfo at compile time; without a schema, its shape is not checked at runtime
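
The generic parameter alone is a compile-time promise, so a sketch of what a schema adds at runtime may help. The `personSchema` object below is a hypothetical hand-written validator that exposes the same `parse()` shape the `validate` method above looks for. A library such as Zod would normally play this role.

```typescript
interface PersonInfo {
  name: string;
  age: number;
  profession: string;
}

// Hypothetical hand-written schema exposing a Zod-like parse() method
const personSchema = {
  parse(data: unknown): PersonInfo {
    if (typeof data !== "object" || data === null) {
      throw new Error("expected an object");
    }
    const { name, age, profession } = data as Record<string, unknown>;
    if (typeof name !== "string") throw new Error("name must be a string");
    if (typeof age !== "number") throw new Error("age must be a number");
    if (typeof profession !== "string") throw new Error("profession must be a string");
    return { name, age, profession };
  },
};

// The model returned age as a string; a type cast would hide this, the schema does not
const raw: unknown = JSON.parse(
  '{"name": "Zhang San", "age": "30", "profession": "engineer"}'
);

let validationError = "";
try {
  personSchema.parse(raw);
} catch (error) {
  validationError = error instanceof Error ? error.message : String(error);
}
console.log(validationError); // age must be a number
```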

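PydanticOutputParser implementation

PydanticOutputParser takes its name from Python's Pydantic library. A similar pattern can be used in TypeScript to provide stronger type validation: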

typescript

class PydanticOutputParser<T> extends BaseOutputParser<T> {
  private schema: any; // Pydantic-like model definition; validate() and example() are assumed methods
  
  constructor(schema: any) {
    super();
    this.schema = schema;
  }
  
  async parse(text: string): Promise<T> {
    // Extract the JSON portion
    const jsonText = this.extractJson(text);
    
    try {
      const parsed = JSON.parse(jsonText);
      // Validate and convert the data with the schema
      return this.schema.validate(parsed);
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      if (error instanceof SyntaxError) {
        throw new OutputParserException(
          `JSON parsing failed: ${message}`,
          text,
          error
        );
      } else {
        throw new OutputParserException(
          `Data validation failed: ${message}`,
          jsonText,
          error
        );
      }
    }
  }
  
  private extractJson(text: string): string {
    // Prefer the contents of a fenced code block, if the model used one
    const codeBlockMatch = text.match(/```(?:json)?\s*([\s\S]*?)\s*```/);
    if (codeBlockMatch) {
      return codeBlockMatch[1];
    }
    
    // Otherwise fall back to the first object or array in the text
    const jsonMatch = text.match(/(\{[\s\S]*\}|\[[\s\S]*\])/);
    if (jsonMatch) {
      return jsonMatch[1];
    }
    
    return text.trim();
  }
  
  getFormatInstructions(): string {
    // Build format instructions that can be included in the LLM prompt
    return `Please answer in the following JSON format:
${JSON.stringify(this.schema.example(), null, 2)}`;
  }
}
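
The extraction step is easier to follow on a concrete reply. The sketch below is a standalone version of it, with `extractJsonText` as a hypothetical function name. The fence marker is built with `repeat` only so that the snippet can itself sit inside a fenced block.

```typescript
// Three backticks, built programmatically to avoid closing this code block
const FENCE = "`".repeat(3);
const fencedBlock = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)\\s*" + FENCE);

// Hypothetical standalone version of the extraction step
function extractJsonText(text: string): string {
  // Prefer the contents of a fenced code block
  const fenced = text.match(fencedBlock);
  if (fenced) return fenced[1];

  // Otherwise take the first object or array found in the text
  const bare = text.match(/(\{[\s\S]*\}|\[[\s\S]*\])/);
  return bare ? bare[1] : text.trim();
}

// A typical chat reply: prose around a fenced JSON payload
const reply = [
  "Here is the result:",
  FENCE + "json",
  '{"name": "Zhang San", "age": 30}',
  FENCE,
  "Let me know if you need anything else.",
].join("\n");

console.log(JSON.parse(extractJsonText(reply))); // { name: 'Zhang San', age: 30 }
```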

XMLOutputParser implementation

For scenarios that call for XML-formatted output:

typescript

class XMLOutputParser<T> extends BaseOutputParser<T> {
  private rootTag: string;
  
  constructor(rootTag: string) {
    super();
    this.rootTag = rootTag;
  }
  
  async parse(text: string): Promise<T> {
    const xmlText = this.extractXML(text);
    
    try {
      const parsed = await this.parseXML(xmlText);
      return parsed as T;
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      throw new OutputParserException(
        `XML parsing failed: ${message}`,
        xmlText,
        error
      );
    }
  }
  
  private extractXML(text: string): string {
    const tagPattern = `<${this.rootTag}[\\s\\S]*?</${this.rootTag}>`;
    const match = text.match(new RegExp(tagPattern, 'i'));
    
    if (match) {
      return match[0];
    }
    
    // No complete root element found; strip code-fence markers and return the text
    return text.replace(/```xml\s*|\s*```/g, '').trim();
  }
  
  private async parseXML(xmlText: string): Promise<Record<string, any>> {
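    // Simplified XML parsing: handles only flat <tag>text</tag> pairs,
    // without attributes or nesting.
    // In a real application, use xml2js or a similar library.
    const result: Record<string, any> = {};
    
    // Match every leaf element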
    const tagMatches = xmlText.match(/<(\w+)>([^<]*)<\/\1>/g);
    
    if (tagMatches) {
      for (const tagMatch of tagMatches) {
        const match = tagMatch.match(/<(\w+)>([^<]*)<\/\1>/);
        if (match) {
          const [, tagName, tagValue] = match;
          result[tagName] = tagValue.trim();
        }
      }
    }
    
    return result;
  }
}
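
A short sketch of the simplified tag matcher shows both what it returns and where it stops being adequate. `parseFlatXml` is a hypothetical standalone rewrite of the `parseXML` logic above.

```typescript
// Hypothetical standalone version of the simplified tag matcher
function parseFlatXml(xmlText: string): Record<string, string> {
  const result: Record<string, string> = {};
  // Matches <tag>text</tag> pairs whose body contains no child tags
  const pattern = /<(\w+)>([^<]*)<\/\1>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xmlText)) !== null) {
    result[match[1]] = match[2].trim();
  }
  return result;
}

const flat = parseFlatXml("<person><name> Zhang San </name><age>30</age></person>");
console.log(flat); // { name: 'Zhang San', age: '30' }

// Nested values are flattened to their leaves and the grouping is lost
const nested = parseFlatXml("<person><address><city>Beijing</city></address></person>");
console.log(nested); // { city: 'Beijing' }
```

Every value comes back as a string (`age` is `'30'`, not `30`), so casting the result to a typed interface can misstate what the data actually contains.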

Structured output in practice

The following complete example shows how OutputParsers are used for several different data structures:

typescript

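// Imports assumed for this example (package paths may differ between LangChain versions)
import { PromptTemplate } from "@langchain/core/prompts";
import { BaseChatModel } from "@langchain/core/language_models/chat_models";
import { ChatOpenAI } from "@langchain/openai";

// Define the data structures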
interface TranslationResult {
  originalText: string;
  translatedText: string;
  sourceLanguage: string;
  targetLanguage: string;
  confidence: number;
}

interface PersonInfo {
  name: string;
  age: number;
  email: string;
  skills: string[];
}

interface ProductReview {
  productName: string;
  rating: number;
  pros: string[];
  cons: string[];
  recommendation: boolean;
}

// Create the parsers
const translationParser = new JsonOutputParser<TranslationResult>();
const personParser = new JsonOutputParser<PersonInfo>();
const reviewParser = new JsonOutputParser<ProductReview>();

// Example class that wires prompts, model, and parsers together
class StructuredOutputExample {
  private llm: BaseChatModel;
  
  constructor(llm: BaseChatModel) {
    this.llm = llm;
  }
  
  async translateText(
    text: string, 
    sourceLang: string, 
    targetLang: string
  ): Promise<TranslationResult> {
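    const prompt = new PromptTemplate({
      // Literal braces are doubled so the f-string template does not
      // treat the JSON example as input variables
      template: `Translate the following text and return the result as JSON:
Text: {text}
Source language: {sourceLang}
Target language: {targetLang}

Return JSON in the following format:
{{
  "originalText": "original text",
  "translatedText": "translated text",
  "sourceLanguage": "source language",
  "targetLanguage": "target language",
  "confidence": 0.95
}}`,
      inputVariables: ["text", "sourceLang", "targetLang"]
    });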
    
    const chain = prompt.pipe(this.llm).pipe(translationParser);
    return await chain.invoke({ text, sourceLang, targetLang });
  }
  
  async extractPersonInfo(text: string): Promise<PersonInfo> {
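    const prompt = new PromptTemplate({
      // Literal braces are doubled so they are not read as input variables
      template: `Extract the personal information from the following text and return it as JSON:
Text: {text}

Return JSON in the following format:
{{
  "name": "name",
  "age": <age as a number>,
  "email": "email address",
  "skills": ["skill 1", "skill 2"]
}}`,
      inputVariables: ["text"]
    });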
    
    const chain = prompt.pipe(this.llm).pipe(personParser);
    return await chain.invoke({ text });
  }
  
  async analyzeProductReview(review: string): Promise<ProductReview> {
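    const prompt = new PromptTemplate({
      // Literal braces are doubled so they are not read as input variables
      template: `Analyze the following product review and return the result as JSON:
Review: {review}

Return JSON in the following format:
{{
  "productName": "product name",
  "rating": <rating from 1 to 5>,
  "pros": ["pro 1", "pro 2"],
  "cons": ["con 1", "con 2"],
  "recommendation": <true or false>
}}`,
      inputVariables: ["review"]
    });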
    
    const chain = prompt.pipe(this.llm).pipe(reviewParser);
    return await chain.invoke({ review });
  }
}

// Usage example
const example = new StructuredOutputExample(new ChatOpenAI());

const translation = await example.translateText(
  "Hello, world!", 
  "English", 
  "Chinese"
);
console.log('Translation result:', translation);

const personInfo = await example.extractPersonInfo(
  "Zhang San, 30 years old, email: zhangsan@example.com, skilled in JavaScript and Python"
);
console.log('Person info:', personInfo);

const reviewAnalysis = await example.analyzeProductReview(
  "This phone takes great photos and the battery life is good, but it is a bit expensive"
);
console.log('Review analysis:', reviewAnalysis);
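
When a prompt embeds a JSON example, the braces in that example collide with the template's `{variable}` syntax. The sketch below builds the format section from an example object and doubles its braces, on the assumption that the template engine treats `{{` and `}}` as literal braces, as f-string-style templates do. `escapeBraces` and `exampleReview` are hypothetical names used only here.

```typescript
// Hypothetical helper: doubles braces so that f-string-style templates
// treat them as literal characters instead of variable markers.
function escapeBraces(text: string): string {
  return text.replace(/\{/g, "{{").replace(/\}/g, "}}");
}

// Example object describing the desired output shape
const exampleReview = {
  productName: "product name",
  rating: 5,
  pros: ["pro 1"],
  cons: ["con 1"],
  recommendation: true,
};

// {review} stays a template variable; the JSON example is escaped
const template =
  "Analyze the following product review and reply in JSON:\n" +
  "Review: {review}\n\n" +
  "Use this format:\n" +
  escapeBraces(JSON.stringify(exampleReview, null, 2));

console.log(template);
```

Generating the example from an object also keeps the prompt and the TypeScript interface from drifting apart as fields are added.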

Error handling and recovery

A modern OutputParser also needs solid error handling and a way to recover from failures:

typescript

class ResilientOutputParser<T> extends BaseOutputParser<T> {
  private primaryParser: BaseOutputParser<T>;
  private fallbackParsers: BaseOutputParser<T>[];
  private maxRetries: number; // stored but not used by parse() in this sketch
  
  constructor(
    primaryParser: BaseOutputParser<T>,
    fallbackParsers: BaseOutputParser<T>[] = [],
    maxRetries: number = 2
  ) {
    super();
    this.primaryParser = primaryParser;
    this.fallbackParsers = fallbackParsers;
    this.maxRetries = maxRetries;
  }
  
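  async parse(text: string): Promise<T> {
    // Try the primary parser first
    try {
      return await this.primaryParser.parse(text);
    } catch (primaryError) {
      console.warn(
        'Primary parser failed, trying fallback parsers:',
        primaryError instanceof Error ? primaryError.message : primaryError
      );
      
      // Try each fallback parser in order
      for (const fallbackParser of this.fallbackParsers) {
        try {
          return await fallbackParser.parse(text);
        } catch (fallbackError) {
          console.warn(
            'Fallback parser failed:',
            fallbackError instanceof Error ? fallbackError.message : fallbackError
          );
        }
      }
      
      // Every parser failed: rethrow the original error
      throw primaryError;
    }
  }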
}

// Create a parser with recovery capability
const resilientParser = new ResilientOutputParser(
  new JsonOutputParser<PersonInfo>(),
  [
    new XMLOutputParser<PersonInfo>('person'),
    new SimpleStringParser() // last-resort fallback (hypothetical class, not defined in this chapter)
  ]
);
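
The fallback idea can also be sketched without classes. The version below is a variation on the parser above, not a copy of it: it collects every failure message instead of rethrowing only the first one, which can make debugging easier. `parseWithFallbacks`, `parseJson`, and `parseKeyValue` are hypothetical names.

```typescript
type ParseFn<T> = (text: string) => T;

// Hypothetical helper: try each parser in order and report every failure
function parseWithFallbacks<T>(text: string, parsers: ParseFn<T>[]): T {
  const failures: string[] = [];
  for (const parse of parsers) {
    try {
      return parse(text);
    } catch (error) {
      failures.push(error instanceof Error ? error.message : String(error));
    }
  }
  throw new Error(`All parsers failed: ${failures.join("; ")}`);
}

// Strict JSON first, then a lenient "key: value" line format
const parseJson: ParseFn<Record<string, string>> = (text) => JSON.parse(text);
const parseKeyValue: ParseFn<Record<string, string>> = (text) => {
  const result: Record<string, string> = {};
  for (const line of text.split("\n")) {
    const separator = line.indexOf(":");
    if (separator === -1) throw new Error(`not a key-value line: ${line}`);
    result[line.slice(0, separator).trim()] = line.slice(separator + 1).trim();
  }
  return result;
};

const recovered = parseWithFallbacks("name: Zhang San\nage: 30", [parseJson, parseKeyValue]);
console.log(recovered); // { name: 'Zhang San', age: '30' }
```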

Integration with LCEL

Because an OutputParser is a Runnable, it slots directly into an LCEL pipeline:

typescript

// Build the complete processing chain
const translationChain = 
  new PromptTemplate({
    template: "Translate '{text}' from {sourceLang} to {targetLang} and reply with a JSON object only",
    inputVariables: ["text", "sourceLang", "targetLang"]
  })
  .pipe(new ChatOpenAI())
  .pipe(new JsonOutputParser<TranslationResult>());

// Invoke the chain
const result = await translationChain.invoke({
  text: "Hello, world!",
  sourceLang: "English",
  targetLang: "Chinese"
});
// result is typed as TranslationResult; without a schema the type is asserted, not validated at runtime
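
To show why the parser's type parameter matters in a chain, here is a deliberately simplified sketch of typed composition. It is not how LCEL is implemented: real steps are asynchronous Runnables, whereas `Step` and `pipe` below are hypothetical synchronous stand-ins, and `fakeModel` returns a fixed string instead of calling a model.

```typescript
// Hypothetical stand-in for a pipeline step (synchronous for brevity)
type Step<In, Out> = (input: In) => Out;

// The output type of the first step must match the input type of the second
function pipe<A, B, C>(first: Step<A, B>, second: Step<B, C>): Step<A, C> {
  return (input) => second(first(input));
}

const formatPrompt: Step<{ text: string }, string> = ({ text }) =>
  `Translate '${text}' into Chinese and reply as JSON`;

// Fake model that always returns the same JSON string
const fakeModel: Step<string, string> = () => '{"translatedText": "你好，世界！"}';

const parseJsonStep: Step<string, { translatedText: string }> = (raw) => JSON.parse(raw);

const chain = pipe(pipe(formatPrompt, fakeModel), parseJsonStep);
console.log(chain({ text: "Hello, world!" }).translatedText); // 你好，世界！
```

The compiler rejects a chain whose steps do not line up, and the final step's output type becomes the type of the whole chain.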

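Summary

LangChain V3's OutputParser mechanism replaces ad hoc string post-processing with a structured output generator, which brings the following advantages:

Type safety - TypeScript generics give the parsed result a static type, backed by runtime checks when a schema is supplied
Error handling - structured error reporting and recovery
Composability - parsers plug directly into LCEL pipelines
Extensibility - parsers can be implemented for many different formats
Validation - validation libraries can be integrated to ensure data quality

With these features, developers can build more robust and reliable LLM applications, in which the conversion from model output to application data is both accurate and safe.

In the next chapter, we will look at how .parse() becomes part of .invoke() and how failures raise OutputParserException, taking a closer look at OutputParser's error handling.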
