Node 中的全链路式日志标记及处理 - 哈喽比特

736次阅读 | 发布于5年以前

是 「山月七八月原创计划」 中的「第五篇」文章，简述了在 Node 服务中的全链路日志如何串起来

本篇文章开始之前先抛出两个问题:

当在客户端捕捉到一条异常请求时，如何有效地排查问题？
当在生产环境中发现某条 API 高延迟，又如何定位问题？

从挂下来的蜘蛛丝可以找到蜘蛛的所在，从灶马爬过留下的痕迹可以查出灶马的去向。

而要更有效解决此类问题，我们需要依赖全链路式的日志作为蛛丝马迹。如当发现测试环境某条 API 延迟过高时，通过该 API 在日志系统中找到所涉及到的所有关键逻辑及数据库查询，查找是否 SQL 查询过多或其中有慢查询所致，或者是否被上游服务拖累。

在微服务架构中，标记全链路日志有助于更好的解决 bug 和分析接口性能，本篇文章介绍在 Node 中如何标记全链路式日志

当一个请求到来时，服务器端会产生哪些日志

AccessLog: 这是最常见的日志类型，一般在 nginx 等方向代理中也有日志记录，但在业务系统中有时需要更详细的日志记录，如 API 耗时，详细的 request body 与 response body
SQLLog: 关于数据库查询的日志，记录 SQL、涉及到的 table、以及执行时间，「从此可以筛选出执行过慢的 SQL，也可以筛选出某条 API 对应的 SQL 条数」
RequestLog: 请求第三方服务产生的日志
Exception: 异常
RedisLog: 缓存，也有一些非缓存的操作如 zset 及分布式锁等
Message Queue Log: 记录生产消息及消费消息的日志
CronLog: 记录定时任务执行的时间以及是否成功
关键业务逻辑

如何标记全链路上所有日志

「使用 requestId 唯一标识每个请求，有时它又被称为 sessionId 或者 transactionId，在更多情况下它被称作 traceId」

操作步骤如下

使用一个中间件获取 requestId，并存储到 Context 中。中间件通过读取请求头 X-Request-Id 来获取，如果请求方未设置，则自动生成
使 requestId 在整个链路进行手动传递，如读取 context.requestId，手动注入到 ORM 进行 SQL 查询的日志中

// 关于处理 requestId 的 koa middleware
async function context (ctx: KoaContext, next: any) {
  // 从前端获取到 requestId，如果前端没有，则自己生成随机Id
  const requestId = ctx.header['x-request-id'] || uuid()
  // 设置 requestId 的响应头
  ctx.res.setHeader('X-Request-Id', requestId)
  // 存储到 Context 中
  ctx.requestId = requestId
  await next()
}

app.use('/todos/:id', (ctx) => {
  const user = await User.findByPk(ctx.body.id, {
    logging (sql) {
      loger.log(sql, {
        // 从 Context 中获取信息手动传递 requestId
        requestId: ctx.requestId
      })
    }
  })
})

「显而易见，这样手动层层传递很繁琐，特别是在崇尚分层的服务器架构中，这样可能需要传递五六层」 此时需要以更小侵入性的方式来标记 requestId

降低侵入性

如上，在每次数据库查询时手动对 requestId 进行标记过于繁琐，何况除了与数据库交互，还要有诸多微服务进行交互。

此时可以统一设计 logger 函数进行标记，并且使用 CLS (Continues Local Storage) 来管理异步资源中的 requestId。

关于 CLS 的实现，实例中用到了 cls-hooked，它关于异步资源生命周期追踪的原理是 async_hooks。

通过 async_hooks[1] 可以追踪异步行为的生命周期
通过 cls-hooked[2] 可以获得每次异步请求的 requestId

如下代码中

lib/session.ts: CLS 异步资源存储
lib/logger.ts: 全局 logger，可参考上一章如何在 Node 中高效地打日志[3]

// lib/session.ts
import { createNamespace } from 'cls-hooked'

const session = createNamespace('hello, world')

export { session }

// lib/logger.ts
import winston, { format } from 'winston'

const requestId = format((info) => {
  // 每次自动获取 requestId
  info.requestId = session.get('requestId')
  return info
})

const logger = winston.createLogger({
  format: format.combine(
    requestId(),
    format.json()
  )
})

如果你使用过 zipkin，一款全链路式日志分析工具的话，它其中也是用了 CLS

const zipkin = require('zipkin');

// In Node.js, the recommended context API to use is zipkin-context-cls.
const CLSContext = require('zipkin-context-cls');
const ctxImpl = new CLSContext(); // if you want to use CLS
const xtxImpl = new zipkin.ExplicitContext(); // Alternative; if you want to pass around the context manually

// Tracer will be a one to many relationship with instrumentation that use it (like express)
const tracer = new zipkin.Tracer({
  ctxImpl, // the in-process context
  recorder: new zipkin.ConsoleRecorder(), // For easy debugging. You probably want to use an actual implementation, like Kafka or AWS SQS.
  sampler: new zipkin.sampler.CountingSampler(0.01), // sample rate 0.01 will sample 1 % of all incoming requests
  traceId128Bit: true, // to generate 128-bit trace IDs. 64-bit (false) is default
  localServiceName: 'my-service' // indicates this node in your service graph
});

❝关于 async_hooks 及 CLS 的实现可参考下一章节

❞

如何从全链路日志中得益

当 sentry (警报系统) 中收到一条异常警报时，通过 requestId 可以在 elk (日志系统) 中获取到关于该异常的所有关键日志 (sql, redis, 关键函数的输入输出)
当客户端一条请求过慢时，通过请求头获取到的 requestId 可以在 elk 中分析该请求的所有数据库查询时间，请求响应时间，缓存是否命中等指标
查找 API 对应执行的 SQL 语句以及条数，判断是否有冗余 SQL 语句的查询

如下图，通过 requestId 涉及到的数据库查询的日志 (关于真实 SQL 做了隐藏处理)

从全链路分析各阶段耗时

以上只是完善了全链路式日志，如果要查看全链路各个阶段的耗时如何做？

此时就需要接入全链路式分析平台，如

zipkin[4]: Zipkin is a distributed tracing system. It helps gather timing data needed to troubleshoot latency problems in service architectures. Features include both the collection and lookup of this data.
jaeger[5]: open source, end-to-end distributed tracing

下一章讲解如何接入 zipkin 及相关代码与操作