GRPC丢包之路

发表于 2017-06-18 分类于 technology

GRPC是Google开源的一个高性能、跨语言的RPC框架，基于HTTP2协议，基于protobuf-3.x，基于Netty-4.x。

最近开发一个公司的游戏项目，采用分布式的服务器架构，内部服务之间均采用GRPC进行通信。但是在测试过程中发现一个重大的bug，偶现丢包。

前置信息：ServerA（作为grpc的client端）通过futureStub异步调用ServerB（作为server端）。

查看日志发现，client中有调用发送请求的日志，但server端却没有任何收到请求的日志。于是怀疑server在接收请求的过程中吃掉了异常，对整个过程加上try-catch，然而并没有任何发现，于是怀疑根本没有收到请求。通过tcpdump抓包后发现确实是client没发包到网络层，由此确定了是client端问题。

怀疑是grpc的futureStub有bug，于是在finally中加入future.get()打印出调用结果。结果一天下来一直没有发现任何丢包，”完美“。懵逼了，去掉finally中的future get以后又出现丢包。这个时候，直觉判断，由于调用futue.get()会block当前线程，所以不会出现丢包，而不block就会丢包。有种接近真相的感觉。检查代码后发现client端创建stub的时候没有指定线程池，grpc会默认使用当前线程，结合偶现的表现，觉得非常像是线程在异步发送的过程中被销毁导致。

然而，指定线程池以后，依旧发现丢包。无奈之下只有开始查看grpc源码，并打印出grpc的日志。在日志中发现了如下错误：

1
2
3

----------------OUTBOUND--------------------
[id: 0x7dee8b32, L:/10.82.7.14:60365 - R:/10.82.7.14:9001] RST_STREAM: streamId=4025, errorCode=8
------------------------------------

看到errorCode=8，并结合grpc源码中对io.grpc.Status的定义：

/**
 * Some resource has been exhausted, perhaps a per-user quota, or
 * perhaps the entire file system is out of space.
 */
RESOURCE_EXHAUSTED(8),

推断是由于系统文件句柄资源耗尽导致，查看服务器系统文件句柄限制为65535，而进程打开的限制远不止这么多。于是找sa将文件句柄限制改为100w，然后信心满满的重启了全部服务，重新进行测试。结果很快又发现了丢包，内心崩溃！

由于当future.get()方法阻塞线程时，就不会出现丢包，无法重现。于是想到把futureStub改为stub，在保持异步的同时，通过observer回调来观察调用结果,这次在回调的onError方法中发现了如下异常：

CANCELLED,
io.grpc.Status.asRuntimeException(Status.java:543), 
io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:395), 
io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426), 
io.grpc.internal.ClientCallImpl.access$100(ClientCallImpl.java:76), 
io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:512), 
io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$700(ClientCallImpl.java:429), 
io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:544), 
io.grpc.internal.ContextRunnable.run(ContextRunnable.java:52), 
io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:117), 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142), 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617), 
java.lang.Thread.run(Thread.java:745

终于发现原来是这个call被grpc在底层cancel掉了。查代码发现原来上面的errorCode=8是指http2的code定义：

/**
 * Mapped to call cancellation when sent by a client. 
 * Mapped to CANCELLED when sent by a server. 
 * Note that servers should only use this mechanism when they need to cancel a call but the payload byte sequence is incomplete.
 */
CANCEL(8)

知道了是由于被cancel才导致的丢包，那么就要查找为什么会被cancel，由于grpc的stub调用是通过异步的线性执行器执行io.grpc.ClientCall。所以堆栈中无法看到整个调用过程，只能通过源码来看。不管是哪种stub的调用，最后都会走到ClientCalls.startCall()方法中，然后进ClientCallImpl.start()中，在这个方法中，可以看到与cancel有关的代码：

if (context.isCancelled()) {
  // Context is already cancelled so no need to create a real stream, just notify the observer
  // of cancellation via callback on the executor
  stream = NoopClientStream.INSTANCE;
  class ClosedByContext extends ContextRunnable {
    ClosedByContext() {
      super(context);
    }

    @Override
    public void runInContext() {
      closeObserver(observer, statusFromCancelled(context), new Metadata());
    }
  }

  callExecutor.execute(new ClosedByContext());
  return;
}

可以看出来cancel的原因是由于context.isCancelled()的结果为true

public boolean isCancelled() {
  if (parent == null || !cascadesCancellation) {
    return false;
  } else {
    return parent.isCancelled();
  }
}

可以看出，如果parent==null，则不可能为true，所以必然是由于parent.isCancelled()为true导致的。由此我们基本可以猜测出来，当前的ClientCall被cancel的原因是由于它的parent context canceled导致。

回到ClientCallImpl中，可以看到this.context = Context.current();表明每个ClientCall被创建的时候，context都会被赋值为Context.current()

public static Context current() {
  Context current = storage().current();
  if (current == null) {
    return ROOT;
  }
  return current;
}

如果当前没有context，则用ROOT作为context

/**
 * The logical root context which is the ultimate ancestor of all contexts. This context
 * is not cancellable and so will not cascade cancellation or retain listeners.
 *
 * Never assume this is the default context for new threads, because {@link Storage} may define
 * a default context that is different from ROOT.
 */
public static final Context ROOT = new Context(null);

在注释中可以看出，ROOT是不可以被cancel的，所以可以判断，当前必然有一个grpc的context存在。

最后联想到由于业务代码中，ClientCall是在ServerA处理grpc请求的过程中执行的，那么这个Context必然是由GrpcServer处理请求的时候生成的，并赋值给了ClientCall。而由于ClientCall是线性执行器中异步执行，当真正执行ContextRunnable的时候，GrpcServer的context可能已经被cancel，导致当前ClientCall被cancel。

解决办法：可以对当前context进行fork，或者把每个call都封装为Runnable，通过Context提供的run静态方法执行。

至此，终于弄清楚了grpc丢包的原因。虽然，这个问题通过业务层代码可以避免掉，但是我觉得grpc在底层默默的把这件事做了，并且没有往上层抛出任何信息，这对开发者是非常不友好的。同时，这个cancel逻辑的处理上，我觉得也有待商榷。毕竟在GrpcServer中异步的去执行GrpcCall是一个非常普遍合理的模型。也许以后Google会改吧。