HikariCP frequently logs "Failed to validate connection" and blocks application threads

Problem Description

In one of the Dubbo services in our distributed system, the following warning shows up frequently in the logs:

```
[WARN ]2018-07-06 19:30:49.581 [DubboServerHandler-10.206.56.200:20008-thread-195]PoolBase | HikariPool-1 - Failed to validate connection com.mysql.jdbc.JDBC4Connection@6a9ac783 (No operations allowed after connection closed.)
```

At the same time, the corresponding requests are blocked for long periods, anywhere from 5s to 20s.

Problem Analysis

First, download the [HikariCP source][1] from GitHub and search the code directly for the error message from the log:

```java
boolean isConnectionAlive(final Connection connection)
{
   try {
      try {
         setNetworkTimeout(connection, validationTimeout);

         final int validationSeconds = (int) Math.max(1000L, validationTimeout) / 1000;

         if (isUseJdbc4Validation) {
            return connection.isValid(validationSeconds);
         }

         try (Statement statement = connection.createStatement()) {
            if (isNetworkTimeoutSupported != TRUE) {
               setQueryTimeout(statement, validationSeconds);
            }

            statement.execute(config.getConnectionTestQuery());
         }
      }
      finally {
         setNetworkTimeout(connection, networkTimeout);

         if (isIsolateInternalQueries && !isAutoCommit) {
            connection.rollback();
         }
      }

      return true;
   }
   catch (Exception e) {
      lastConnectionFailure.set(e);
      LOGGER.warn("{} - Failed to validate connection {} ({}). Possibly consider using a shorter maxLifetime value.", poolName, connection, e.getMessage());
      return false;
   }
}
```

The code shows that the warning is logged from isConnectionAlive(), and the stack trace shows that the exception is thrown by setNetworkTimeout(). The logic inside that method is simple: it sets a network timeout on the current database connection, which effectively also checks whether the connection is still usable. So the root cause of the exception is that the database connection is already dead, yet it still sits in Hikari's connection pool.
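
To make the failure mode concrete, here is a minimal standalone JDBC sketch, not taken from the project: the URL and credentials are placeholders, and conn.close() merely simulates a connection that the database has already dropped on the server side. Calling setNetworkTimeout() on a connection that is no longer open fails with essentially the message seen in the log.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DeadConnectionSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection parameters, for illustration only.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/test", "user", "password");

        // Simulate the production situation: the physical connection is gone.
        // (In production the database server closes it after its idle timeout;
        // closing it locally is only an approximation of that state.)
        conn.close();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // This is the first thing HikariCP's isConnectionAlive() does.
            // On a closed MySQL connection it throws an SQLException such as
            // "No operations allowed after connection closed."
            conn.setNetworkTimeout(executor, 5000);
        } catch (SQLException e) {
            System.out.println("Failed to validate connection: " + e.getMessage());
        } finally {
            executor.shutdown();
        }
    }
}
```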

Now look at the method the application layer uses to obtain a connection:

```java
public Connection getConnection(final long hardTimeout) throws SQLException
{
   suspendResumeLock.acquire();
   final long startTime = currentTime();

   try {
      long timeout = hardTimeout;
      do {
         PoolEntry poolEntry = connectionBag.borrow(timeout, MILLISECONDS);
         if (poolEntry == null) {
            break; // We timed out... break and throw exception
         }

         final long now = currentTime();
         if (poolEntry.isMarkedEvicted() || (elapsedMillis(poolEntry.lastAccessed, now) > ALIVE_BYPASS_WINDOW_MS && !isConnectionAlive(poolEntry.connection))) {
            closeConnection(poolEntry, poolEntry.isMarkedEvicted() ? EVICTED_CONNECTION_MESSAGE : DEAD_CONNECTION_MESSAGE);
            timeout = hardTimeout - elapsedMillis(startTime);
         }
         else {
            metricsTracker.recordBorrowStats(poolEntry, startTime);
            return poolEntry.createProxyConnection(leakTaskFactory.schedule(poolEntry), now);
         }
      } while (timeout > 0L);

      metricsTracker.recordBorrowTimeoutStats(startTime);
      throw createTimeoutException(startTime);
   }
   catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new SQLException(poolName + " - Interrupted during connection acquisition", e);
   }
   finally {
      suspendResumeLock.release();
   }
}
```

The logic is straightforward: inside the while loop a connection is borrowed from the pool and checked with isConnectionAlive(), which sets a network timeout on it via setNetworkTimeout(). If the connection is dead, the calling thread blocks for up to one validationTimeout period before the connection is closed, and the loop then borrows the next connection from the pool. This repeats until a usable connection is found, or the timeout budget is exhausted and an exception is thrown. That explains why the blocking times we saw in the logs are multiples of 5s: they correspond to how many dead connections were borrowed in a row.
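
The 5-second unit matches HikariCP's default validationTimeout of 5000 ms: each dead connection found inside getConnection() costs the caller roughly one validationTimeout before it is discarded, so two, three, or four dead connections in a row give the observed 10s/15s/20s stalls. A rough sketch of the relevant settings (the JDBC URL and credentials are placeholders, and the values shown are simply the HikariCP defaults written out explicitly):

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolTimeoutsSketch {
    public static void main(String[] args) {
        HikariConfig config = new HikariConfig();
        // Hypothetical JDBC URL and credentials, for illustration only.
        config.setJdbcUrl("jdbc:mysql://localhost:3306/test");
        config.setUsername("user");
        config.setPassword("password");

        // Defaults written out: each dead connection found in getConnection()
        // can block the caller for up to validationTimeout.
        config.setValidationTimeout(5_000);   // 5 s per failed validation
        config.setConnectionTimeout(30_000);  // hard cap per getConnection() call

        // With these values, N dead connections in a row block the caller for
        // roughly N * 5 s, capped at 30 s, matching the observed 5s~20s stalls.
        try (HikariDataSource ds = new HikariDataSource(config)) {
            // use ds.getConnection() ...
        }
    }
}
```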

So why are there dead connections in the pool in the first place? Look at the method that creates connections:

```java
/**
 * Creating new poolEntry. If maxLifetime is configured, create a future End-of-life task with 2.5% variance from
 * the maxLifetime time to ensure there is no massive die-off of Connections in the pool.
 */
private PoolEntry createPoolEntry()
{
   try {
      final PoolEntry poolEntry = newPoolEntry();

      final long maxLifetime = config.getMaxLifetime();
      if (maxLifetime > 0) {
         // variance up to 2.5% of the maxlifetime
         final long variance = maxLifetime > 10_000 ? ThreadLocalRandom.current().nextLong( maxLifetime / 40 ) : 0;
         final long lifetime = maxLifetime - variance;
         poolEntry.setFutureEol(houseKeepingExecutorService.schedule(
            () -> {
               if (softEvictConnection(poolEntry, "(connection has passed maxLifetime)", false /* not owner */)) {
                  addBagItem(connectionBag.getWaitingThreadCount());
               }
            },
            lifetime, MILLISECONDS));
      }

      return poolEntry;
   }
   catch (Exception e) {
      if (poolState == POOL_NORMAL) { // we check POOL_NORMAL to avoid a flood of messages if shutdown() is running concurrently
         LOGGER.debug("{} - Cannot acquire connection from data source", poolName, (e instanceof ConnectionSetupException ? e.getCause() : e));
      }
      return null;
   }
}
```

Every connection that gets created also gets a scheduled task that calls softEvictConnection() to mark it as evicted, and the lifetime of that task is our configured maxLifetime minus a random variance of up to 2.5%. In other words, if the connection dies before that lifetime expires, Hikari does not proactively remove it from the pool. Hikari's default maxLifetime is 30 minutes, while MySQL's default connection timeout is 8 hours, so this problem normally never appears. However, this project uses the distributed database DDB, so I suspected DDB's connection timeout was shorter than 30 minutes. I asked the DBA, and sure enough the answer was 1000s, roughly 16 minutes.
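
Putting numbers to it: with the default maxLifetime of 30 min = 1,800,000 ms, the eviction task fires after maxLifetime minus an up-to-2.5% variance (up to 45 s), i.e. somewhere between roughly 29 min 15 s and 30 min. DDB closes the connection after 1000 s ≈ 16 min 40 s, so a connection can sit dead in the pool for around 13 minutes before Hikari evicts it, and every request that borrows it during that window pays the validation penalty described above.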

Solution

Set Hikari's maxLifetime to a value shorter than the database's connection timeout (remember to account for the random variance). dbcp has a similar setting: minEvictableIdleTimeMillis.
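
For example, with DDB's 1000 s timeout, a maxLifetime of 15 minutes leaves enough headroom even after the up-to-2.5% variance is subtracted. A minimal sketch (the JDBC URL and credentials are placeholders):

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class MaxLifetimeFix {
    // Builds a pool whose connections are retired before DDB can kill them.
    public static HikariDataSource buildDataSource() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mysql://localhost:3306/test");
        config.setUsername("user");
        config.setPassword("password");

        // DDB closes idle connections after 1000 s (~16.7 min).
        // 15 min = 900,000 ms keeps every connection's end-of-life, even after
        // the up-to-2.5% variance (~22.5 s) subtracted in createPoolEntry(),
        // well below the database-side timeout.
        config.setMaxLifetime(900_000);

        return new HikariDataSource(config);
    }
}
```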

[1]: https://github.com/brettwooldridge/HikariCP