HikariCP frequently logs "Failed to validate connection" and blocks application threads

Problem Description

In one of the Dubbo services in our distributed system, the following warning shows up frequently in the logs:

```
[WARN ]2018-07-06 19:30:49.581 [DubboServerHandler-10.206.56.200:20008-thread-195]PoolBase | HikariPool-1 - Failed to validate connection com.mysql.jdbc.JDBC4Connection@6a9ac783 (No operations allowed after connection closed.)
```

At the same time, the corresponding requests are blocked for long periods, anywhere from 5s to 20s.

Problem Analysis

First, download the [HikariCP source][1] from GitHub and search the code directly for the error message from the log:

```java
boolean isConnectionAlive(final Connection connection)
{
   try {
      try {
         setNetworkTimeout(connection, validationTimeout);

         final int validationSeconds = (int) Math.max(1000L, validationTimeout) / 1000;

         if (isUseJdbc4Validation) {
            return connection.isValid(validationSeconds);
         }

         try (Statement statement = connection.createStatement()) {
            if (isNetworkTimeoutSupported != TRUE) {
               setQueryTimeout(statement, validationSeconds);
            }

            statement.execute(config.getConnectionTestQuery());
         }
      }
      finally {
         setNetworkTimeout(connection, networkTimeout);

         if (isIsolateInternalQueries && !isAutoCommit) {
            connection.rollback();
         }
      }

      return true;
   }
   catch (Exception e) {
      lastConnectionFailure.set(e);
      LOGGER.warn("{} - Failed to validate connection {} ({}). Possibly consider using a shorter maxLifetime value.", poolName, connection, e.getMessage());
      return false;
   }
}
```

The code shows that the warning is logged from isConnectionAlive(), and the stack trace shows that the exception is thrown by setNetworkTimeout(). The logic inside that method is simple: it sets a network timeout on the current database connection, which effectively also checks whether the connection is still usable. So the root cause of the exception is that the database connection is already dead, yet it still sits in Hikari's connection pool.
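
To make the failure mode concrete, here is a minimal standalone JDBC sketch, not taken from the project: the URL and credentials are placeholders, and conn.close() merely simulates a connection that the database has already dropped on the server side. Calling setNetworkTimeout() on a connection that is no longer open fails with essentially the message seen in the log.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DeadConnectionSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection parameters, for illustration only.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/test", "user", "password");

        // Simulate the production situation: the physical connection is gone.
        // (In production the database server closes it after its idle timeout;
        // closing it locally is only an approximation of that state.)
        conn.close();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // This is the first thing HikariCP's isConnectionAlive() does.
            // On a closed MySQL connection it throws an SQLException such as
            // "No operations allowed after connection closed."
            conn.setNetworkTimeout(executor, 5000);
        } catch (SQLException e) {
            System.out.println("Failed to validate connection: " + e.getMessage());
        } finally {
            executor.shutdown();
        }
    }
}
```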

Now look at the method the application layer uses to obtain a connection:

```java
public Connection getConnection(final long hardTimeout) throws SQLException
{
   suspendResumeLock.acquire();
   final long startTime = currentTime();

   try {
      long timeout = hardTimeout;
      do {
         PoolEntry poolEntry = connectionBag.borrow(timeout, MILLISECONDS);
         if (poolEntry == null) {
            break; // We timed out... break and throw exception
         }

         final long now = currentTime();
         if (poolEntry.isMarkedEvicted() || (elapsedMillis(poolEntry.lastAccessed, now) > ALIVE_BYPASS_WINDOW_MS && !isConnectionAlive(poolEntry.connection))) {
            closeConnection(poolEntry, poolEntry.isMarkedEvicted() ? EVICTED_CONNECTION_MESSAGE : DEAD_CONNECTION_MESSAGE);
            timeout = hardTimeout - elapsedMillis(startTime);
         }
         else {
            metricsTracker.recordBorrowStats(poolEntry, startTime);
            return poolEntry.createProxyConnection(leakTaskFactory.schedule(poolEntry), now);
         }
      } while (timeout > 0L);

      metricsTracker.recordBorrowTimeoutStats(startTime);
      throw createTimeoutException(startTime);
   }
   catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new SQLException(poolName + " - Interrupted during connection acquisition", e);
   }
   finally {
      suspendResumeLock.release();
   }
}
```

The logic is straightforward: inside the while loop a connection is borrowed from the pool and checked with isConnectionAlive(), which sets a network timeout on it via setNetworkTimeout(). If the connection is dead, the calling thread blocks for up to one validationTimeout period before the connection is closed, and the loop then borrows the next connection from the pool. This repeats until a usable connection is found, or the timeout budget is exhausted and an exception is thrown. That explains why the blocking times we saw in the logs are multiples of 5s: they correspond to how many dead connections were borrowed in a row.
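
The 5-second unit matches HikariCP's default validationTimeout of 5000 ms: each dead connection found inside getConnection() costs the caller roughly one validationTimeout before it is discarded, so two, three, or four dead connections in a row give the observed 10s/15s/20s stalls. A rough sketch of the relevant settings (the JDBC URL and credentials are placeholders, and the values shown are simply the HikariCP defaults written out explicitly):

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolTimeoutsSketch {
    public static void main(String[] args) {
        HikariConfig config = new HikariConfig();
        // Hypothetical JDBC URL and credentials, for illustration only.
        config.setJdbcUrl("jdbc:mysql://localhost:3306/test");
        config.setUsername("user");
        config.setPassword("password");

        // Defaults written out: each dead connection found in getConnection()
        // can block the caller for up to validationTimeout.
        config.setValidationTimeout(5_000);   // 5 s per failed validation
        config.setConnectionTimeout(30_000);  // hard cap per getConnection() call

        // With these values, N dead connections in a row block the caller for
        // roughly N * 5 s, capped at 30 s, matching the observed 5s~20s stalls.
        try (HikariDataSource ds = new HikariDataSource(config)) {
            // use ds.getConnection() ...
        }
    }
}
```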

So why are there dead connections in the pool in the first place? Look at the method that creates connections:

```java
/**
 * Creating new poolEntry. If maxLifetime is configured, create a future End-of-life task with 2.5% variance from
 * the maxLifetime time to ensure there is no massive die-off of Connections in the pool.
 */
private PoolEntry createPoolEntry()
{
   try {
      final PoolEntry poolEntry = newPoolEntry();

      final long maxLifetime = config.getMaxLifetime();
      if (maxLifetime > 0) {
         // variance up to 2.5% of the maxlifetime
         final long variance = maxLifetime > 10_000 ? ThreadLocalRandom.current().nextLong( maxLifetime / 40 ) : 0;
         final long lifetime = maxLifetime - variance;
         poolEntry.setFutureEol(houseKeepingExecutorService.schedule(
            () -> {
               if (softEvictConnection(poolEntry, "(connection has passed maxLifetime)", false /* not owner */)) {
                  addBagItem(connectionBag.getWaitingThreadCount());
               }
            },
            lifetime, MILLISECONDS));
      }

      return poolEntry;
   }
   catch (Exception e) {
      if (poolState == POOL_NORMAL) { // we check POOL_NORMAL to avoid a flood of messages if shutdown() is running concurrently
         LOGGER.debug("{} - Cannot acquire connection from data source", poolName, (e instanceof ConnectionSetupException ? e.getCause() : e));
      }
      return null;
   }
}
```

Every connection that gets created also gets a scheduled task that calls softEvictConnection() to mark it as evicted, and the lifetime of that task is our configured maxLifetime minus a random variance of up to 2.5%. In other words, if the connection dies before that lifetime expires, Hikari does not proactively remove it from the pool. Hikari's default maxLifetime is 30 minutes, while MySQL's default connection timeout is 8 hours, so this problem normally never appears. However, this project uses the distributed database DDB, so I suspected DDB's connection timeout was shorter than 30 minutes. I asked the DBA, and sure enough the answer was 1000s, roughly 16 minutes.
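
Putting numbers to it: with the default maxLifetime of 30 min = 1,800,000 ms, the eviction task fires after maxLifetime minus an up-to-2.5% variance (up to 45 s), i.e. somewhere between roughly 29 min 15 s and 30 min. DDB closes the connection after 1000 s ≈ 16 min 40 s, so a connection can sit dead in the pool for around 13 minutes before Hikari evicts it, and every request that borrows it during that window pays the validation penalty described above.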

Solution

Set Hikari's maxLifetime to a value shorter than the database's connection timeout (remember to account for the random variance). dbcp has a similar setting: minEvictableIdleTimeMillis.
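
For example, with DDB's 1000 s timeout, a maxLifetime of 15 minutes leaves enough headroom even after the up-to-2.5% variance is subtracted. A minimal sketch (the JDBC URL and credentials are placeholders):

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class MaxLifetimeFix {
    // Builds a pool whose connections are retired before DDB can kill them.
    public static HikariDataSource buildDataSource() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mysql://localhost:3306/test");
        config.setUsername("user");
        config.setPassword("password");

        // DDB closes idle connections after 1000 s (~16.7 min).
        // 15 min = 900,000 ms keeps every connection's end-of-life, even after
        // the up-to-2.5% variance (~22.5 s) subtracted in createPoolEntry(),
        // well below the database-side timeout.
        config.setMaxLifetime(900_000);

        return new HikariDataSource(config);
    }
}
```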

[1]: https://github.com/brettwooldridge/HikariCP