Druid经常出现连接不可用的问题

又踩了一个数据库连接池的坑,纪念一下。公司的分布式数据库由于更新驱动层,并且目前还只支持Druid连接池,因此项目的数据库连接池也从Hikari换到了Druid。然后换上没多久,测试服就开始频繁报警。

1
2
3
4
5
6
The last packet successfully received from the server was 2,357,232 milliseconds ago.  
The last packet sent successfully to the server was 931,359 milliseconds ago.

CommunicationsException, druid version 1.1.10, jdbcUrl : jdbc:mysql:ddb://xx.xxx.xxx.xxx:6000/dbname,
testWhileIdle true, idle millis 22216001, minIdle 20, poolingCount 19, timeBetweenEvictionRunsMillis 60000,
lastValidIdleMillis 22216001, driver com.netease.lbd.LBDriver, exceptionSorter com.alibaba.druid.pool.vendor.MySqlExceptionSorter

这个错误的原因很明显,使用了一个已经失效的连接去做db操作,就会返回这个错误。也就是说db操作的时候从数据库连接池里拿到了一个烂的连接。然而首先我们是配置了time-between-eviction-runs-millis的,理论上应该是会定时对连接池中的连接做检查的,但是实际结果显然并没有。

我们直接开始检查druid源码吧,在核心类DruidDataSource中我们可以看到初始化连接池的时候会调用createAndStartDestroyThread()创建一个定时任务来执行DestroyTask,再看DestroyTask里面做了什么。

1
2
3
4
5
6
7
8
9
10
11
12
public class DestroyTask implements Runnable {

@Override
public void run() {
shrink(true, keepAlive);

if (isRemoveAbandoned()) {
removeAbandoned();
}
}

}

其中isRemoveAbandoned()没有配置,默认是false的。所以,这个task就执行了一句话:shrink(true, keepAlive)

在shrink方法里面的最后我们可以看到有这段代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
if (keepAliveCount > 0) {
this.getDataSourceStat().addKeepAliveCheckCount(keepAliveCount);
// keep order
for (int i = keepAliveCount - 1; i >= 0; --i) {
DruidConnectionHolder holer = keepAliveConnections[i];
Connection connection = holer.getConnection();
holer.incrementKeepAliveCheckCount();

boolean validate = false;
try {
this.validateConnection(connection); // 重点
validate = true;
} catch (Throwable error) {
if (LOG.isDebugEnabled()) {
LOG.debug("keepAliveErr", error);
}
// skip
}

if (validate) {
holer.lastActiveTimeMillis = System.currentTimeMillis();
put(holer);
} else {
JdbcUtils.close(connection);
}
}
Arrays.fill(keepAliveConnections, null);
}

用于检查经过shrink以后留存下来的连接是否可用。其中的主要方法:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
public void validateConnection(Connection conn) throws SQLException {
String query = getValidationQuery();
if (conn.isClosed()) {
throw new SQLException("validateConnection: connection closed");
}

if (validConnectionChecker != null) {
boolean result = true;
Exception error = null;
try {
result = validConnectionChecker.isValidConnection(conn, validationQuery, validationQueryTimeout);
} catch (SQLException ex) {
throw ex;
} catch (Exception ex) {
error = ex;
}

if (!result) {
SQLException sqlError = error != null ? //
new SQLException("validateConnection false", error) //
: new SQLException("validateConnection false");
throw sqlError;
}
return;
}

if (null != query) {
Statement stmt = null;
ResultSet rs = null;
try {
stmt = conn.createStatement();
if (getValidationQueryTimeout() > 0) {
stmt.setQueryTimeout(getValidationQueryTimeout());
}
rs = stmt.executeQuery(query);
if (!rs.next()) {
throw new SQLException("validationQuery didn't return a row");
}
} finally {
JdbcUtils.close(rs);
JdbcUtils.close(stmt);
}
}
}

再跟进去到validConnectionChecker.isValidConnection(conn, validationQuery, validationQueryTimeout)方法。在这个方法里面可以看到

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
public boolean isValidConnection(Connection conn, String validateQuery, int validationQueryTimeout) throws Exception {
if (conn.isClosed()) {
return false;
}

// 省略 。。。

String query = validateQuery;
if (validateQuery == null || validateQuery.isEmpty()) {
query = DEFAULT_VALIDATION_QUERY;
}

Statement stmt = null;
ResultSet rs = null;
try {
stmt = conn.createStatement();
if (validationQueryTimeout > 0) {
stmt.setQueryTimeout(validationQueryTimeout);
}
rs = stmt.executeQuery(query);
return true;
} finally {
JdbcUtils.close(rs);
JdbcUtils.close(stmt);
}

}

如果没有配置validationQuery,会默认使用 “SELECT 1”。于是我们认为不配置validationQuery是没关系的。然后没过多久,还是继续出现了上面说的异常,说明就是有烂的连接留存在连接池里。

重新再看一遍代码,注意到上面的validConnectionChecker有没有可能是null呢?于是找到validConnectionChecker的初始化方法:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
private void initValidConnectionChecker() {
if (this.validConnectionChecker != null) {
return;
}

String realDriverClassName = driver.getClass().getName();
if (JdbcUtils.isMySqlDriver(realDriverClassName)) {
this.validConnectionChecker = new MySqlValidConnectionChecker();

} else if (realDriverClassName.equals(JdbcConstants.ORACLE_DRIVER)
|| realDriverClassName.equals(JdbcConstants.ORACLE_DRIVER2)) {
this.validConnectionChecker = new OracleValidConnectionChecker();

} else if (realDriverClassName.equals(JdbcConstants.SQL_SERVER_DRIVER)
|| realDriverClassName.equals(JdbcConstants.SQL_SERVER_DRIVER_SQLJDBC4)
|| realDriverClassName.equals(JdbcConstants.SQL_SERVER_DRIVER_JTDS)) {
this.validConnectionChecker = new MSSQLValidConnectionChecker();

} else if (realDriverClassName.equals(JdbcConstants.POSTGRESQL_DRIVER)
|| realDriverClassName.equals(JdbcConstants.ENTERPRISEDB_DRIVER)) {
this.validConnectionChecker = new PGValidConnectionChecker();
}
}

发现如果是数据库的标准驱动,那么validConnectionChecker就是对应数据库的checker,然后恰巧我们的项目中用的是自己实现的db driver。所以这个地方其实根本没有初始化成功。所以validConnectionChecker果然是个null。
因此上面的检查代码其实根本没有进到isValidConnection()方法中,所以默认的”SELECT 1”也不起作用了。
validateConnection()方法再往下可以看到if (null != query) 的条件下,druid还有一个兜底检查行为。
然而又恰好我们又没有配置validationQuery,所以最后兜底也没有兜到。

加上配置以后,第二天发现还是出现了一样的问题,说明连接池里还是有烂掉的连接。继续看源码,直接查看会调用validationConnection的核心方法shrink:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
public void shrink(boolean checkTime, boolean keepAlive) {
try {
lock.lockInterruptibly();
} catch (InterruptedException e) {
return;
}

int evictCount = 0;
int keepAliveCount = 0;
try {
if (!inited) {
return;
}

final int checkCount = poolingCount - minIdle;
final long currentTimeMillis = System.currentTimeMillis();
for (int i = 0; i < poolingCount; ++i) {
DruidConnectionHolder connection = connections[i];

if (checkTime) {
if (phyTimeoutMillis > 0) {
long phyConnectTimeMillis = currentTimeMillis - connection.connectTimeMillis;
if (phyConnectTimeMillis > phyTimeoutMillis) {
evictConnections[evictCount++] = connection;
continue;
}
}

long idleMillis = currentTimeMillis - connection.lastActiveTimeMillis;

if (idleMillis < minEvictableIdleTimeMillis) {
break;
}

if (checkTime && i < checkCount) {
evictConnections[evictCount++] = connection;
} else if (idleMillis > maxEvictableIdleTimeMillis) {
evictConnections[evictCount++] = connection;
} else if (keepAlive) {
keepAliveConnections[keepAliveCount++] = connection;
}
} else {
if (i < checkCount) {
evictConnections[evictCount++] = connection;
} else {
break;
}
}
}

int removeCount = evictCount + keepAliveCount;
if (removeCount > 0) {
System.arraycopy(connections, removeCount, connections, 0, poolingCount - removeCount);
Arrays.fill(connections, poolingCount - removeCount, poolingCount, null);
poolingCount -= removeCount;
}
keepAliveCheckCount += keepAliveCount;
} finally {
lock.unlock();
}

if (evictCount > 0) {
for (int i = 0; i < evictCount; ++i) {
DruidConnectionHolder item = evictConnections[i];
Connection connection = item.getConnection();
JdbcUtils.close(connection);
destroyCountUpdater.incrementAndGet(this);
}
Arrays.fill(evictConnections, null);
}

if (keepAliveCount > 0) {
this.getDataSourceStat().addKeepAliveCheckCount(keepAliveCount);
// keep order
for (int i = keepAliveCount - 1; i >= 0; --i) {
DruidConnectionHolder holer = keepAliveConnections[i];
Connection connection = holer.getConnection();
holer.incrementKeepAliveCheckCount();

boolean validate = false;
try {
this.validateConnection(connection);
validate = true;
} catch (Throwable error) {
if (LOG.isDebugEnabled()) {
LOG.debug("keepAliveErr", error);
}
// skip
}

if (validate) {
holer.lastActiveTimeMillis = System.currentTimeMillis();
put(holer);
} else {
JdbcUtils.close(connection);
}
}
Arrays.fill(keepAliveConnections, null);
}
}

从代码中可以看出来,keepAlive的默认值是false,这样就会导致每次shrink之后的keepAliveCount==0,也就进不到最后检查存活连接的逻辑中。并且会导致nimIdle的配置失效,连接池最后会被回收至只剩0个连接。所以,我们还差一下配置,keepAlive必须配置为true。

总结:用开源还是得多看源码。还有,druid的代码写的真的是乱,非常乱。对比Hikari的代码,代码可读性差的不是一点点。