又踩了一个数据库连接池的坑,纪念一下。公司的分布式数据库由于更新驱动层,并且目前还只支持Druid连接池,因此项目的数据库连接池也从Hikari换到了Druid。然后换上没多久,测试服就开始频繁报警。
1 2 3 4 5 6 The last packet successfully received from the server was 2,357,232 milliseconds ago. The last packet sent successfully to the server was 931,359 milliseconds ago. CommunicationsException, druid version 1.1.10, jdbcUrl : jdbc:mysql:ddb://xx.xxx.xxx.xxx:6000/dbname, testWhileIdle true, idle millis 22216001, minIdle 20, poolingCount 19, timeBetweenEvictionRunsMillis 60000, lastValidIdleMillis 22216001, driver com.netease.lbd.LBDriver, exceptionSorter com.alibaba.druid.pool.vendor.MySqlExceptionSorter
这个错误的原因很明显,使用了一个已经失效的连接去做db操作,就会返回这个错误。也就是说db操作的时候从数据库连接池里拿到了一个烂的连接。然而首先我们是配置了time-between-eviction-runs-millis的,理论上应该是会定时对连接池中的连接做检查的,但是实际结果显然并没有。
我们直接开始检查druid源码吧,在核心类DruidDataSource中我们可以看到初始化连接池的时候会调用createAndStartDestroyThread()创建一个定时任务来执行DestroyTask,再看DestroyTask里面做了什么。
1 2 3 4 5 6 7 8 9 10 11 12 public class DestroyTask implements Runnable { @Override public void run () { shrink(true , keepAlive); if (isRemoveAbandoned()) { removeAbandoned(); } } }
其中isRemoveAbandoned()没有配置,默认是false的。所以,这个task就执行了一句话:shrink(true, keepAlive)
在shrink方法里面的最后我们可以看到有这段代码:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 if (keepAliveCount > 0 ) { this .getDataSourceStat().addKeepAliveCheckCount(keepAliveCount); for (int i = keepAliveCount - 1 ; i >= 0 ; --i) { DruidConnectionHolder holer = keepAliveConnections[i]; Connection connection = holer.getConnection(); holer.incrementKeepAliveCheckCount(); boolean validate = false ; try { this .validateConnection(connection); validate = true ; } catch (Throwable error) { if (LOG.isDebugEnabled()) { LOG.debug("keepAliveErr" , error); } } if (validate) { holer.lastActiveTimeMillis = System.currentTimeMillis(); put(holer); } else { JdbcUtils.close(connection); } } Arrays.fill(keepAliveConnections, null ); }
用于检查经过shrink以后留存下来的连接是否可用。其中的主要方法:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 public void validateConnection (Connection conn) throws SQLException { String query = getValidationQuery(); if (conn.isClosed()) { throw new SQLException ("validateConnection: connection closed" ); } if (validConnectionChecker != null ) { boolean result = true ; Exception error = null ; try { result = validConnectionChecker.isValidConnection(conn, validationQuery, validationQueryTimeout); } catch (SQLException ex) { throw ex; } catch (Exception ex) { error = ex; } if (!result) { SQLException sqlError = error != null ? new SQLException ("validateConnection false" , error) : new SQLException ("validateConnection false" ); throw sqlError; } return ; } if (null != query) { Statement stmt = null ; ResultSet rs = null ; try { stmt = conn.createStatement(); if (getValidationQueryTimeout() > 0 ) { stmt.setQueryTimeout(getValidationQueryTimeout()); } rs = stmt.executeQuery(query); if (!rs.next()) { throw new SQLException ("validationQuery didn't return a row" ); } } finally { JdbcUtils.close(rs); JdbcUtils.close(stmt); } } }
再跟进去到validConnectionChecker.isValidConnection(conn, validationQuery, validationQueryTimeout)方法。在这个方法里面可以看到
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 public boolean isValidConnection (Connection conn, String validateQuery, int validationQueryTimeout) throws Exception { if (conn.isClosed()) { return false ; } String query = validateQuery; if (validateQuery == null || validateQuery.isEmpty()) { query = DEFAULT_VALIDATION_QUERY; } Statement stmt = null ; ResultSet rs = null ; try { stmt = conn.createStatement(); if (validationQueryTimeout > 0 ) { stmt.setQueryTimeout(validationQueryTimeout); } rs = stmt.executeQuery(query); return true ; } finally { JdbcUtils.close(rs); JdbcUtils.close(stmt); } }
如果没有配置validationQuery,会默认使用 “SELECT 1”。于是我们认为不配置validationQuery是没关系的。然后没过多久,还是继续出现了上面说的异常,说明就是有烂的连接留存在连接池里。
重新再看一遍代码,注意到上面的validConnectionChecker有没有可能是null呢?于是找到validConnectionChecker的初始化方法:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 private void initValidConnectionChecker () { if (this .validConnectionChecker != null ) { return ; } String realDriverClassName = driver.getClass().getName(); if (JdbcUtils.isMySqlDriver(realDriverClassName)) { this .validConnectionChecker = new MySqlValidConnectionChecker (); } else if (realDriverClassName.equals(JdbcConstants.ORACLE_DRIVER) || realDriverClassName.equals(JdbcConstants.ORACLE_DRIVER2)) { this .validConnectionChecker = new OracleValidConnectionChecker (); } else if (realDriverClassName.equals(JdbcConstants.SQL_SERVER_DRIVER) || realDriverClassName.equals(JdbcConstants.SQL_SERVER_DRIVER_SQLJDBC4) || realDriverClassName.equals(JdbcConstants.SQL_SERVER_DRIVER_JTDS)) { this .validConnectionChecker = new MSSQLValidConnectionChecker (); } else if (realDriverClassName.equals(JdbcConstants.POSTGRESQL_DRIVER) || realDriverClassName.equals(JdbcConstants.ENTERPRISEDB_DRIVER)) { this .validConnectionChecker = new PGValidConnectionChecker (); } }
发现如果是数据库的标准驱动,那么validConnectionChecker就是对应数据库的checker,然后恰巧我们的项目中用的是自己实现的db driver。所以这个地方其实根本没有初始化成功。所以validConnectionChecker果然是个null。 因此上面的检查代码其实根本没有进到isValidConnection()方法中,所以默认的”SELECT 1”也不起作用了。 validateConnection()方法再往下可以看到if (null != query) 的条件下,druid还有一个兜底检查行为。 然而又恰好我们又没有配置validationQuery,所以最后兜底也没有兜到。
加上配置以后,第二天发现还是出现了一样的问题,说明连接池里还是有烂掉的连接。继续看源码,直接查看会调用validationConnection的核心方法shrink:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 public void shrink (boolean checkTime, boolean keepAlive) { try { lock.lockInterruptibly(); } catch (InterruptedException e) { return ; } int evictCount = 0 ; int keepAliveCount = 0 ; try { if (!inited) { return ; } final int checkCount = poolingCount - minIdle; final long currentTimeMillis = System.currentTimeMillis(); for (int i = 0 ; i < poolingCount; ++i) { DruidConnectionHolder connection = connections[i]; if (checkTime) { if (phyTimeoutMillis > 0 ) { long phyConnectTimeMillis = currentTimeMillis - connection.connectTimeMillis; if (phyConnectTimeMillis > phyTimeoutMillis) { evictConnections[evictCount++] = connection; continue ; } } long idleMillis = currentTimeMillis - connection.lastActiveTimeMillis; if (idleMillis < minEvictableIdleTimeMillis) { break ; } if (checkTime && i < checkCount) { evictConnections[evictCount++] = connection; } else if (idleMillis > maxEvictableIdleTimeMillis) { evictConnections[evictCount++] = connection; } else if (keepAlive) { keepAliveConnections[keepAliveCount++] = connection; } } else { if (i < checkCount) { evictConnections[evictCount++] = connection; } else { break ; } } } int removeCount = evictCount + keepAliveCount; if (removeCount > 0 ) { System.arraycopy(connections, removeCount, connections, 0 , poolingCount - removeCount); Arrays.fill(connections, poolingCount - removeCount, poolingCount, null ); poolingCount -= removeCount; } keepAliveCheckCount += keepAliveCount; } finally { lock.unlock(); } if (evictCount > 0 ) { for (int i = 0 ; i < evictCount; ++i) { DruidConnectionHolder item = evictConnections[i]; Connection connection = item.getConnection(); JdbcUtils.close(connection); destroyCountUpdater.incrementAndGet(this ); } Arrays.fill(evictConnections, null ); } if (keepAliveCount > 0 ) { this .getDataSourceStat().addKeepAliveCheckCount(keepAliveCount); for (int i = keepAliveCount - 1 ; i >= 0 ; --i) { DruidConnectionHolder holer = keepAliveConnections[i]; Connection connection = holer.getConnection(); holer.incrementKeepAliveCheckCount(); boolean validate = false ; try { this .validateConnection(connection); validate = true ; } catch (Throwable error) { if (LOG.isDebugEnabled()) { LOG.debug("keepAliveErr" , error); } } if (validate) { holer.lastActiveTimeMillis = System.currentTimeMillis(); put(holer); } else { JdbcUtils.close(connection); } } Arrays.fill(keepAliveConnections, null ); } }
从代码中可以看出来,keepAlive的默认值是false,这样就会导致每次shrink之后的keepAliveCount==0,也就进不到最后检查存活连接的逻辑中。并且会导致nimIdle的配置失效,连接池最后会被回收至只剩0个连接。所以,我们还差一下配置,keepAlive必须配置为true。
总结:用开源还是得多看源码。还有,druid的代码写的真的是乱,非常乱。对比Hikari的代码,代码可读性差的不是一点点。