[pgcluster: 1045] SbyLAN障害時のINSERTでロックが発生する件

村上智一 s1puser24 @ gmail.com
2010年 10月 16日 (土) 13:06:26 JST


はじめまして。村上と申します。

Linuxサーバでpgclusterを使い冗長構成をとっているのですが、Sby系の
LAN抜線後にDB更新すると応答が返ってこない状態になります。
解決方法についてご教示いただければ幸いです。


■問題内容

PgClusterバージョン:pgcluster-1.9.0rc6

以下の構成で、PGCluster2 サーバ にて
LAN抜線による障害を発生させたときに、
JAVAアプリケーションから
PGCluster1 サーバへの、JDBC経由のINSERTを行うと、
応答が返ってこない状況になります。

※常に発生するわけではなく、
抜線直後にINSERTすると発生する確率が高いようです。

※この事象が発生すると、
そのときに書込みを行っていたテーブルへのその後のアクセスが不能になります。
(他のテーブルへのアクセスは問題なくできます。)

             _________
    _________| switch |________こちらのLANを抜線
    |        |________|        |
 _____|_____                _____|_____
 |PGCluster1|               |PGCluster2|
 |JAVA App  |               |          |
 |__________|               |__________|



JAVA AppにてPreparedStatementを使用しているため、
"postgresql.conf" の "not_replicate_prepared_select = true"で
試験しています。



以下が設定とログになりますが、
何か分かることがありましたらご教授下さい。

### 構成 ###

PGCluster1、PGCluster2共に、
pglb、cluster、pgreplicateで構成しており、

PGCluster1 の cluster には、
 ・PGCluster1 の pgreplicate
 ・PGCluster2 の pgreplicate
をレプリケーターとして設定し、
PGCluster2 の cluster にも、
 ・PGCluster1 の pgreplicate
 ・PGCluster2 の pgreplicate
をレプリケーターとして設定しています。

PGCluster1 の pgreplicate には、
 ・PGCluster1 の cluster
 ・PGCluster2 の cluster
をクラスターとして設定し、
PGCluster2 の pgreplicate には、
 ・PGCluster1 の cluster
 ・PGCluster2 の cluster
をクラスターとして設定しています。



### ログ ###

・PGCluster1
OS:RHEL5.4
ホスト名:TESTSV
IP:10.100.10.151

・PGCluster2
OS:RHEL5.4
ホスト名:TESTSV-BAK
IP:10.100.10.152

※PGCluster1 の pglbにはAliasIPとして、10.100.10.51を割り当てています

pgreplicate.log 抜粋

処理開始
Thu Oct  7 16:59:00 2010  PGRdo_replicate():query ::
Thu Oct  7 16:59:00 2010  cmdSts=P
Thu Oct  7 16:59:00 2010  cmdType=P
Thu Oct  7 16:59:00 2010  rlog=0
Thu Oct  7 16:59:00 2010  port=5431
Thu Oct  7 16:59:00 2010  pid=26803
Thu Oct  7 16:59:00 2010  from_host=10.100.10.151
Thu Oct  7 16:59:00 2010  dbName=nxsexpdb
Thu Oct  7 16:59:00 2010  userName=nextgen
Thu Oct  7 16:59:00 2010  recieve sec=1286438340
Thu Oct  7 16:59:00 2010  recieve usec=258290
Thu Oct  7 16:59:00 2010  query_size=876
Thu Oct  7 16:59:00 2010  request_id=11
Thu Oct  7 16:59:00 2010  replicate_id=0
Thu Oct  7 16:59:00 2010  recovery_status=0
Thu Oct  7 16:59:00 2010  sem_lock [1] req
Thu Oct  7 16:59:00 2010  sem_lock [1] got it
Thu Oct  7 16:59:00 2010  source host
Thu Oct  7 16:59:00 2010  start thread_send_cluster()
Thu Oct  7 16:59:00 2010  thread_send_cluster():return value from
send_replicate_packet_to_server() is 0
Thu Oct  7 16:59:00 2010  thread_send_cluster():pthread_exit[1]
Thu Oct  7 16:59:00 2010  end thread_send_source()
Thu Oct  7 16:59:00 2010  sem_unlock[1]
Thu Oct  7 16:59:00 2010  PGRdo_replicate():query ::
Thu Oct  7 16:59:00 2010  cmdSts=P
Thu Oct  7 16:59:00 2010  cmdType=B
Thu Oct  7 16:59:00 2010  rlog=0
Thu Oct  7 16:59:00 2010  port=5431
Thu Oct  7 16:59:00 2010  pid=26803
Thu Oct  7 16:59:00 2010  from_host=10.100.10.151
Thu Oct  7 16:59:00 2010  dbName=nxsexpdb
Thu Oct  7 16:59:00 2010  userName=nextgen
Thu Oct  7 16:59:00 2010  recieve sec=1286438340
Thu Oct  7 16:59:00 2010  recieve usec=259728
Thu Oct  7 16:59:00 2010  query_size=207
Thu Oct  7 16:59:00 2010  request_id=12
Thu Oct  7 16:59:00 2010  replicate_id=0
Thu Oct  7 16:59:00 2010  recovery_status=0
Thu Oct  7 16:59:00 2010  sem_lock [1] req
Thu Oct  7 16:59:00 2010  sem_lock [1] got it
Thu Oct  7 16:59:00 2010  source host
Thu Oct  7 16:59:00 2010  start thread_send_cluster()
Thu Oct  7 16:59:00 2010  thread_send_cluster():return value from
send_replicate_packet_to_server() is 0
Thu Oct  7 16:59:00 2010  thread_send_cluster():pthread_exit[1]
Thu Oct  7 16:59:00 2010  end thread_send_source()
Thu Oct  7 16:59:00 2010  sem_unlock[1]
Thu Oct  7 16:59:00 2010  PGRdo_replicate():query :: P
Thu Oct  7 16:59:00 2010  cmdSts=P
Thu Oct  7 16:59:00 2010  cmdType=D
Thu Oct  7 16:59:00 2010  rlog=0
Thu Oct  7 16:59:00 2010  port=5431
Thu Oct  7 16:59:00 2010  pid=26803
Thu Oct  7 16:59:00 2010  from_host=10.100.10.151
Thu Oct  7 16:59:00 2010  dbName=nxsexpdb
Thu Oct  7 16:59:00 2010  userName=nextgen
Thu Oct  7 16:59:00 2010  recieve sec=1286438340
Thu Oct  7 16:59:00 2010  recieve usec=260864
Thu Oct  7 16:59:00 2010  query_size=3
Thu Oct  7 16:59:00 2010  request_id=13
Thu Oct  7 16:59:00 2010  replicate_id=0
Thu Oct  7 16:59:00 2010  recovery_status=0
Thu Oct  7 16:59:00 2010  sem_lock [1] req
Thu Oct  7 16:59:00 2010  sem_lock [1] got it
Thu Oct  7 16:59:00 2010  source host
Thu Oct  7 16:59:00 2010  start thread_send_cluster()
Thu Oct  7 16:59:00 2010  thread_send_cluster():return value from
send_replicate_packet_to_server() is 0
Thu Oct  7 16:59:00 2010  thread_send_cluster():pthread_exit[1]
Thu Oct  7 16:59:00 2010  end thread_send_source()
Thu Oct  7 16:59:00 2010  sem_unlock[1]
Thu Oct  7 16:59:00 2010  PGRdo_replicate():query ::
Thu Oct  7 16:59:00 2010  cmdSts=P
Thu Oct  7 16:59:00 2010  cmdType=E
Thu Oct  7 16:59:00 2010  rlog=0
Thu Oct  7 16:59:00 2010  port=5431
Thu Oct  7 16:59:00 2010  pid=26803
Thu Oct  7 16:59:00 2010  from_host=10.100.10.151
Thu Oct  7 16:59:00 2010  dbName=nxsexpdb
Thu Oct  7 16:59:00 2010  userName=nextgen
Thu Oct  7 16:59:00 2010  recieve sec=1286438340
Thu Oct  7 16:59:00 2010  recieve usec=261862
Thu Oct  7 16:59:00 2010  query_size=6
Thu Oct  7 16:59:00 2010  request_id=14
Thu Oct  7 16:59:00 2010  replicate_id=0
Thu Oct  7 16:59:00 2010  recovery_status=0
Thu Oct  7 16:59:00 2010  sem_lock [1] req
Thu Oct  7 16:59:00 2010  sem_lock [1] got it
Thu Oct  7 16:59:00 2010  source host
Thu Oct  7 16:59:00 2010  start thread_send_cluster()
Thu Oct  7 16:59:00 2010  thread_send_cluster():return value from
send_replicate_packet_to_server() is 0
Thu Oct  7 16:59:00 2010  thread_send_cluster():pthread_exit[1]
Thu Oct  7 16:59:00 2010  end thread_send_source()
Thu Oct  7 16:59:00 2010  sem_unlock[1]
Thu Oct  7 16:59:00 2010  PGRdo_replicate():query ::
Thu Oct  7 16:59:00 2010  cmdSts=P
Thu Oct  7 16:59:00 2010  cmdType=S
Thu Oct  7 16:59:00 2010  rlog=0
Thu Oct  7 16:59:00 2010  port=5431
Thu Oct  7 16:59:00 2010  pid=26803
Thu Oct  7 16:59:00 2010  from_host=10.100.10.151
Thu Oct  7 16:59:00 2010  dbName=nxsexpdb
Thu Oct  7 16:59:00 2010  userName=nextgen
Thu Oct  7 16:59:00 2010  recieve sec=1286438340
Thu Oct  7 16:59:00 2010  recieve usec=263037
Thu Oct  7 16:59:00 2010  query_size=1
Thu Oct  7 16:59:00 2010  request_id=15
Thu Oct  7 16:59:00 2010  replicate_id=0
Thu Oct  7 16:59:00 2010  recovery_status=0
Thu Oct  7 16:59:00 2010  sem_lock [1] req
Thu Oct  7 16:59:00 2010  sem_lock [1] got it
Thu Oct  7 16:59:00 2010  source host
Thu Oct  7 16:59:00 2010  start thread_send_cluster()       →LAN抜線障害発生
Thu Oct  7 16:59:05 2010  PGRcreateConn(): timeout was occured
Thu Oct  7 16:59:23 2010  PGRcreateConn(): timeout was occured
Thu Oct  7 16:59:41 2010  PGRcreateConn(): timeout was occured
Thu Oct  7 16:59:41 2010
PGRsend_load_balance_packet():host[10.100.10.51] port[6001]
→sem_unlockが呼ばれない
Thu Oct  7 17:00:01 2010  PGRdo_replicate():query ::
Thu Oct  7 17:00:01 2010  cmdSts=P
Thu Oct  7 17:00:01 2010  PGRwrite_rlog():delete_query_log
Thu Oct  7 17:00:01 2010  cmdType=S
Thu Oct  7 17:00:01 2010  rlog=3
Thu Oct  7 17:00:01 2010  port=5431
Thu Oct  7 17:00:01 2010  pid=26803
Thu Oct  7 17:00:01 2010  from_host=10.100.10.151
Thu Oct  7 17:00:01 2010  dbName=nxsexpdb
Thu Oct  7 17:00:01 2010  userName=nextgen
Thu Oct  7 17:00:01 2010  recieve sec=1286438401
Thu Oct  7 17:00:01 2010  recieve usec=269332
Thu Oct  7 17:00:01 2010  PGRdo_replicate():query ::
Thu Oct  7 17:00:01 2010  query_size=0
Thu Oct  7 17:00:01 2010  cmdSts=P
Thu Oct  7 17:00:01 2010  request_id=15
Thu Oct  7 17:00:01 2010  cmdType=P
Thu Oct  7 17:00:01 2010  replicate_id=0
Thu Oct  7 17:00:01 2010  rlog=0
Thu Oct  7 17:00:01 2010  recovery_status=0
Thu Oct  7 17:00:01 2010  port=5431
Thu Oct  7 17:00:01 2010  sem_lock [1] req
Thu Oct  7 17:00:01 2010  pid=28248
Thu Oct  7 17:00:01 2010  from_host=10.100.10.151
Thu Oct  7 17:00:01 2010  dbName=nxsexpdb
Thu Oct  7 17:00:01 2010  userName=nextgen
Thu Oct  7 17:00:01 2010  recieve sec=1286438401
Thu Oct  7 17:00:01 2010  recieve usec=269934
Thu Oct  7 17:00:01 2010  query_size=876
Thu Oct  7 17:00:01 2010  request_id=1
Thu Oct  7 17:00:01 2010  replicate_id=0
Thu Oct  7 17:00:01 2010  recovery_status=0
Thu Oct  7 17:00:01 2010  sem_lock [1] req
→lockが取得できずに固まる
Thu Oct  7 17:01:01 2010  PGRdo_replicate():query ::
Thu Oct  7 17:01:01 2010  PGRwrite_rlog():delete_query_log
Thu Oct  7 17:01:01 2010  cmdSts=P
Thu Oct  7 17:01:01 2010  cmdType=P
Thu Oct  7 17:01:01 2010  rlog=3
Thu Oct  7 17:01:01 2010  port=5431
Thu Oct  7 17:01:01 2010  pid=28248
Thu Oct  7 17:01:01 2010  from_host=10.100.10.151
Thu Oct  7 17:01:01 2010  dbName=nxsexpdb
Thu Oct  7 17:01:01 2010  userName=nextgen
Thu Oct  7 17:01:01 2010  recieve sec=1286438461
Thu Oct  7 17:01:01 2010  recieve usec=264844
Thu Oct  7 17:01:01 2010  query_size=0
Thu Oct  7 17:01:01 2010  request_id=1
Thu Oct  7 17:01:01 2010  replicate_id=0
Thu Oct  7 17:01:01 2010  recovery_status=0
Thu Oct  7 17:01:01 2010  sem_lock [1] req
→lockが取得できずに固まる
Thu Oct  7 17:01:04 2010  PGRdo_replicate():query ::
Thu Oct  7 17:01:04 2010  PGRwrite_rlog():delete_query_log
Thu Oct  7 17:01:04 2010  cmdSts=P
Thu Oct  7 17:01:04 2010  cmdType=S
Thu Oct  7 17:01:04 2010  rlog=3
Thu Oct  7 17:01:04 2010  port=5431
Thu Oct  7 17:01:04 2010  pid=26803
Thu Oct  7 17:01:04 2010  from_host=10.100.10.151
Thu Oct  7 17:01:04 2010  dbName=nxsexpdb
Thu Oct  7 17:01:04 2010  userName=nextgen
Thu Oct  7 17:01:04 2010  recieve sec=1286438464
Thu Oct  7 17:01:04 2010  recieve usec=267610
Thu Oct  7 17:01:04 2010  query_size=0
Thu Oct  7 17:01:04 2010  request_id=15
Thu Oct  7 17:01:04 2010  replicate_id=0
Thu Oct  7 17:01:04 2010  recovery_status=0
Thu Oct  7 17:01:04 2010  sem_lock [1] req
→lockが取得できずに固まる
Thu Oct  7 17:02:01 2010  PGRdo_replicate():query ::
Thu Oct  7 17:02:01 2010  PGRwrite_rlog():delete_query_log
Thu Oct  7 17:02:01 2010  cmdSts=P
Thu Oct  7 17:02:01 2010  cmdType=P
Thu Oct  7 17:02:01 2010  rlog=3
Thu Oct  7 17:02:01 2010  port=5431
Thu Oct  7 17:02:01 2010  pid=28248
Thu Oct  7 17:02:01 2010  from_host=10.100.10.151
Thu Oct  7 17:02:01 2010  dbName=nxsexpdb
Thu Oct  7 17:02:01 2010  userName=nextgen
Thu Oct  7 17:02:01 2010  recieve sec=1286438521
Thu Oct  7 17:02:01 2010  recieve usec=260350
Thu Oct  7 17:02:01 2010  query_size=0
Thu Oct  7 17:02:01 2010  request_id=1
Thu Oct  7 17:02:01 2010  replicate_id=0
Thu Oct  7 17:02:01 2010  recovery_status=0
Thu Oct  7 17:02:01 2010  sem_lock [1] req
→lockが取得できずに固まる
Thu Oct  7 17:02:04 2010  PGRdo_replicate():query ::
Thu Oct  7 17:02:04 2010  PGRwrite_rlog():delete_query_log
Thu Oct  7 17:02:04 2010  cmdSts=P
Thu Oct  7 17:02:04 2010  cmdType=S
Thu Oct  7 17:02:04 2010  rlog=3
Thu Oct  7 17:02:04 2010  port=5431
Thu Oct  7 17:02:04 2010  pid=26803
Thu Oct  7 17:02:04 2010  from_host=10.100.10.151
Thu Oct  7 17:02:04 2010  dbName=nxsexpdb
Thu Oct  7 17:02:04 2010  userName=nextgen
Thu Oct  7 17:02:04 2010  recieve sec=1286438524
Thu Oct  7 17:02:04 2010  recieve usec=269091
Thu Oct  7 17:02:04 2010  query_size=0
Thu Oct  7 17:02:04 2010  request_id=15
Thu Oct  7 17:02:04 2010  replicate_id=0
Thu Oct  7 17:02:04 2010  recovery_status=0
Thu Oct  7 17:02:04 2010  sem_lock [1] req
→lockが取得できずに固まる
Thu Oct  7 17:03:01 2010  PGRdo_replicate():query ::
Thu Oct  7 17:03:01 2010  PGRwrite_rlog():delete_query_log
Thu Oct  7 17:03:01 2010  cmdSts=P
Thu Oct  7 17:03:01 2010  cmdType=P
Thu Oct  7 17:03:01 2010  rlog=3
Thu Oct  7 17:03:01 2010  port=5431
Thu Oct  7 17:03:01 2010  pid=28248
Thu Oct  7 17:03:01 2010  from_host=10.100.10.151
Thu Oct  7 17:03:01 2010  dbName=nxsexpdb
Thu Oct  7 17:03:01 2010  userName=nextgen
Thu Oct  7 17:03:01 2010  recieve sec=1286438581
Thu Oct  7 17:03:01 2010  recieve usec=255863
Thu Oct  7 17:03:01 2010  query_size=0
Thu Oct  7 17:03:01 2010  request_id=1
Thu Oct  7 17:03:01 2010  replicate_id=0
Thu Oct  7 17:03:01 2010  recovery_status=0
Thu Oct  7 17:03:01 2010  sem_lock [1] req
→lockが取得できずに固まる


■コンフィグ

cluster.conf、pglb.conf、pgreplicate.confの設定内容を記載します。
設定は0系(基本的にactで動かす系)のものを記載しています。
(1系はホストの設定が0系と逆になっているだけで後は同じ設定です)


●cluster.conf
#============================================================
#          Cluster DB Server configuration file
#------------------------------------------------------------
# file: cluster.conf
#------------------------------------------------------------
<Replicate_Server_Info>
      <Host_Name>     TESTSV          </Host_Name>
      <Port>                  8001                            </Port>
      <Recovery_Port>         8101                            </Recovery_Port>
</Replicate_Server_Info>
<Replicate_Server_Info>
      <Host_Name>     TESTSV-BAK              </Host_Name>
      <Port>                  8001                            </Port>
      <Recovery_Port>         8101                            </Recovery_Port>
</Replicate_Server_Info>
#-------------------------------------------------------------
<Host_Name>                     TESTSV          </Host_Name>
<Recovery_Port>                 7001                    </Recovery_Port>
<Rsync_Path>                    /usr/bin/rsync                  </Rsync_Path>
<Rsync_Option>                  ssh -2                  </Rsync_Option>
<Rsync_Compress>                yes
</Rsync_Compress>
<Rsync_Timeout>                 1min                    </Rsync_Timeout>
#<Rsync_Timeout>                        10min                   </Rsync_Timeout>
<Rsync_Bwlimit>                 0KB                             </Rsync_Bwlimit>
<Pg_Dump_Path>                  /usr/local/pgsql/bin/pg_dump    </Pg_Dump_Path>
<Ping_Path>                             /bin/ping               </Ping_Path>
<When_Stand_Alone>              read_only               </When_Stand_Alone>
<Replication_Timeout>   1min            </Replication_Timeout>
<LifeCheck_Timeout>             3s                      </LifeCheck_Timeout>
<LifeCheck_Interval>    11s                     </LifeCheck_Interval>
#-------------------------------------------------------------
<Not_Replicate_Info>
      <DB_Name>               *               </DB_Name>
      <Table_Name>    *               </Table_Name>
      <Query> begin; select getdatabaseencoding(); commit </Query>
</Not_Replicate_Info>
<Not_Replicate_Info>
      <DB_Name>               *               </DB_Name>
      <Table_Name>    *               </Table_Name>
      <Query> BEGIN; SELECT usesuper FROM pg_catalog.pg_user WHERE usename
= '%s'; COMMIT </Query>
</Not_Replicate_Info>
<Not_Replicate_Info>
      <DB_Name>               *               </DB_Name>
      <Table_Name>    *               </Table_Name>
      <Query> SET autocommit TO 'on' </Query>
</Not_Replicate_Info>
<Not_Replicate_Info>
      <DB_Name>               *               </DB_Name>
      <Table_Name>    *               </Table_Name>
      <Query> SET search_path = public </Query>
</Not_Replicate_Info>


●pglb.conf
#============================================================
#          Load Balance Server configuration file
#-------------------------------------------------------------
# file: pglb.conf
#-------------------------------------------------------------
<Cluster_Server_Info>
  <Host_Name>         TESTSV          </Host_Name>
  <Port>              5431            </Port>
  <Max_Connect>       32              </Max_Connect>
</Cluster_Server_Info>
#-------------------------------------------------------------
<Host_Name>     0.0.0.0         </Host_Name>
<Backend_Socket_Dir>            /tmp            </Backend_Socket_Dir>
<Receive_Port>                          5432            </Receive_Port>
<Recovery_Port>                         6001            </Recovery_Port>
<Max_Cluster_Num>                       128
</Max_Cluster_Num>
<Use_Connection_Pooling>        yes
</Use_Connection_Pooling>
<LifeCheck_Timeout>                     3s
</LifeCheck_Timeout>
<LifeCheck_Interval>            15s                     </LifeCheck_Interval>
<Connection_Life_Time>          0s                      </Connection_Life_Time>
#-------------------------------------------------------------
<Log_File_Name>         /var/log/pgsql/pglb.log </Log_File_Name>
<Log_File_Size>         10K                     </Log_File_Size>
<Log_Rotate>            3                       </Log_Rotate>

●pgreplicate.conf

#=============================================================
#  PGReplicate configuration file
#-------------------------------------------------------------
# file: pgreplicate.conf
#-------------------------------------------------------------
#-------------------------------------------------------------
<Cluster_Server_Info>
  <Host_Name>         TESTSV          </Host_Name>
  <Port>              5431            </Port>
  <Recovery_Port>     7001            </Recovery_Port>
</Cluster_Server_Info>
<Cluster_Server_Info>
  <Host_Name>         TESTSV-BAK              </Host_Name>
  <Port>              5431            </Port>
  <Recovery_Port>     7001            </Recovery_Port>
</Cluster_Server_Info>
#<Cluster_Server_Info>
#-------------------------------------------------------------
#-------------------------------------------------------------
<LoadBalance_Server_Info>
      <Host_Name>             TESTSV-FIP      </Host_Name>
      <Recovery_Port>         6001    </Recovery_Port>
</LoadBalance_Server_Info>
#-------------------------------------------------------------
<Host_Name>     TESTSV          </Host_Name>
<Replication_Port>      8001            </Replication_Port>
<Recovery_Port>         8101            </Recovery_Port>
<RLOG_Port>                             8301            </RLOG_Port>
#<Use_Replication_Log>  no                      </Use_Replication_Log>
<Use_Replication_Log>   yes                     </Use_Replication_Log>
<Replication_Timeout>   1min            </Replication_Timeout>
<LifeCheck_Timeout>             3s                      </LifeCheck_Timeout>
<LifeCheck_Interval>    15s                     </LifeCheck_Interval>
#-------------------------------------------------------------
<Log_File_Name>         /var/log/pgsql/pgreplicate.log  </Log_File_Name>
<Log_File_Size>         100K                            </Log_File_Size>
<Log_Rotate>            3                               </Log_Rotate>



pgcluster メーリングリストの案内