[pgcluster: 1045] SbyLAN障害時のINSERTでロックが発生する件
村上智一
s1puser24 @ gmail.com
2010年 10月 16日 (土) 13:06:26 JST
はじめまして。村上と申します。
Linuxサーバでpgclusterを使い冗長構成をとっているのですが、Sby系の
LAN抜線後にDB更新すると応答が返ってこない状態になります。
解決方法についてご教示いただければ幸いです。
■問題内容
PgClusterバージョン:pgcluster-1.9.0rc6
以下の構成で、PGCluster2 サーバ にて
LAN抜線による障害を発生させたときに、
JAVAアプリケーションから
PGCluster1 サーバへの、JDBC経由のINSERTを行うと、
応答が返ってこない状況になります。
※常に発生するわけではなく、
抜線直後にINSERTすると発生する確率が高いようです。
※この事象が発生すると、
そのときに書込みを行っていたテーブルへのその後のアクセスが不能になります。
(他のテーブルへのアクセスは問題なくできます。)
_________
_________| switch |________こちらのLANを抜線
| |________| |
_____|_____ _____|_____
|PGCluster1| |PGCluster2|
|JAVA App | | |
|__________| |__________|
JAVA AppにてPreparedStatementを使用しているため、
"postgresql.conf" の "not_replicate_prepared_select = true"で
試験しています。
以下が設定とログになりますが、
何か分かることがありましたらご教授下さい。
### 構成 ###
PGCluster1、PGCluster2共に、
pglb、cluster、pgreplicateで構成しており、
PGCluster1 の cluster には、
・PGCluster1 の pgreplicate
・PGCluster2 の pgreplicate
をレプリケーターとして設定し、
PGCluster2 の cluster にも、
・PGCluster1 の pgreplicate
・PGCluster2 の pgreplicate
をレプリケーターとして設定しています。
PGCluster1 の pgreplicate には、
・PGCluster1 の cluster
・PGCluster2 の cluster
をクラスターとして設定し、
PGCluster2 の pgreplicate には、
・PGCluster1 の cluster
・PGCluster2 の cluster
をクラスターとして設定しています。
### ログ ###
・PGCluster1
OS:RHEL5.4
ホスト名:TESTSV
IP:10.100.10.151
・PGCluster2
OS:RHEL5.4
ホスト名:TESTSV-BAK
IP:10.100.10.152
※PGCluster1 の pglbにはAliasIPとして、10.100.10.51を割り当てています
pgreplicate.log 抜粋
処理開始
Thu Oct 7 16:59:00 2010 PGRdo_replicate():query ::
Thu Oct 7 16:59:00 2010 cmdSts=P
Thu Oct 7 16:59:00 2010 cmdType=P
Thu Oct 7 16:59:00 2010 rlog=0
Thu Oct 7 16:59:00 2010 port=5431
Thu Oct 7 16:59:00 2010 pid=26803
Thu Oct 7 16:59:00 2010 from_host=10.100.10.151
Thu Oct 7 16:59:00 2010 dbName=nxsexpdb
Thu Oct 7 16:59:00 2010 userName=nextgen
Thu Oct 7 16:59:00 2010 recieve sec=1286438340
Thu Oct 7 16:59:00 2010 recieve usec=258290
Thu Oct 7 16:59:00 2010 query_size=876
Thu Oct 7 16:59:00 2010 request_id=11
Thu Oct 7 16:59:00 2010 replicate_id=0
Thu Oct 7 16:59:00 2010 recovery_status=0
Thu Oct 7 16:59:00 2010 sem_lock [1] req
Thu Oct 7 16:59:00 2010 sem_lock [1] got it
Thu Oct 7 16:59:00 2010 source host
Thu Oct 7 16:59:00 2010 start thread_send_cluster()
Thu Oct 7 16:59:00 2010 thread_send_cluster():return value from
send_replicate_packet_to_server() is 0
Thu Oct 7 16:59:00 2010 thread_send_cluster():pthread_exit[1]
Thu Oct 7 16:59:00 2010 end thread_send_source()
Thu Oct 7 16:59:00 2010 sem_unlock[1]
Thu Oct 7 16:59:00 2010 PGRdo_replicate():query ::
Thu Oct 7 16:59:00 2010 cmdSts=P
Thu Oct 7 16:59:00 2010 cmdType=B
Thu Oct 7 16:59:00 2010 rlog=0
Thu Oct 7 16:59:00 2010 port=5431
Thu Oct 7 16:59:00 2010 pid=26803
Thu Oct 7 16:59:00 2010 from_host=10.100.10.151
Thu Oct 7 16:59:00 2010 dbName=nxsexpdb
Thu Oct 7 16:59:00 2010 userName=nextgen
Thu Oct 7 16:59:00 2010 recieve sec=1286438340
Thu Oct 7 16:59:00 2010 recieve usec=259728
Thu Oct 7 16:59:00 2010 query_size=207
Thu Oct 7 16:59:00 2010 request_id=12
Thu Oct 7 16:59:00 2010 replicate_id=0
Thu Oct 7 16:59:00 2010 recovery_status=0
Thu Oct 7 16:59:00 2010 sem_lock [1] req
Thu Oct 7 16:59:00 2010 sem_lock [1] got it
Thu Oct 7 16:59:00 2010 source host
Thu Oct 7 16:59:00 2010 start thread_send_cluster()
Thu Oct 7 16:59:00 2010 thread_send_cluster():return value from
send_replicate_packet_to_server() is 0
Thu Oct 7 16:59:00 2010 thread_send_cluster():pthread_exit[1]
Thu Oct 7 16:59:00 2010 end thread_send_source()
Thu Oct 7 16:59:00 2010 sem_unlock[1]
Thu Oct 7 16:59:00 2010 PGRdo_replicate():query :: P
Thu Oct 7 16:59:00 2010 cmdSts=P
Thu Oct 7 16:59:00 2010 cmdType=D
Thu Oct 7 16:59:00 2010 rlog=0
Thu Oct 7 16:59:00 2010 port=5431
Thu Oct 7 16:59:00 2010 pid=26803
Thu Oct 7 16:59:00 2010 from_host=10.100.10.151
Thu Oct 7 16:59:00 2010 dbName=nxsexpdb
Thu Oct 7 16:59:00 2010 userName=nextgen
Thu Oct 7 16:59:00 2010 recieve sec=1286438340
Thu Oct 7 16:59:00 2010 recieve usec=260864
Thu Oct 7 16:59:00 2010 query_size=3
Thu Oct 7 16:59:00 2010 request_id=13
Thu Oct 7 16:59:00 2010 replicate_id=0
Thu Oct 7 16:59:00 2010 recovery_status=0
Thu Oct 7 16:59:00 2010 sem_lock [1] req
Thu Oct 7 16:59:00 2010 sem_lock [1] got it
Thu Oct 7 16:59:00 2010 source host
Thu Oct 7 16:59:00 2010 start thread_send_cluster()
Thu Oct 7 16:59:00 2010 thread_send_cluster():return value from
send_replicate_packet_to_server() is 0
Thu Oct 7 16:59:00 2010 thread_send_cluster():pthread_exit[1]
Thu Oct 7 16:59:00 2010 end thread_send_source()
Thu Oct 7 16:59:00 2010 sem_unlock[1]
Thu Oct 7 16:59:00 2010 PGRdo_replicate():query ::
Thu Oct 7 16:59:00 2010 cmdSts=P
Thu Oct 7 16:59:00 2010 cmdType=E
Thu Oct 7 16:59:00 2010 rlog=0
Thu Oct 7 16:59:00 2010 port=5431
Thu Oct 7 16:59:00 2010 pid=26803
Thu Oct 7 16:59:00 2010 from_host=10.100.10.151
Thu Oct 7 16:59:00 2010 dbName=nxsexpdb
Thu Oct 7 16:59:00 2010 userName=nextgen
Thu Oct 7 16:59:00 2010 recieve sec=1286438340
Thu Oct 7 16:59:00 2010 recieve usec=261862
Thu Oct 7 16:59:00 2010 query_size=6
Thu Oct 7 16:59:00 2010 request_id=14
Thu Oct 7 16:59:00 2010 replicate_id=0
Thu Oct 7 16:59:00 2010 recovery_status=0
Thu Oct 7 16:59:00 2010 sem_lock [1] req
Thu Oct 7 16:59:00 2010 sem_lock [1] got it
Thu Oct 7 16:59:00 2010 source host
Thu Oct 7 16:59:00 2010 start thread_send_cluster()
Thu Oct 7 16:59:00 2010 thread_send_cluster():return value from
send_replicate_packet_to_server() is 0
Thu Oct 7 16:59:00 2010 thread_send_cluster():pthread_exit[1]
Thu Oct 7 16:59:00 2010 end thread_send_source()
Thu Oct 7 16:59:00 2010 sem_unlock[1]
Thu Oct 7 16:59:00 2010 PGRdo_replicate():query ::
Thu Oct 7 16:59:00 2010 cmdSts=P
Thu Oct 7 16:59:00 2010 cmdType=S
Thu Oct 7 16:59:00 2010 rlog=0
Thu Oct 7 16:59:00 2010 port=5431
Thu Oct 7 16:59:00 2010 pid=26803
Thu Oct 7 16:59:00 2010 from_host=10.100.10.151
Thu Oct 7 16:59:00 2010 dbName=nxsexpdb
Thu Oct 7 16:59:00 2010 userName=nextgen
Thu Oct 7 16:59:00 2010 recieve sec=1286438340
Thu Oct 7 16:59:00 2010 recieve usec=263037
Thu Oct 7 16:59:00 2010 query_size=1
Thu Oct 7 16:59:00 2010 request_id=15
Thu Oct 7 16:59:00 2010 replicate_id=0
Thu Oct 7 16:59:00 2010 recovery_status=0
Thu Oct 7 16:59:00 2010 sem_lock [1] req
Thu Oct 7 16:59:00 2010 sem_lock [1] got it
Thu Oct 7 16:59:00 2010 source host
Thu Oct 7 16:59:00 2010 start thread_send_cluster() →LAN抜線障害発生
Thu Oct 7 16:59:05 2010 PGRcreateConn(): timeout was occured
Thu Oct 7 16:59:23 2010 PGRcreateConn(): timeout was occured
Thu Oct 7 16:59:41 2010 PGRcreateConn(): timeout was occured
Thu Oct 7 16:59:41 2010
PGRsend_load_balance_packet():host[10.100.10.51] port[6001]
→sem_unlockが呼ばれない
Thu Oct 7 17:00:01 2010 PGRdo_replicate():query ::
Thu Oct 7 17:00:01 2010 cmdSts=P
Thu Oct 7 17:00:01 2010 PGRwrite_rlog():delete_query_log
Thu Oct 7 17:00:01 2010 cmdType=S
Thu Oct 7 17:00:01 2010 rlog=3
Thu Oct 7 17:00:01 2010 port=5431
Thu Oct 7 17:00:01 2010 pid=26803
Thu Oct 7 17:00:01 2010 from_host=10.100.10.151
Thu Oct 7 17:00:01 2010 dbName=nxsexpdb
Thu Oct 7 17:00:01 2010 userName=nextgen
Thu Oct 7 17:00:01 2010 recieve sec=1286438401
Thu Oct 7 17:00:01 2010 recieve usec=269332
Thu Oct 7 17:00:01 2010 PGRdo_replicate():query ::
Thu Oct 7 17:00:01 2010 query_size=0
Thu Oct 7 17:00:01 2010 cmdSts=P
Thu Oct 7 17:00:01 2010 request_id=15
Thu Oct 7 17:00:01 2010 cmdType=P
Thu Oct 7 17:00:01 2010 replicate_id=0
Thu Oct 7 17:00:01 2010 rlog=0
Thu Oct 7 17:00:01 2010 recovery_status=0
Thu Oct 7 17:00:01 2010 port=5431
Thu Oct 7 17:00:01 2010 sem_lock [1] req
Thu Oct 7 17:00:01 2010 pid=28248
Thu Oct 7 17:00:01 2010 from_host=10.100.10.151
Thu Oct 7 17:00:01 2010 dbName=nxsexpdb
Thu Oct 7 17:00:01 2010 userName=nextgen
Thu Oct 7 17:00:01 2010 recieve sec=1286438401
Thu Oct 7 17:00:01 2010 recieve usec=269934
Thu Oct 7 17:00:01 2010 query_size=876
Thu Oct 7 17:00:01 2010 request_id=1
Thu Oct 7 17:00:01 2010 replicate_id=0
Thu Oct 7 17:00:01 2010 recovery_status=0
Thu Oct 7 17:00:01 2010 sem_lock [1] req
→lockが取得できずに固まる
Thu Oct 7 17:01:01 2010 PGRdo_replicate():query ::
Thu Oct 7 17:01:01 2010 PGRwrite_rlog():delete_query_log
Thu Oct 7 17:01:01 2010 cmdSts=P
Thu Oct 7 17:01:01 2010 cmdType=P
Thu Oct 7 17:01:01 2010 rlog=3
Thu Oct 7 17:01:01 2010 port=5431
Thu Oct 7 17:01:01 2010 pid=28248
Thu Oct 7 17:01:01 2010 from_host=10.100.10.151
Thu Oct 7 17:01:01 2010 dbName=nxsexpdb
Thu Oct 7 17:01:01 2010 userName=nextgen
Thu Oct 7 17:01:01 2010 recieve sec=1286438461
Thu Oct 7 17:01:01 2010 recieve usec=264844
Thu Oct 7 17:01:01 2010 query_size=0
Thu Oct 7 17:01:01 2010 request_id=1
Thu Oct 7 17:01:01 2010 replicate_id=0
Thu Oct 7 17:01:01 2010 recovery_status=0
Thu Oct 7 17:01:01 2010 sem_lock [1] req
→lockが取得できずに固まる
Thu Oct 7 17:01:04 2010 PGRdo_replicate():query ::
Thu Oct 7 17:01:04 2010 PGRwrite_rlog():delete_query_log
Thu Oct 7 17:01:04 2010 cmdSts=P
Thu Oct 7 17:01:04 2010 cmdType=S
Thu Oct 7 17:01:04 2010 rlog=3
Thu Oct 7 17:01:04 2010 port=5431
Thu Oct 7 17:01:04 2010 pid=26803
Thu Oct 7 17:01:04 2010 from_host=10.100.10.151
Thu Oct 7 17:01:04 2010 dbName=nxsexpdb
Thu Oct 7 17:01:04 2010 userName=nextgen
Thu Oct 7 17:01:04 2010 recieve sec=1286438464
Thu Oct 7 17:01:04 2010 recieve usec=267610
Thu Oct 7 17:01:04 2010 query_size=0
Thu Oct 7 17:01:04 2010 request_id=15
Thu Oct 7 17:01:04 2010 replicate_id=0
Thu Oct 7 17:01:04 2010 recovery_status=0
Thu Oct 7 17:01:04 2010 sem_lock [1] req
→lockが取得できずに固まる
Thu Oct 7 17:02:01 2010 PGRdo_replicate():query ::
Thu Oct 7 17:02:01 2010 PGRwrite_rlog():delete_query_log
Thu Oct 7 17:02:01 2010 cmdSts=P
Thu Oct 7 17:02:01 2010 cmdType=P
Thu Oct 7 17:02:01 2010 rlog=3
Thu Oct 7 17:02:01 2010 port=5431
Thu Oct 7 17:02:01 2010 pid=28248
Thu Oct 7 17:02:01 2010 from_host=10.100.10.151
Thu Oct 7 17:02:01 2010 dbName=nxsexpdb
Thu Oct 7 17:02:01 2010 userName=nextgen
Thu Oct 7 17:02:01 2010 recieve sec=1286438521
Thu Oct 7 17:02:01 2010 recieve usec=260350
Thu Oct 7 17:02:01 2010 query_size=0
Thu Oct 7 17:02:01 2010 request_id=1
Thu Oct 7 17:02:01 2010 replicate_id=0
Thu Oct 7 17:02:01 2010 recovery_status=0
Thu Oct 7 17:02:01 2010 sem_lock [1] req
→lockが取得できずに固まる
Thu Oct 7 17:02:04 2010 PGRdo_replicate():query ::
Thu Oct 7 17:02:04 2010 PGRwrite_rlog():delete_query_log
Thu Oct 7 17:02:04 2010 cmdSts=P
Thu Oct 7 17:02:04 2010 cmdType=S
Thu Oct 7 17:02:04 2010 rlog=3
Thu Oct 7 17:02:04 2010 port=5431
Thu Oct 7 17:02:04 2010 pid=26803
Thu Oct 7 17:02:04 2010 from_host=10.100.10.151
Thu Oct 7 17:02:04 2010 dbName=nxsexpdb
Thu Oct 7 17:02:04 2010 userName=nextgen
Thu Oct 7 17:02:04 2010 recieve sec=1286438524
Thu Oct 7 17:02:04 2010 recieve usec=269091
Thu Oct 7 17:02:04 2010 query_size=0
Thu Oct 7 17:02:04 2010 request_id=15
Thu Oct 7 17:02:04 2010 replicate_id=0
Thu Oct 7 17:02:04 2010 recovery_status=0
Thu Oct 7 17:02:04 2010 sem_lock [1] req
→lockが取得できずに固まる
Thu Oct 7 17:03:01 2010 PGRdo_replicate():query ::
Thu Oct 7 17:03:01 2010 PGRwrite_rlog():delete_query_log
Thu Oct 7 17:03:01 2010 cmdSts=P
Thu Oct 7 17:03:01 2010 cmdType=P
Thu Oct 7 17:03:01 2010 rlog=3
Thu Oct 7 17:03:01 2010 port=5431
Thu Oct 7 17:03:01 2010 pid=28248
Thu Oct 7 17:03:01 2010 from_host=10.100.10.151
Thu Oct 7 17:03:01 2010 dbName=nxsexpdb
Thu Oct 7 17:03:01 2010 userName=nextgen
Thu Oct 7 17:03:01 2010 recieve sec=1286438581
Thu Oct 7 17:03:01 2010 recieve usec=255863
Thu Oct 7 17:03:01 2010 query_size=0
Thu Oct 7 17:03:01 2010 request_id=1
Thu Oct 7 17:03:01 2010 replicate_id=0
Thu Oct 7 17:03:01 2010 recovery_status=0
Thu Oct 7 17:03:01 2010 sem_lock [1] req
→lockが取得できずに固まる
■コンフィグ
cluster.conf、pglb.conf、pgreplicate.confの設定内容を記載します。
設定は0系(基本的にactで動かす系)のものを記載しています。
(1系はホストの設定が0系と逆になっているだけで後は同じ設定です)
●cluster.conf
#============================================================
# Cluster DB Server configuration file
#------------------------------------------------------------
# file: cluster.conf
#------------------------------------------------------------
<Replicate_Server_Info>
<Host_Name> TESTSV </Host_Name>
<Port> 8001 </Port>
<Recovery_Port> 8101 </Recovery_Port>
</Replicate_Server_Info>
<Replicate_Server_Info>
<Host_Name> TESTSV-BAK </Host_Name>
<Port> 8001 </Port>
<Recovery_Port> 8101 </Recovery_Port>
</Replicate_Server_Info>
#-------------------------------------------------------------
<Host_Name> TESTSV </Host_Name>
<Recovery_Port> 7001 </Recovery_Port>
<Rsync_Path> /usr/bin/rsync </Rsync_Path>
<Rsync_Option> ssh -2 </Rsync_Option>
<Rsync_Compress> yes
</Rsync_Compress>
<Rsync_Timeout> 1min </Rsync_Timeout>
#<Rsync_Timeout> 10min </Rsync_Timeout>
<Rsync_Bwlimit> 0KB </Rsync_Bwlimit>
<Pg_Dump_Path> /usr/local/pgsql/bin/pg_dump </Pg_Dump_Path>
<Ping_Path> /bin/ping </Ping_Path>
<When_Stand_Alone> read_only </When_Stand_Alone>
<Replication_Timeout> 1min </Replication_Timeout>
<LifeCheck_Timeout> 3s </LifeCheck_Timeout>
<LifeCheck_Interval> 11s </LifeCheck_Interval>
#-------------------------------------------------------------
<Not_Replicate_Info>
<DB_Name> * </DB_Name>
<Table_Name> * </Table_Name>
<Query> begin; select getdatabaseencoding(); commit </Query>
</Not_Replicate_Info>
<Not_Replicate_Info>
<DB_Name> * </DB_Name>
<Table_Name> * </Table_Name>
<Query> BEGIN; SELECT usesuper FROM pg_catalog.pg_user WHERE usename
= '%s'; COMMIT </Query>
</Not_Replicate_Info>
<Not_Replicate_Info>
<DB_Name> * </DB_Name>
<Table_Name> * </Table_Name>
<Query> SET autocommit TO 'on' </Query>
</Not_Replicate_Info>
<Not_Replicate_Info>
<DB_Name> * </DB_Name>
<Table_Name> * </Table_Name>
<Query> SET search_path = public </Query>
</Not_Replicate_Info>
●pglb.conf
#============================================================
# Load Balance Server configuration file
#-------------------------------------------------------------
# file: pglb.conf
#-------------------------------------------------------------
<Cluster_Server_Info>
<Host_Name> TESTSV </Host_Name>
<Port> 5431 </Port>
<Max_Connect> 32 </Max_Connect>
</Cluster_Server_Info>
#-------------------------------------------------------------
<Host_Name> 0.0.0.0 </Host_Name>
<Backend_Socket_Dir> /tmp </Backend_Socket_Dir>
<Receive_Port> 5432 </Receive_Port>
<Recovery_Port> 6001 </Recovery_Port>
<Max_Cluster_Num> 128
</Max_Cluster_Num>
<Use_Connection_Pooling> yes
</Use_Connection_Pooling>
<LifeCheck_Timeout> 3s
</LifeCheck_Timeout>
<LifeCheck_Interval> 15s </LifeCheck_Interval>
<Connection_Life_Time> 0s </Connection_Life_Time>
#-------------------------------------------------------------
<Log_File_Name> /var/log/pgsql/pglb.log </Log_File_Name>
<Log_File_Size> 10K </Log_File_Size>
<Log_Rotate> 3 </Log_Rotate>
●pgreplicate.conf
#=============================================================
# PGReplicate configuration file
#-------------------------------------------------------------
# file: pgreplicate.conf
#-------------------------------------------------------------
#-------------------------------------------------------------
<Cluster_Server_Info>
<Host_Name> TESTSV </Host_Name>
<Port> 5431 </Port>
<Recovery_Port> 7001 </Recovery_Port>
</Cluster_Server_Info>
<Cluster_Server_Info>
<Host_Name> TESTSV-BAK </Host_Name>
<Port> 5431 </Port>
<Recovery_Port> 7001 </Recovery_Port>
</Cluster_Server_Info>
#<Cluster_Server_Info>
#-------------------------------------------------------------
#-------------------------------------------------------------
<LoadBalance_Server_Info>
<Host_Name> TESTSV-FIP </Host_Name>
<Recovery_Port> 6001 </Recovery_Port>
</LoadBalance_Server_Info>
#-------------------------------------------------------------
<Host_Name> TESTSV </Host_Name>
<Replication_Port> 8001 </Replication_Port>
<Recovery_Port> 8101 </Recovery_Port>
<RLOG_Port> 8301 </RLOG_Port>
#<Use_Replication_Log> no </Use_Replication_Log>
<Use_Replication_Log> yes </Use_Replication_Log>
<Replication_Timeout> 1min </Replication_Timeout>
<LifeCheck_Timeout> 3s </LifeCheck_Timeout>
<LifeCheck_Interval> 15s </LifeCheck_Interval>
#-------------------------------------------------------------
<Log_File_Name> /var/log/pgsql/pgreplicate.log </Log_File_Name>
<Log_File_Size> 100K </Log_File_Size>
<Log_Rotate> 3 </Log_Rotate>
pgcluster メーリングリストの案内