[pgcluster: 789] デッドロック発生後、無応答となる

2005年 6月 17日 (金) 12:11:01 JST

はじめまして。小林と申します。

PGCluster 1.1.0cフルパッケージ版をインストールして
高負荷時の動きを確認していたところ、
デッドロック発生後にロック解除をずっと待った状態となり、
無応答となってしまいました。
PGCluster 1.1.1a、PGCluster1.3.0cでも
同様に無応答となります。

クラスタDBを１台のみとした場合は、
デッドロック発生後、一方のトランザクションをロールバックします。

必ず発生しているわけではなく、
数回はロック解除を待機せずにロールバックするのですが、
ほとんどが無応答となります。

原因や解決方法がわかる方がいらっしゃいましたら
ご教授いただけないでしょうか？

また、同様な現象が発生している方はいらっしゃいますでしょうか？

よろしくお願いいたします。

＜使用バージョン＞
    PGCluster1.1.0c

＜OS＞
    Red Hat Enterprise Linux ES 3

＜マシン構成＞
    ロードバランサ＋レプリケーションサーバ　１台
    クラスタDB　２台

＜再現方法＞
    同一のテーブルに対し、以下の２つのUPDATE文を同時に実行する

    １．update t_reservation_management set cdate=current_timestamp;

    ２．update t_reservation_management set 
reservation_start_date='xxxxxx',reservation_last_date='xxxxx' where meeting_id ='';

    互いのUPDATE文は、異なるカラムを更新している。
    １はpsqlより手動でSQLを実行する。
    ２はJDBCで６０回繰り返し実行する。１回のループごとにコネクションは解放する。

       --------------------------------------------------------------

＜ログ＞
■レプリケーションサーバ
DEBUG:sem_lock[1]
DEBUG:send_sync_data():sync_command(SELECT PGR_SYSTEM_COMMAND_FUNCTION(3,1118969809,93361,33,1) )
DEBUG:PGRreturn_result():PGRreturn_result[]
DEBUG:PGRreturn_result():wait for answer
DEBUG:answer[PGR_QUERY_DONE_NOTICE_CMD]
DEBUG:read_answer():QUERY DONE
DEBUG:PGRreturn_result():status of PGRreturn_result[0]
DEBUG:sem_unlock[1]
DEBUG:replicate_loop():replicate_loop selected
DEBUG:PGRreplicate_packet_send():cmdSts=T
DEBUG:PGRreplicate_packet_send():cmdType=U
DEBUG:PGRreplicate_packet_send():rlog=0
DEBUG:PGRreplicate_packet_send():request_id=489
DEBUG:PGRreplicate_packet_send():replicate_id=0
DEBUG:PGRreplicate_packet_send():port=5432
DEBUG:PGRreplicate_packet_send():pid=3449
DEBUG:PGRreplicate_packet_send():from_host=clusterdb2.pgcluster.jp
DEBUG:PGRreplicate_packet_send():dbName=mm_meet
DEBUG:PGRreplicate_packet_send():userName=postgre
DEBUG:PGRreplicate_packet_send():recieve sec=1118969809
DEBUG:PGRreplicate_packet_send():recieve usec=95226
DEBUG:PGRreplicate_packet_send():query_size=147
DEBUG:PGRreplicate_packet_send():query=update t_reservation_management set
reservation_start_date='2005-06-17 09:42',reservation_last_date='2005-06-17 21:42' where meeting_id
='BHCA0091'
DEBUG:sem_lock[1]
DEBUG:send_sync_data():sync_command(SELECT PGR_SYSTEM_COMMAND_FUNCTION(3,1118969809,95226,33,1) )
DEBUG:PGRreturn_result():PGRreturn_result[]
DEBUG:PGRreturn_result():wait for answer
DEBUG:answer[PGR_QUERY_DONE_NOTICE_CMD]
DEBUG:read_answer():QUERY DONE
DEBUG:PGRreturn_result():status of PGRreturn_result[0]
DEBUG:sem_unlock[1]
DEBUG:replicate_loop():replicate_loop selected
DEBUG:PGRreplicate_packet_send():cmdSts=Q
DEBUG:PGRreplicate_packet_send():cmdType=U
DEBUG:PGRreplicate_packet_send():rlog=0
DEBUG:PGRreplicate_packet_send():request_id=4
DEBUG:PGRreplicate_packet_send():replicate_id=0
DEBUG:PGRreplicate_packet_send():port=5432
DEBUG:PGRreplicate_packet_send():pid=3591
DEBUG:PGRreplicate_packet_send():from_host=clusterdb1.pgcluster.jp
DEBUG:PGRreplicate_packet_send():dbName=mm_meet
DEBUG:PGRreplicate_packet_send():userName=postgre
DEBUG:PGRreplicate_packet_send():recieve sec=1118969809
DEBUG:PGRreplicate_packet_send():recieve usec=98656
DEBUG:PGRreplicate_packet_send():query_size=59
DEBUG:PGRreplicate_packet_send():query=update t_reservation_management set cdate=current_timestamp
DEBUG:sem_lock[1]
DEBUG:PGRreturn_result():PGRreturn_result[3,1118969809,98656,34,1]
DEBUG:PGRreturn_result():wait for answer
DEBUG:send_sync_data():sync_command(SELECT PGR_SYSTEM_COMMAND_FUNCTION(3,1118969809,98656,34,1) )
DEBUG:answer[PGR_LOCK_CONFLICT_NOTICE_CMD]
DEBUG:read_answer():LOCK CONFLICT !!
DEBUG:PGRreturn_result():status of PGRreturn_result[2]
DEBUG:replicate_loop():replicate_loop selected
DEBUG:PGRreplicate_packet_send():cmdSts=T
DEBUG:PGRreplicate_packet_send():cmdType=U
DEBUG:PGRreplicate_packet_send():rlog=0
DEBUG:PGRreplicate_packet_send():request_id=490
DEBUG:PGRreplicate_packet_send():replicate_id=0
DEBUG:PGRreplicate_packet_send():port=5432
DEBUG:PGRreplicate_packet_send():pid=3449
DEBUG:PGRreplicate_packet_send():from_host=clusterdb2.pgcluster.jp
DEBUG:PGRreplicate_packet_send():dbName=mm_meet
DEBUG:PGRreplicate_packet_send():userName=postgre
DEBUG:PGRreplicate_packet_send():recieve sec=1118969809
DEBUG:PGRreplicate_packet_send():recieve usec=102684
DEBUG:PGRreplicate_packet_send():query_size=147
DEBUG:PGRreplicate_packet_send():query=update t_reservation_management set
reservation_start_date='2005-06-17 09:42',reservation_last_date='2005-06-17 21:42' where meeting_id
='BHCA0101'
DEBUG:sem_lock[1]
DEBUG:replicate_loop():replicate_loop selected
DEBUG:replicate_loop():replicate_loop selected
DEBUG:PGRreplicate_packet_send():cmdSts=T
DEBUG:PGRreplicate_packet_send():cmdType=U
DEBUG:PGRreplicate_packet_send():rlog=3
DEBUG:PGRreplicate_packet_send():request_id=490
DEBUG:PGRreplicate_packet_send():replicate_id=0
DEBUG:PGRreplicate_packet_send():port=5432
DEBUG:PGRreplicate_packet_send():pid=3449
DEBUG:PGRreplicate_packet_send():from_host=clusterdb2.pgcluster.jp
DEBUG:PGRreplicate_packet_send():dbName=mm_meet
DEBUG:PGRreplicate_packet_send():userName=postgre
DEBUG:PGRreplicate_packet_send():recieve sec=1118969899
DEBUG:PGRreplicate_packet_send():recieve usec=112435
DEBUG:PGRreplicate_packet_send():query_size=147
DEBUG:PGRreplicate_packet_send():query=update t_reservation_management set
reservation_start_date='2005-06-17 09:42',reservation_last_date='2005-06-17 21:42' where meeting_id
='BHCA0101'
DEBUG:sem_lock[1]
DEBUG:replicate_loop():replicate_loop selected
DEBUG:replicate_loop():replicate_loop selected
DEBUG:PGRreplicate_packet_send():cmdSts=T
DEBUG:PGRreplicate_packet_send():cmdType=U
DEBUG:PGRreplicate_packet_send():rlog=3
DEBUG:PGRreplicate_packet_send():request_id=490
DEBUG:PGRreplicate_packet_send():replicate_id=0
DEBUG:PGRreplicate_packet_send():port=5432
DEBUG:PGRreplicate_packet_send():pid=3449
DEBUG:PGRreplicate_packet_send():from_host=clusterdb2.pgcluster.jp
DEBUG:PGRreplicate_packet_send():dbName=mm_meet
DEBUG:PGRreplicate_packet_send():userName=postgre
DEBUG:PGRreplicate_packet_send():recieve sec=1118969989
DEBUG:PGRreplicate_packet_send():recieve usec=121566
DEBUG:PGRreplicate_packet_send():query_size=147
DEBUG:PGRreplicate_packet_send():query=update t_reservation_management set
reservation_start_date='2005-06-17 09:42',reservation_last_date='2005-06-17 21:42' where meeting_id
='BHCA0101'
DEBUG:sem_lock[1]
DEBUG:replicate_loop():replicate_loop selected
DEBUG:PGRreplicate_packet_send():cmdSts=Q
・・・

■クラスタDB1（マスタDB）
LOG:  XX000: could not create IPv6 socket: Address family not supported by protocol
LOCATION:  StreamServerPort, pqcomm.c:334
LOG:  00000: database system was shut down at 2005-06-17 11:12:02 JST
LOCATION:  StartupXLOG, xlog.c:2596
LOG:  00000: checkpoint record is at 0/12A2314
LOCATION:  StartupXLOG, xlog.c:2628
LOG:  00000: redo record is at 0/12A2314; undo record is at 0/0; shutdown TRUE
LOCATION:  StartupXLOG, xlog.c:2653
LOG:  00000: next transaction ID: 2032; next OID: 33527
LOCATION:  StartupXLOG, xlog.c:2656
LOG:  00000: database system is ready
LOCATION:  StartupXLOG, xlog.c:2946
ERROR:  57014: canceling query due to user request
LOCATION:  ProcessInterrupts, postgres.c:2011
LOG:  08006: could not receive data from client: Connection reset by peer
LOCATION:  pq_recvbuf, pqcomm.c:701
LOG:  08P01: unexpected EOF on client connection
LOCATION:  SocketBackend, postgres.c:284
LOG:  00000: received smart shutdown request
LOCATION:  pmdie, postmaster.c:1750
FATAL:  57P01: terminating connection due to administrator command
LOCATION:  ProcessInterrupts, postgres.c:2002
FATAL:  57P01: terminating connection due to administrator command
LOCATION:  ProcessInterrupts, postgres.c:2002
LOG:  00000: shutting down
LOCATION:  ShutdownXLOG, xlog.c:3090
LOG:  00000: database system is shut down
LOCATION:  ShutdownXLOG, xlog.c:3104

■クラスタDB２
LOG:  XX000: could not create IPv6 socket: Address family not supported by protocol
LOCATION:  StreamServerPort, pqcomm.c:334
LOG:  00000: database system was shut down at 2005-06-17 10:09:53 JST
LOCATION:  StartupXLOG, xlog.c:2596
LOG:  00000: checkpoint record is at 0/D00E58
LOCATION:  StartupXLOG, xlog.c:2628
LOG:  00000: redo record is at 0/D00E58; undo record is at 0/0; shutdown TRUE
LOCATION:  StartupXLOG, xlog.c:2653
LOG:  00000: next transaction ID: 1011; next OID: 25449
LOCATION:  StartupXLOG, xlog.c:2656
LOG:  00000: database system is ready
LOCATION:  StartupXLOG, xlog.c:2946
LOG:  00000: received smart shutdown request
LOCATION:  pmdie, postmaster.c:1750
FATAL:  57P01: terminating connection due to administrator command
LOCATION:  ProcessInterrupts, postgres.c:2002
LOG:  00000: server process (PID 3530) was terminated by signal 11
LOCATION:  LogChildExit, postmaster.c:2132
LOG:  00000: terminating any other active server processes
LOCATION:  CleanupProc, postmaster.c:2053
WARNING:  57P02: terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
LOCATION:  quickdie, postgres.c:1843
LOG:  00000: all server processes terminated; reinitializing
LOCATION:  reaper, postmaster.c:1965
LOG:  00000: database system was interrupted at 2005-06-17 11:12:09 JST
LOCATION:  StartupXLOG, xlog.c:2610
LOG:  00000: checkpoint record is at 0/D00E58
LOCATION:  StartupXLOG, xlog.c:2628
LOG:  00000: redo record is at 0/D00E58; undo record is at 0/0; shutdown TRUE
LOCATION:  StartupXLOG, xlog.c:2653
LOG:  00000: next transaction ID: 1011; next OID: 25449
LOCATION:  StartupXLOG, xlog.c:2656
LOG:  00000: database system was not properly shut down; automatic recovery in progress
LOCATION:  StartupXLOG, xlog.c:2705
LOG:  00000: redo starts at 0/D00E98
LOCATION:  StartupXLOG, xlog.c:2733
LOG:  00000: record with zero length at 0/F4F414
LOCATION:  ReadRecord, xlog.c:1900
LOG:  00000: redo done at 0/F4F3F0
LOCATION:  StartupXLOG, xlog.c:2765
LOG:  00000: database system is ready
LOCATION:  StartupXLOG, xlog.c:2946
LOG:  00000: shutting down
LOCATION:  ShutdownXLOG, xlog.c:3090
LOG:  00000: database system is shut down
LOCATION:  ShutdownXLOG, xlog.c:3104

＜設定ファイル＞
 ■cluster.conf
#============================================================
#          Cluster DB Server configuration file
#------------------------------------------------------------
# file: cluster.conf
#------------------------------------------------------------
# This file controls:
#       o which hosts & port are replication server
#       o which port use for replication request to replication server
#       o which command use for recovery function
#============================================================
#------------------------------------------------------------
# set Replication Server information
#  o Host_Name : hostname
#  o Port : connection for postmaster
#  o Recovery_Port : connection for recovery process
#------------------------------------------------------------
<Replicate_Server_Info>
 <Host_Name> pgcluster.pgcluster.jp </Host_Name>
 <Port> 8001 </Port>
 <Recovery_Port> 8101 </Recovery_Port>
 <LifeCheck_Port> 8201 </LifeCheck_Port>
</Replicate_Server_Info>
#<Replicate_Server_Info>
# <Host_Name> replicate2.postgres.jp </Host_Name>
# <Port> 8002 </Port>
# <Recovery_Port> 8102 </Recovery_Port>
# <LifeCheck_Port> 8202 </LifeCheck_Port>
#</Replicate_Server_Info>
#<Replicate_Server_Info>
# <Host_Name> replicate3.postgres.jp </Host_Name>
# <Port> 8003 </Port>
# <Recovery_Port> 8103 </Recovery_Port>
# <LifeCheck_Port> 8203 </LifeCheck_Port>
#</Replicate_Server_Info>
#-------------------------------------------------------------
# set Cluster DB Server information
#  o Recovery_Port : connection for recovery
#  o Rsync_Path : path of rsync command
#  o Rsync_Option : file transfer option for rsync
#       o When_Stand_Alone : When all replication servers fell,
#                            you can set up two kinds of permittion,
#                            "real_only" or "read_write".
#-------------------------------------------------------------
<Recovery_Port> 7101 </Recovery_Port>
<LifeCheck_Port> 7201 </LifeCheck_Port>
<Rsync_Path> /usr/bin/rsync </Rsync_Path>
<Rsync_Option> ssh -2 </Rsync_Option>
<When_Stand_Alone> read_only  </When_Stand_Alone>
<Status_Log_File>  /usr/local/pgsql/log/cluster.sts </Status_Log_File>
<Error_Log_File> /usr/local/pgsql/log/cluster.log  </Error_Log_File>
#-------------------------------------------------------------
# set partitional replicate control information
#     set DB name and Table name to stop reprication
#       o DB_Name : DB name
#       o Table_Name : table name
#-------------------------------------------------------------
#<Not_Replicate_Info>
# <DB_Name>     test_db      </DB_Name>
# <Table_Name>  log_table    </Table_Name>
#</Not_Replicate_Info>

■pglb.conf
#============================================================
#          Load Balance Server configuration file
#-------------------------------------------------------------
# file: pglb.conf
#-------------------------------------------------------------
# This file controls:
#       o which hosts are db cluster server
#       o which port  use connect to db cluster server
#       o how many connections are allowed on each DB server
#============================================================
#-------------------------------------------------------------
# set cluster DB server information
#  o Host_Name : hostname
#  o Port : connection for postmaster
#  o Max_Connection : maximun number of connection to postmaster
#-------------------------------------------------------------
<Cluster_Server_Info>
    <Host_Name>   clusterdb1.pgcluster.jp</Host_Name>
    <Port>        5432                </Port>
    <Max_Connect> 100                   </Max_Connect>
</Cluster_Server_Info>
<Cluster_Server_Info>
    <Host_Name>   clusterdb2.pgcluster.jp  </Host_Name>
    <Port>        5432                </Port>
    <Max_Connect> 100                  </Max_Connect>
</Cluster_Server_Info>
#<Cluster_Server_Info>
#    <Host_Name>   post3.postgres.jp   </Host_Name>
#    <Port>        5432                </Port>
#    <Max_Connect> 32                  </Max_Connect>
#</Cluster_Server_Info>
#-------------------------------------------------------------
# set Load Balance server information
#  o Backend_Socket_Dir : Unix domain socket path for the backend
#  o Receive_Port : connection from client
#  o Recovery_Port : connection for recovery process
#  o LifeCheck_Port : connection for life check process
#  o Max_Cluster_Num : maximun number of cluster DB servers
#  o Use_Connection_Pooling : use connection pool [yes/no]
#  o Max_Pool_Each_Server : number of pool connections/DB server
#-------------------------------------------------------------
<Backend_Socket_Dir>    /tmp     </Backend_Socket_Dir>
<Receive_Port>          5432     </Receive_Port>
<Recovery_Port>         6101     </Recovery_Port>
<LifeCheck_Port>        6201     </LifeCheck_Port>
<Max_Cluster_Num>        128     </Max_Cluster_Num>
<Use_Connection_Pooling>  yes     </Use_Connection_Pooling>

■pgreplicate.conf
#=============================================================
#  PGReplicate configuration file
#                                     for  PGCluster-1.1.0a
#-------------------------------------------------------------
# file: pgreplicate.conf
#-------------------------------------------------------------
# This file controls:
#       o which hosts & port are cluster server
#       o which port use for replication request from cluster server
#=============================================================
#
#-------------------------------------------------------------
# A setup of Cluster DB(s)
#
#  o Host_Name : The host name of Cluster DB.
#                -- please write a host name by FQDN.
#                -- do not write IP address.
#  o Port : The connection port with postmaster.
#  o Recovery_Port : The connection port at the time of
#                    a recovery sequence .
#  o LifeCheck_Port : connection for life check process
#-------------------------------------------------------------
<Cluster_Server_Info>
    <Host_Name>   clusterdb1.pgcluster.jp  </Host_Name>
    <Port>                5432        </Port>
    <Recovery_Port>       7101        </Recovery_Port>
    <LifeCheck_Port>      7201        </LifeCheck_Port>
</Cluster_Server_Info>
<Cluster_Server_Info>
    <Host_Name>   clusterdb2.pgcluster.jp </Host_Name>
    <Port>                5432        </Port>
    <Recovery_Port>       7101        </Recovery_Port>
    <LifeCheck_Port>      7201        </LifeCheck_Port>
</Cluster_Server_Info>
#<Cluster_Server_Info>
#    <Host_Name>   cluster3.postgres.jp   </Host_Name>
#    <Port>                5432        </Port>
#    <Recovery_Port>       7101        </Recovery_Port>
#    <LifeCheck_Port>      7201        </LifeCheck_Port>
#</Cluster_Server_Info>
#
#-------------------------------------------------------------
# A setup of Load Balance Server
#
#  o Host_Name : The host name of a load balance server.
#                -- please write a host name by FQDN.
#                -- do not write IP address.
#  o Recovery_Port : The connection port at the time of
#                    a recovery sequence .
#  o LifeCheck_Port : connection for life check process
#-------------------------------------------------------------
<LoadBalance_Server_Info>
    <Host_Name>   pgcluster.pgcluster.jp  </Host_Name>
    <Recovery_Port>       6101        </Recovery_Port>
    <LifeCheck_Port>      6201        </LifeCheck_Port>
</LoadBalance_Server_Info>
#------------------------------------------------------------
# A setup of the upper replication server for cascade connection.
#
#  o Host_Name : The host name of Cluster DB.
#                -- please write a host name by FQDN.
#                -- do not write IP address.
#  o Port : The connection port with postmaster.
#  o Recovery_Port : The connection port at the time of
#                    a recovery sequence .
#  o LifeCheck_Port : connection for life check process
#------------------------------------------------------------
#<Replicate_Server_Info>
#    <Host_Name> upper_replicate.postgres.jp </Host_Name>
#    <Port>                   8001           </Port>
#    <Recovery_Port>          8101           </Recovery_Port>
#    <LifeCheck_Port>         8201           </LifeCheck_Port>
#</Replicate_Server_Info>
#
#-------------------------------------------------------------
# A setup of a replication server
#
#  o Status_Log_File : logging file of cluster db's status
#  o Error_Log_File : logging file of error and warning
#  o Replicate_Port : connection for reprication
#  o Recovery_Port : connection for recovery
#  o LifeCheck_Port : connection for life check process
#  o Response_mode : timing which returns a response
#    normal   -- return result of DB which received the query
#    reliable -- return result after waiting for response of
#                      all Cluster DBs.
#  o Use_Replication_Log : When this server hangs up without
#                               being replicated to the end,
#                               a remote server continues the
#                               replication using this log.
#    yes  --  use replication log
#    no   --  not use replication log
#  o Reserved_Connections : The number of reserved connections
#                                from this replication server
#                                to each cluster dbs.
#                                (default is 1).
#-------------------------------------------------------------
<Status_Log_File>  /usr/local/pgsql/log/pgreplicate.sts  </Status_Log_File>
<Error_Log_File>   /usr/local/pgsql/log/pgreplicate.log  </Error_Log_File>
<Replication_Port>       8001            </Replication_Port>
<Recovery_Port>          8101            </Recovery_Port>
<LifeCheck_Port>         8201            </LifeCheck_Port>

以上です。