[pgcluster: 975] pglb crashes or seems to lose track of the clusters during failover tests

Wed, 4 Apr 2007 22:23:10 JST

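Good evening. This is my first post here; my name is Nakano.
I am evaluating PGCluster for use in our in-house environment, but in cluster failover tests the load balancer does not behave the way I expect.
I suspect my configuration is wrong. Could someone point me in the right direction?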

Environment)
OS: Red Hat Enterprise Linux WS 4
PGCluster version: 1.5.0rc16

configure options)
$ configure --enable-thread-safety --enable-nls=ja --enable-multibyte=UNICODE --with-perl --with-python --with-tcl --with-openssl

I prepared three hosts: one runs pglb, one runs pgreplicate, and the remaining one runs two cluster servers.
   hostA             hostB            hostC
------------   ----------------   -------------
|          |   |              |   |           |
|          |   | ------------ |   |           |
|          |---| |postmaster| |---|           |
|   pglb   |   | ------------ |   |pgreplicate|
|(port5432)|   |  (port5440)  |   |           |
|          |   |              |   |           |
|          |   | ------------ |   |           |
|          |---| |postmaster| |---|           |
|          |   | ------------ |   |           |
|          |   |  (port5441)  |   |           |
|          |   |              |   |           |
------------   ----------------   -------------

Startup procedure)
I start the two clusters, then the replicator, then the balancer, in that order.
  hostB% pg_ctl start -D /usr/local/pgsql/data
  hostB% pg_ctl start -D /usr/local/pgsql/data2
  hostC$ pgreplicate -D /usr/local/pgsql/etc -l
  hostA# pglb -D /usr/local/pgsql/etc -n -v -l

I then connected to pglb on hostA with psql and confirmed that reads and updates work correctly.


Symptom 1)
With connection pooling turned on in pglb, if I stop the cluster server listed first in pglb.conf/pgreplicate.conf (5440 ＠ hostB), pglb on hostA goes down a few minutes later.
pglb log (debug mode):
2007-04-03 21:08:49 [27163] DEBUG:PGRset_status_on_cluster_tbl():host:hostB port:5440 max:2 use:0 status1
2007-04-03 21:08:49 [27163] DEBUG:PGRset_status_on_cluster_tbl():host:hostB port:5441 max:2 use:0 status1
2007-04-03 21:08:49 [27163] DEBUG:init_pglb():Child_Tbl size is[144]
2007-04-03 21:08:49 [27163] DEBUG:PGRcreate_child():create child [5440 ＠ hostB]
2007-04-03 21:08:49 [27163] DEBUG:PGRcreate_child():create child [5440 ＠ hostB]
2007-04-03 21:08:49 [27163] DEBUG:PGRcreate_child():create child [5441 ＠ hostB]
2007-04-03 21:08:49 [27163] DEBUG:PGRcreate_child():create child [5441 ＠ hostB]
        --- at this point I ran pg_ctl stop -D /usr/local/pgsql/data -m i against 5440 ＠ hostB
2007-04-03 21:09:56 [27164] DEBUG:set_recovery():received no:101
2007-04-03 21:09:56 [27164] DEBUG:PGRset_status_on_cluster_tbl():host:hp8193 port:5440 max:2 use:1 status99
2007-04-03 21:09:56 [27163] ERROR:scan_cluster_by_pid():pid:27164 not found in child table
2007-04-03 21:09:56 [27163] ERROR:scan_cluster_by_pid():pid:27166 not found in child table
---- pglb goes down here.
pgreplicate DEBUG log from the same moment:
2007-04-03 21:09:56 [9807] ERROR:PGRcreateConn():Retry. h_errno is 0,reason is 'could not connect to server: Connection refused
        Is the server running on host "172.xx.xx.xxx" and accepting
        TCP/IP connections on port 5440?
'
2007-04-03 21:09:56 [9807] ERROR:PGRcreateConn():Retry. h_errno is 0,reason is 'could not connect to server: Connection refused
        Is the server running on host "172.xx.xx.xxx" and accepting
        TCP/IP connections on port 5440?
'
2007-04-03 21:09:56 [9807] DEBUG:PGRsend_load_balance_packet():host[hostA] port[6001]

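After restarting pglb, if I start 5440 ＠ hostB in recovery mode, it synchronizes via rsync and then returns to normal operation.
When I stop the second cluster (5441 ＠ hostB) instead, the balancer does not go down.
I would like the balancer to stay up when the first cluster goes down, and to keep serving from the second cluster.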
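
To compare runs, it can help to pull the per-cluster status changes out of the pglb debug log. The following is a minimal sketch that assumes the `PGRset_status_on_cluster_tbl()` line format exactly as quoted above; the function name `status_history` is made up for illustration, and the meaning of the numeric status values (1, 99, ...) is not interpreted here.

```python
import re
from collections import defaultdict

# Matches the status lines as they appear in the pglb debug log quoted above.
STATUS_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?P<pid>\d+)\] "
    r"DEBUG:PGRset_status_on_cluster_tbl\(\):"
    r"host:(?P<host>\S+)\s+port:(?P<port>\d+) "
    r"max:(?P<max>\d+) use:(?P<use>\d+) status(?P<status>\d+)"
)

def status_history(lines):
    """Return {(host, port): [(timestamp, pid, status), ...]} in log order."""
    history = defaultdict(list)
    for line in lines:
        m = STATUS_RE.match(line.strip())
        if m:
            key = (m["host"], int(m["port"]))
            history[key].append((m["ts"], int(m["pid"]), int(m["status"])))
    return dict(history)

# Three lines copied from the Symptom 1 log. The host name is taken as logged,
# so the third line is keyed under the name that appears in that line.
sample = [
    "2007-04-03 21:08:49 [27163] DEBUG:PGRset_status_on_cluster_tbl():host:hostB port:5440 max:2 use:0 status1",
    "2007-04-03 21:08:49 [27163] DEBUG:PGRset_status_on_cluster_tbl():host:hostB port:5441 max:2 use:0 status1",
    "2007-04-03 21:09:56 [27164] DEBUG:PGRset_status_on_cluster_tbl():host:hp8193 port:5440 max:2 use:1 status99",
]

for (host, port), events in status_history(sample).items():
    for ts, pid, status in events:
        print(f"{ts} pid={pid} {host}:{port} status={status}")
```

Lines that do not match the pattern (the ERROR lines, for example) are simply skipped, so the whole log file can be fed in unchanged.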


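Symptom 2)
With connection pooling turned off in pglb, I stop the cluster server listed first in pglb.conf/pgreplicate.conf (5440 ＠ hostB), as in Symptom 1. When I then connect to pglb with psql, I get the following error and cannot continue operating.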
psql: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
Running pglb in debug mode, it appears that even after it has recognized that the cluster server is down, it keeps connecting only to that same cluster (5440 ＠ hostB).
2007-04-03 20:10:28 [7957] DEBUG:PGRscan_cluster:0 ClusterDB can be used
2007-04-03 20:10:28 [7957] DEBUG:PGRscan_cluster:hostB [5440],useFlag->2 max->2 use_num->0
2007-04-03 20:10:28 [8061] DEBUG:PGRdo_child():I am 8061
2007-04-03 20:10:28 [8061] DEBUG:do_accept():I am 8061 accept fd 6
2007-04-03 20:10:28 [8061] DEBUG:read_startup_packet():Protocol Major: 1234 Minor: 5679 database:  user:
2007-04-03 20:10:28 [8061] DEBUG:PGRdo_child():SSLRequest: sent N; retry startup
2007-04-04 20:10:28 [8061] DEBUG:read_startup_packet():Protocol Major: 3 Minor: 0 database: master user: postgres
2007-04-03 20:10:28 [8061] ERROR:connect_inet_domain_socket(): connect() failed: Connection refused
2007-04-03 20:10:28 [8061] DEBUG:PGRset_status_on_cluster_tbl():host:hostB port:5440 max:2 use:2 status98
2007-04-03 20:10:40 [7957] ERROR:load_balance_main():all clusters were dead.
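Looking closely at the log, messages such as
"0 ClusterDB can be used" and "all clusters were dead."
suggest that pglb mistakenly believes every cluster is down.
(I confirmed that the second cluster server is alive by connecting to it directly with psql.)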
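
The direct check can also be scripted so that it is run at the same moment pglb reports "all clusters were dead.". This is a minimal sketch using a plain TCP connect; the function names `port_is_open` and `probe_all` are made up for illustration, and the 3-second timeout is only borrowed from the LifeCheck_Timeout value in the configuration. A successful connect shows only that something accepts connections on the port, not that PostgreSQL is healthy.

```python
import socket

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False

def probe_all(backends, timeout=3.0):
    """Probe every (host, port) pair and return {(host, port): bool}."""
    return {(host, port): port_is_open(host, port, timeout) for host, port in backends}
```

For the layout in this post the call would be `probe_all([("hostB", 5440), ("hostB", 5441)])`, run from hostA so that it sees the same network path as pglb.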

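Thinking that running multiple clusters on the same host might be the cause, I set up a new host, hostD, and moved the second cluster there, but both cases show the same symptoms.
The symptoms are also the same whether the host names in the conf files are short names without a domain or FQDNs.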


pglb.conf on hostA ---
<Cluster_Server_Info>
    <Host_Name>   hostB </Host_Name>
    <Port>        5440  </Port>
    <Max_Connect> 2     </Max_Connect>
</Cluster_Server_Info>
<Cluster_Server_Info>
    <Host_Name>   hostB </Host_Name>
    <Port>        5441  </Port>
    <Max_Connect> 2     </Max_Connect>
</Cluster_Server_Info>
<Host_Name>   hostA  </Host_Name>
<Backend_Socket_Dir>  /tmp </Backend_Socket_Dir>
<Receive_Port>        5432 </Receive_Port>
<Recovery_Port>       6001 </Recovery_Port>
<Max_Cluster_Num>     3    </Max_Cluster_Num>
<Use_Connection_Pooling> no </Use_Connection_Pooling>  # set to yes for Symptom 1
<LifeCheck_Timeout>      3s </LifeCheck_Timeout>
<LifeCheck_Interval>    15s </LifeCheck_Interval>
<Log_File_Info>
        <File_Name> /usr/local/pgsql/log/pglb.log </File_Name>
        <File_Size> 1M </File_Size>
        <Rotate> 3 </Rotate>
</Log_File_Info>


cluster.conf for 5440 ＠ hostB -----
<Replicate_Server_Info>
        <Host_Name> hostC </Host_Name>
        <Port> 8001 </Port>
        <Recovery_Port> 8101 </Recovery_Port>
</Replicate_Server_Info>
<Host_Name> hostB </Host_Name>
<Recovery_Port> 7040 </Recovery_Port>
<Rsync_Path> /usr/bin/rsync </Rsync_Path>
<Rsync_Option> ssh -1 </Rsync_Option>
<Rsync_Compress> yes </Rsync_Compress>
<Pg_Dump_Path> /usr/local/pgsql/bin/pg_dump </Pg_Dump_Path>
<When_Stand_Alone> read_only </When_Stand_Alone>
<Replication_Timeout>   1min </Replication_Timeout>
<LifeCheck_Timeout> 3s </LifeCheck_Timeout>
<LifeCheck_Interval> 11s </LifeCheck_Interval>


cluster.conf for 5441 ＠ hostB -----
<Replicate_Server_Info>
        <Host_Name> hostC </Host_Name>
        <Port> 8001 </Port>
        <Recovery_Port> 8101 </Recovery_Port>
</Replicate_Server_Info>
<Host_Name> hostB </Host_Name>
<Recovery_Port> 7040 </Recovery_Port>
<Rsync_Path> /usr/bin/rsync </Rsync_Path>
<Rsync_Option> ssh -1 </Rsync_Option>
<Rsync_Compress> yes </Rsync_Compress>
<Pg_Dump_Path> /usr/local/pgsql/bin/pg_dump </Pg_Dump_Path>
<When_Stand_Alone> read_only </When_Stand_Alone>
<Replication_Timeout>   1min </Replication_Timeout>
<LifeCheck_Timeout> 3s </LifeCheck_Timeout>
<LifeCheck_Interval> 11s </LifeCheck_Interval>


pgreplicate.conf on hostC ---
<Cluster_Server_Info>
    <Host_Name>     hostB </Host_Name>
    <Port>          5440  </Port>
    <Recovery_Port> 7040  </Recovery_Port>
</Cluster_Server_Info>
<Cluster_Server_Info>
    <Host_Name>     hostB </Host_Name>
    <Port>          5441  </Port>
    <Recovery_Port> 7041  </Recovery_Port>
</Cluster_Server_Info>
<LoadBalance_Server_Info>
        <Host_Name>     hostA </Host_Name>
        <Recovery_Port> 6001  </Recovery_Port>
</LoadBalance_Server_Info>
<Host_Name> hostC </Host_Name>
<Replication_Port>              8001            </Replication_Port>
<Recovery_Port>                 8101            </Recovery_Port>
<RLOG_Port>                     8301            </RLOG_Port>
<Response_Mode>                 normal          </Response_Mode>
<Use_Replication_Log>   no                      </Use_Replication_Log>
<Replication_Timeout>   1min                    </Replication_Timeout>
<LifeCheck_Timeout>     3s                      </LifeCheck_Timeout>
<LifeCheck_Interval>    15s                     </LifeCheck_Interval>
<Log_File_Info>
        <File_Name> /usr/local/pgsql/log/pgreplicate.log </File_Name>
        <File_Size> 1M </File_Size>
        <Rotate> 3 </Rotate>
</Log_File_Info>
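
Since the same ports are written in several files, a quick cross-check of the excerpts may be worthwhile. As quoted in this post, both cluster.conf files list Recovery_Port 7040, while pgreplicate.conf lists 7041 for the cluster on port 5441. Whether that is a copy-and-paste slip in the post or has anything to do with either symptom is not established here. The sketch below assumes the tag layout shown above and uses regular expressions rather than an XML parser because the files have no single root element; all function names are made up for illustration.

```python
import re

def _tag(text, name):
    """First value of <name> ... </name> in text, or None."""
    m = re.search(rf"<{name}>\s*(\S+)\s*</{name}>", text)
    return m.group(1) if m else None

def cluster_entries(pgreplicate_conf):
    """(host, port, recovery_port) for each <Cluster_Server_Info> block."""
    blocks = re.findall(
        r"<Cluster_Server_Info>(.*?)</Cluster_Server_Info>",
        pgreplicate_conf, flags=re.S)
    return [(_tag(b, "Host_Name"), _tag(b, "Port"), _tag(b, "Recovery_Port"))
            for b in blocks]

def own_recovery_port(cluster_conf):
    """Top-level Recovery_Port of a cluster.conf, ignoring nested *_Info blocks."""
    top_level = re.sub(r"<(\w+_Info)>.*?</\1>", "", cluster_conf, flags=re.S)
    return _tag(top_level, "Recovery_Port")

def mismatches(pgreplicate_conf, cluster_confs):
    """cluster_confs maps a PostgreSQL port (as str) to that node's cluster.conf text."""
    problems = []
    for host, port, expected in cluster_entries(pgreplicate_conf):
        conf = cluster_confs.get(port)
        if conf is not None and own_recovery_port(conf) != expected:
            problems.append((host, port, expected, own_recovery_port(conf)))
    return problems

# Condensed from the excerpts in this post.
PGREPLICATE_CONF = """
<Cluster_Server_Info>
    <Host_Name> hostB </Host_Name> <Port> 5440 </Port> <Recovery_Port> 7040 </Recovery_Port>
</Cluster_Server_Info>
<Cluster_Server_Info>
    <Host_Name> hostB </Host_Name> <Port> 5441 </Port> <Recovery_Port> 7041 </Recovery_Port>
</Cluster_Server_Info>
"""
CLUSTER_CONF = """
<Replicate_Server_Info>
    <Host_Name> hostC </Host_Name> <Port> 8001 </Port> <Recovery_Port> 8101 </Recovery_Port>
</Replicate_Server_Info>
<Host_Name> hostB </Host_Name>
<Recovery_Port> 7040 </Recovery_Port>
"""
# The post shows identical contents for both nodes' cluster.conf.
CLUSTER_CONFS = {"5440": CLUSTER_CONF, "5441": CLUSTER_CONF}

for host, port, expected, actual in mismatches(PGREPLICATE_CONF, CLUSTER_CONFS):
    print(f"{host}:{port} pgreplicate.conf says {expected}, cluster.conf says {actual}")
```

The mapping from PostgreSQL port to cluster.conf has to be supplied by hand, because cluster.conf itself does not contain the port its postmaster listens on.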

Sorry for the trouble, and thank you in advance for any help.