[pgcluster: 978] Re: pglb crashes or loses track of the clusters (?) during failover testing

Thu, Apr 5 2007 12:19:22 JST

Hello, this is Nakano.

Thank you for your reply, Mitani-san.

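I am not using SSL.
Just to be sure, I removed the SSL option, re-ran configure and recompiled, but the behavior did not change.
Replacing every host name in the conf files with an IP address made no difference either.

As a last resort I was prepared to give up connection pooling, so I tried a failover with pooling off. After I brought down the first cluster, a connection to the second cluster server succeeded exactly once; after that, every scan reports "0 ClusterDB can be used" and pglb no longer accesses any cluster.
The pglb log after bringing down the first cluster was as follows.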
2007-04-05 11:49:46 [11707] DEBUG:set_recovery():received no:101
2007-04-05 11:49:46 [11707] DEBUG:PGRset_status_on_cluster_tbl():host:172.xx.xx.xxx port:5440 max:2 use:2 status99
        ----- Connecting to pglb with psql succeeded.
        ----- The SELECT statement ran successfully.
2007-04-05 11:50:18 [11706] DEBUG:PGRscan_cluster:1 ClusterDB can be used
2007-04-05 11:50:18 [11706] DEBUG:PGRscan_cluster:172.24.178.193 [5440],useFlag->99 max->2 use_num->0
2007-04-05 11:50:18 [11706] DEBUG:PGRscan_cluster:172.24.178.193 [5441],useFlag->1 max->2 use_num->0
2007-04-05 11:50:18 [11706] DEBUG:PGRset_status_on_cluster_tbl():host:172.xx.xx.xxx port:5441 max:2 use:1 status2
2007-04-05 11:50:18 [11730] DEBUG:PGRdo_child():I am 11730
2007-04-05 11:50:18 [11730] DEBUG:do_accept():I am 11730 accept fd 6
2007-04-05 11:50:18 [11730] DEBUG:read_startup_packet():Protocol Major: 3 Minor: 0 database: master user: postgres
(snip)
2007-04-05 11:51:58 [11730] DEBUG:Query(): select * from project;
2007-04-05 11:51:58 [11730] DEBUG:pool_process_query():read kind from backend T
2007-04-05 11:51:58 [11730] DEBUG:pool_process_query():read kind from backend pending data D len: 226 po: 795
2007-04-05 11:51:58 [11730] DEBUG:pool_process_query():read kind from backend pending data C len: 17 po: 1004
2007-04-05 11:51:58 [11730] DEBUG:pool_process_query():read kind from backend pending data Z len: 5 po: 1016
2007-04-05 11:51:58 [11730] DEBUG:ReadyForQuery(): message length: 5
2007-04-05 11:51:58 [11730] DEBUG:ReadyForQuery(): transaction state: I
        ----- After I quit psql, every scan from here on reports 0 ClusterDB can be used.
2007-04-05 11:52:01 [11730] DEBUG:ProcessFrontendResponse():read kind from frontend X(58)
2007-04-05 11:52:09 [11706] DEBUG:PGRscan_cluster:0 ClusterDB can be used
2007-04-05 11:52:09 [11706] DEBUG:PGRscan_cluster:172.xx.xx.xxx [5440],useFlag->99 max->2 use_num->0

2007-04-05 11:52:09 [11706] DEBUG:PGRscan_cluster:0 ClusterDB can be used
2007-04-05 11:52:09 [11706] DEBUG:PGRscan_cluster:172.24.178.193 [5440],useFlag->99 max->2 use_num->0
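For following the scan results across a longer log, a minimal Python sketch such as the one below could group each "N ClusterDB can be used" line with the per-server entries that follow it. The regular expressions are inferred only from the excerpt above, so they are an assumption about the pglb debug format rather than a documented one, and the function name `summarize_scans` is hypothetical. The meaning of the individual useFlag values is not interpreted here.

```python
import re

# Patterns inferred from the log excerpt above (an assumption, not a
# documented format); other pglb versions may print these lines differently.
COUNT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[\d+\] "
    r"DEBUG:PGRscan_cluster:(?P<count>\d+) ClusterDB can be used"
)
ENTRY_RE = re.compile(
    r"DEBUG:PGRscan_cluster:(?P<host>\S+) \[(?P<port>\d+)\],"
    r"useFlag->(?P<flag>\d+) max->(?P<max>\d+) use_num->(?P<use>\d+)"
)


def summarize_scans(lines):
    """Group each usable-count line with the server entries that follow it."""
    scans = []
    for line in lines:
        m = COUNT_RE.search(line)
        if m:
            scans.append({"ts": m["ts"], "usable": int(m["count"]), "entries": []})
            continue
        m = ENTRY_RE.search(line)
        if m and scans:
            # Each entry is (host, port, useFlag) exactly as printed in the log.
            scans[-1]["entries"].append((m["host"], int(m["port"]), int(m["flag"])))
    return scans


if __name__ == "__main__":
    # Sample lines adapted from the excerpt above, with hosts masked.
    sample = [
        "2007-04-05 11:50:18 [11706] DEBUG:PGRscan_cluster:1 ClusterDB can be used",
        "2007-04-05 11:50:18 [11706] DEBUG:PGRscan_cluster:172.xx.xx.xxx [5440],useFlag->99 max->2 use_num->0",
        "2007-04-05 11:50:18 [11706] DEBUG:PGRscan_cluster:172.xx.xx.xxx [5441],useFlag->1 max->2 use_num->0",
        "2007-04-05 11:52:09 [11706] DEBUG:PGRscan_cluster:0 ClusterDB can be used",
        "2007-04-05 11:52:09 [11706] DEBUG:PGRscan_cluster:172.xx.xx.xxx [5440],useFlag->99 max->2 use_num->0",
    ]
    for scan in summarize_scans(sample):
        print(scan["ts"], "usable:", scan["usable"], scan["entries"])
```

On the sample lines, the first scan lists both the 5440 and 5441 entries, while the second scan lists only the 5440 entry.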

I am sorry for the trouble, and thank you in advance.


> -----Original Message-----
> From: a.mitani ＠ sra-europe.com
> Sent: Wed, Apr 4 2007 23:09:57 JST
> To: PGCluster mailing list
> Subject: [pgcluster: 977] Re: pglb crashes or loses track of the clusters (?) during failover testing
> 
> Hello.
> 
> The load balancer does not support SSL.
> Are you by any chance connecting over SSL?
> 
> Mitani, in the Netherlands
> 
> 
> 
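> > Good evening. This is my first post; my name is Nakano.
> > I would like to use PGCluster in our in-house environment and am
> > verifying its behavior, but in the cluster failover tests the load
> > balancer does not behave as I expect.
> > I assume my configuration is wrong. Could someone please advise?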
> >
> > Environment)
> > OS: RedHat Enterprise Linux WS4
> > PGCluster Ver.:1.5.0rc16
> >
> > Options passed to configure)
> > $ configure --enable-thread-safety --enable-nls=ja
> > --enable-multibyte=UNICODE --with-perl --with-python --with-tcl
> > --with-openssl
> >
> > I prepared three hosts: one runs pglb, one runs pgreplicate, and the
> > remaining one runs two cluster servers.
> >    hostA             hostB            hostC
> > ------------   ----------------   -------------
> > |          |   |              |   |           |
> > |          |   | ------------ |   |           |
> > |          |---| |postmaster| |---|           |
> > |   pglb   |   | ------------ |   |pgreplicate|
> > |(port5432)|   |  (port5440)  |   |           |
> > |          |   |              |   |           |
> > |          |   | ------------ |   |           |
> > |          |---| |postmaster| |---|           |
> > |          |   | ------------ |   |           |
> > |          |   |  (port5441)  |   |           |
> > |          |   |              |   |           |
> > ------------   ----------------   -------------
> >
> > Startup procedure)
> > I start the two clusters, then the replicator, then the balancer, in that order.
> >   hostB% pg_ctl start -D /usr/local/pgsql/data
> >   hostB% pg_ctl start -D /usr/local/pgsql/data2
> >   hostC$ pgreplicate -D /usr/local/pgsql/etc -l
> >   hostA# pglb -D /usr/local/pgsql/etc -n -v -l
> >
> > I connected to pglb on hostA with psql and confirmed that reads and updates work normally.
> >
> >
> > Symptom 1)
> > With connection pooling turned on in pglb, if I bring down the cluster
> > server that is listed first in pglb.conf/pgreplicate.conf (5440 ＠ hostB),
> > pglb on hostA crashes a few minutes later.
> > pglb log (debug mode):
> > 2007-04-03 21:08:49 [27163]
> > DEBUG:PGRset_status_on_cluster_tbl():host:hostB port:5440 max:2 use:0
> > status1
> > 2007-04-03 21:08:49 [27163]
> > DEBUG:PGRset_status_on_cluster_tbl():host:hostB port:5441 max:2 use:0
> > status1
> > 2007-04-03 21:08:49 [27163] DEBUG:init_pglb():Child_Tbl size is[144]
> > 2007-04-03 21:08:49 [27163] DEBUG:PGRcreate_child():create child
> > [5440 ＠ hostB]
> > 2007-04-03 21:08:49 [27163] DEBUG:PGRcreate_child():create child
> > [5440 ＠ hostB]
> > 2007-04-03 21:08:49 [27163] DEBUG:PGRcreate_child():create child
> > [5441 ＠ hostB]
> > 2007-04-03 21:08:49 [27163] DEBUG:PGRcreate_child():create child
> > [5441 ＠ hostB]
> >         --- At this point I ran the following against 5440 ＠ hostB:
> >             pg_ctl stop -D /usr/local/pgsql/data -m i
> > 2007-04-03 21:09:56 [27164] DEBUG:set_recovery():received no:101
> > 2007-04-03 21:09:56 [27164]
> > DEBUG:PGRset_status_on_cluster_tbl():host:hp8193 port:5440 max:2 use:1
> > status99
> > 2007-04-03 21:09:56 [27163] ERROR:scan_cluster_by_pid():pid:27164 not
> > found in child table
> > 2007-04-03 21:09:56 [27163] ERROR:scan_cluster_by_pid():pid:27166 not
> > found in child table
> > ---- pglb crashes here.
> > The pgreplicate DEBUG log at that time:
> > 2007-04-03 21:09:56 [9807] ERROR:PGRcreateConn():Retry. h_errno is
> > 0,reason is 'could not connect to server: Connection refused
> >         Is the server running on host "172.xx.xx.xxx" and accepting
> >         TCP/IP connections on port 5440?
> > '
> > 2007-04-03 21:09:56 [9807] ERROR:PGRcreateConn():Retry. h_errno is
> > 0,reason is 'could not connect to server: Connection refused
> >         Is the server running on host "172.xx.xx.xxx" and accepting
> >         TCP/IP connections on port 5440?
> > '
> > 2007-04-03 21:09:56 [9807] DEBUG:PGRsend_load_balance_packet():host[hostA]
> > port[6001]
> >
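> > After I restart pglb and start 5440 ＠ hostB in recovery mode, it
> > synchronizes via rsync and then becomes available for normal operation.
> > When I bring down the second cluster (5441 ＠ hostB), the balancer does not crash.
> > I would like the balancer to stay up when the first cluster goes down,
> > so that operation continues on the second cluster.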
> >
> >
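> > Symptom 2)
> > With connection pooling turned off in pglb, if I bring down the cluster
> > server that is listed first in pglb.conf/pgreplicate.conf (5440 ＠ hostB),
> > as in Symptom 1), and then connect to pglb with psql, I get the
> > following error and cannot continue operation.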
> > psql: server closed the connection unexpectedly
> >         This probably means the server terminated abnormally
> >         before or while processing the request.
> > Starting pglb in debug mode shows that, even after it recognizes that
> > the cluster server has gone down, it always seems to access only the
> > same cluster (5440 ＠ hostB).
> > 2007-04-03 20:10:28 [7957] DEBUG:PGRscan_cluster:0 ClusterDB can be used
> > 2007-04-03 20:10:28 [7957] DEBUG:PGRscan_cluster:hostB [5440],useFlag->2
> > max->2 use_num->0
> > 2007-04-03 20:10:28 [8061] DEBUG:PGRdo_child():I am 8061
> > 2007-04-03 20:10:28 [8061] DEBUG:do_accept():I am 8061 accept fd 6
> > 2007-04-03 20:10:28 [8061] DEBUG:read_startup_packet():Protocol Major:
> > 1234
> > Minor: 5679 database:  user:
> > 2007-04-03 20:10:28 [8061] DEBUG:PGRdo_child():SSLRequest: sent N; retry
> > startup
> > 2007-04-04 20:10:28 [8061] DEBUG:read_startup_packet():Protocol
> > Major:
> > 3 Minor: 0 database: master user: postgres
> > 2007-04-03 20:10:28 [8061] ERROR:connect_inet_domain_socket(): connect()
> > failed: Connection refused
> > 2007-04-03 20:10:28 [8061] DEBUG:PGRset_status_on_cluster_tbl():host:hostB
> > port:5440 max:2 use:2 status98
> > 2007-04-03 20:10:40 [7957] ERROR:load_balance_main():all clusters were
> > dead.
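> > Looking closely at the log, messages such as
> > "0 ClusterDB can be used" and "all clusters were dead."
> > suggest that pglb mistakenly believes every cluster is down.
> > (I confirmed that the second cluster server is alive by connecting to it directly with psql.)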
> >
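> > Suspecting that running multiple clusters on the same host was the
> > cause, I set up a new host, hostD, and moved the second cluster there,
> > but both cases show the same symptoms.
> > The behavior is the same whether the host names in the conf files are
> > bare host names without a domain or FQDNs.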
> >
> >
> > pglb.conf on hostA ---
> > <Cluster_Server_Info>
> >     <Host_Name>   hostB </Host_Name>
> >     <Port>        5440  </Port>
> >     <Max_Connect> 2     </Max_Connect>
> > </Cluster_Server_Info>
> > <Cluster_Server_Info>
> >     <Host_Name>   hostB </Host_Name>
> >     <Port>        5441  </Port>
> >     <Max_Connect> 2     </Max_Connect>
> > </Cluster_Server_Info>
> > <Host_Name>   hostA  </Host_Name>
> > <Backend_Socket_Dir>  /tmp </Backend_Socket_Dir>
> > <Receive_Port>        5432 </Receive_Port>
> > <Recovery_Port>       6001 </Recovery_Port>
> > <Max_Cluster_Num>     3    </Max_Cluster_Num>
> > <Use_Connection_Pooling> no </Use_Connection_Pooling>  # set to yes for Symptom 1
> > <LifeCheck_Timeout>      3s </LifeCheck_Timeout>
> > <LifeCheck_Interval>    15s </LifeCheck_Interval>
> > <Log_File_Info>
> >         <File_Name> /usr/local/pgsql/log/pglb.log </File_Name>
> >         <File_Size> 1M </File_Size>
> >         <Rotate> 3 </Rotate>
> > </Log_File_Info>
> >
> >
> > cluster.conf for 5440 ＠ hostB -----
> > <Replicate_Server_Info>
> >         <Host_Name> hostC </Host_Name>
> >         <Port> 8001 </Port>
> >         <Recovery_Port> 8101 </Recovery_Port>
> > </Replicate_Server_Info>
> > <Host_Name> hostB </Host_Name>
> > <Recovery_Port> 7040 </Recovery_Port>
> > <Rsync_Path> /usr/bin/rsync </Rsync_Path>
> > <Rsync_Option> ssh -1 </Rsync_Option>
> > <Rsync_Compress> yes </Rsync_Compress>
> > <Pg_Dump_Path> /usr/local/pgsql/bin/pg_dump </Pg_Dump_Path>
> > <When_Stand_Alone> read_only </When_Stand_Alone>
> > <Replication_Timeout>   1min </Replication_Timeout>
> > <LifeCheck_Timeout> 3s </LifeCheck_Timeout>
> > <LifeCheck_Interval> 11s </LifeCheck_Interval>
> >
> >
> > cluster.conf for 5441 ＠ hostB -----
> > <Replicate_Server_Info>
> >         <Host_Name> hostC </Host_Name>
> >         <Port> 8001 </Port>
> >         <Recovery_Port> 8101 </Recovery_Port>
> > </Replicate_Server_Info>
> > <Host_Name> hostB </Host_Name>
> > <Recovery_Port> 7040 </Recovery_Port>
> > <Rsync_Path> /usr/bin/rsync </Rsync_Path>
> > <Rsync_Option> ssh -1 </Rsync_Option>
> > <Rsync_Compress> yes </Rsync_Compress>
> > <Pg_Dump_Path> /usr/local/pgsql/bin/pg_dump </Pg_Dump_Path>
> > <When_Stand_Alone> read_only </When_Stand_Alone>
> > <Replication_Timeout>   1min </Replication_Timeout>
> > <LifeCheck_Timeout> 3s </LifeCheck_Timeout>
> > <LifeCheck_Interval> 11s </LifeCheck_Interval>
> >
> >
> > pgreplicate.conf on hostC ---
> > <Cluster_Server_Info>
> >     <Host_Name>     hostB </Host_Name>
> >     <Port>          5440  </Port>
> >     <Recovery_Port> 7040  </Recovery_Port>
> > </Cluster_Server_Info>
> > <Cluster_Server_Info>
> >     <Host_Name>     hostB </Host_Name>
> >     <Port>          5441  </Port>
> >     <Recovery_Port> 7041  </Recovery_Port>
> > </Cluster_Server_Info>
> > <LoadBalance_Server_Info>
> >         <Host_Name>     hostA </Host_Name>
> >         <Recovery_Port> 6001  </Recovery_Port>
> > </LoadBalance_Server_Info>
> > <Host_Name> hostC </Host_Name>
> > <Replication_Port>              8001            </Replication_Port>
> > <Recovery_Port>                 8101            </Recovery_Port>
> > <RLOG_Port>                     8301            </RLOG_Port>
> > <Response_Mode>                 normal          </Response_Mode>
> > <Use_Replication_Log>   no                      </Use_Replication_Log>
> > <Replication_Timeout>   1min                    </Replication_Timeout>
> > <LifeCheck_Timeout>     3s                      </LifeCheck_Timeout>
> > <LifeCheck_Interval>    15s                     </LifeCheck_Interval>
> > <Log_File_Info>
> >         <File_Name> /usr/local/pgsql/log/pgreplicate.log </File_Name>
> >         <File_Size> 1M </File_Size>
> >         <Rotate> 3 </Rotate>
> > </Log_File_Info>
> >
> > I am sorry for the trouble, and thank you in advance.
> >
>
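
The quoted conf files can also be cross-checked mechanically. The sketch below is only an illustration: it assumes (this is not stated anywhere in the thread) that the Recovery_Port in each cluster.conf is expected to agree with the Recovery_Port listed for that server in pgreplicate.conf, and it uses a small regex reader written for the tag-style format quoted above, not any PGCluster tool. All function names are hypothetical.

```python
import re

# Minimal reader for the tag-style conf format quoted above.
# Assumptions: values never contain '<', and *_Info sections are not nested.
SECTION_RE = re.compile(r"<Cluster_Server_Info>(.*?)</Cluster_Server_Info>", re.S)
TAG_RE = re.compile(r"<(\w+)>\s*([^<]*?)\s*</\1>")


def top_level_tags(text):
    """Return {tag: value} for tags that sit outside any *_Info section."""
    stripped = re.sub(r"<(\w+_Info)>.*?</\1>", "", text, flags=re.S)
    return dict(TAG_RE.findall(stripped))


def replicator_entries(text):
    """Return {(host, port): recovery_port} from pgreplicate.conf text."""
    entries = {}
    for body in SECTION_RE.findall(text):
        tags = dict(TAG_RE.findall(body))
        entries[(tags["Host_Name"], tags["Port"])] = tags["Recovery_Port"]
    return entries


def recovery_port_mismatches(pgreplicate_text, cluster_confs):
    """cluster_confs maps (host, port) to the cluster.conf text of that server."""
    expected = replicator_entries(pgreplicate_text)
    problems = []
    for key, conf_text in cluster_confs.items():
        actual = top_level_tags(conf_text).get("Recovery_Port")
        if expected.get(key) != actual:
            problems.append((key, expected.get(key), actual))
    return problems


# Values copied from the quoted conf files, trimmed to the relevant tags.
PGREPLICATE = """
<Cluster_Server_Info>
    <Host_Name>     hostB </Host_Name>
    <Port>          5440  </Port>
    <Recovery_Port> 7040  </Recovery_Port>
</Cluster_Server_Info>
<Cluster_Server_Info>
    <Host_Name>     hostB </Host_Name>
    <Port>          5441  </Port>
    <Recovery_Port> 7041  </Recovery_Port>
</Cluster_Server_Info>
"""
CLUSTER_CONF = """
<Replicate_Server_Info>
        <Host_Name> hostC </Host_Name>
        <Port> 8001 </Port>
        <Recovery_Port> 8101 </Recovery_Port>
</Replicate_Server_Info>
<Host_Name> hostB </Host_Name>
<Recovery_Port> 7040 </Recovery_Port>
"""
CONFS = {("hostB", "5440"): CLUSTER_CONF, ("hostB", "5441"): CLUSTER_CONF}

if __name__ == "__main__":
    for key, expected, actual in recovery_port_mismatches(PGREPLICATE, CONFS):
        print(key, "pgreplicate.conf:", expected, "cluster.conf:", actual)
```

With the values as quoted, the check flags 5441 ＠ hostB (7041 in pgreplicate.conf, 7040 in its cluster.conf). This may simply be a copy-and-paste slip in the mail, and whether it has anything to do with the symptoms is not established here.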