[pgsql-jp: 41319] Re: How to fail back with streaming replication

Fri, 11 Jan 2013 14:50:53 JST

This is Ikeda.

Mitani-san, thank you for your reply.
The recovery.conf files are as follows.

[primary] * used when restarting
standby_mode = 'on'
primary_conninfo = 'host=localhost port=5433 user=replicator application_name=primary'
restore_command = 'cp -a "/var/lib/pgsql/standby/data/pg_xlog/%f" "%p"'
recovery_target_timeline = 'latest'

[standby]
standby_mode = 'on'
primary_conninfo = 'host=localhost port=5432 user=replicator application_name=standby'

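With this configuration, when I try to fail back from standby to primary, issuing promote to primary does not raise any particular error,
and the processes eventually end up in the following state.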

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres   870  0.0  0.8 217736  9140 ?        S    23:28   0:00 /usr/pgsql-9.2/bin/postmaster -p 5432 -D /var/lib/pgsql/primary/data (current standby)
postgres   876  0.0  0.1 175292  1276 ?        Ss   23:28   0:00  \_ postgres: logger process
postgres   877 46.0  0.2 217820  2136 ?        Rs   23:28   6:23  \_ postgres: startup process   recovering 000000020000000000000004
postgres   882  0.0  0.1 217736  1916 ?        Ss   23:28   0:00  \_ postgres: checkpointer process
postgres   883  0.0  0.1 217736  1748 ?        Ss   23:28   0:00  \_ postgres: writer process
postgres   885  0.0  0.1 177524  1488 ?        Ss   23:28   0:00  \_ postgres: stats collector process
postgres  1057  0.0  0.2 218556  2816 ?        Ss   23:36   0:00  \_ postgres: wal sender process replicator 127.0.0.1(38301) streaming 0/4000080
postgres  1062  0.0  0.2 224404  2996 ?        Ss   23:36   0:00  \_ postgres: wal receiver process
postgres  1041  0.0  0.8 217740  9144 ?        S    23:36   0:00 /usr/pgsql-9.2/bin/postmaster -p 5433 -D /var/lib/pgsql/standby/data (current primary)
postgres  1047  0.0  0.1 175296  1280 ?        Ss   23:36   0:00  \_ postgres: logger process
postgres  1048  0.0  0.2 217824  2132 ?        Ss   23:36   0:00  \_ postgres: startup process   recovering 000000020000000000000004
postgres  1052  0.0  0.1 217740  1556 ?        Ss   23:36   0:00  \_ postgres: checkpointer process
postgres  1053  0.0  0.1 217740  1752 ?        Ss   23:36   0:00  \_ postgres: writer process
postgres  1055  0.0  0.1 177528  1492 ?        Ss   23:36   0:00  \_ postgres: stats collector process
postgres  1056  0.0  0.2 224408  2984 ?        Ss   23:36   0:00  \_ postgres: wal receiver process
postgres  1063  0.0  0.2 218560  2804 ?        Ss   23:36   0:00  \_ postgres: wal sender process replicator 127.0.0.1(38339) streaming 0/4000080
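
A note for reading the listing: a WAL segment file name is 24 hexadecimal digits, and the first 8 of them are the timeline ID. Both startup processes are therefore recovering a segment on timeline 2, the timeline selected by the first promote. A minimal shell sketch for extracting that number (wal_timeline is a hypothetical helper name, not a PostgreSQL tool):

```shell
# wal_timeline: hypothetical helper, not part of PostgreSQL.
# Prints the timeline ID encoded in the first 8 hex digits of a
# 24-character WAL segment file name.
wal_timeline() {
    tli_hex=$(printf '%s' "$1" | cut -c1-8)
    # printf accepts a 0x-prefixed hexadecimal constant for %d
    printf '%d\n' "0x$tli_hex"
}

wal_timeline 000000020000000000000004   # prints 2
```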

The CPU usage of PID 877 (the startup process) on the old primary keeps climbing.
At this point, when I try to check the status with pg_stat_replication, the query fails with the following error on both primary and standby.
--
ERROR:  recovery is in progress
HINT:  WAL control functions cannot be executed during recovery.
--
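
Because that query fails on both servers, a file-level check can serve as a rough substitute. As described in the original message, a successful promote replaces recovery.conf with recovery.done, so which of the two files exists in a data directory shows whether promotion finished. A minimal sketch, assuming the data directory paths of this setup (recovery_state is a hypothetical helper name):

```shell
# recovery_state: hypothetical helper, not part of PostgreSQL.
# Reports which recovery file is present in a data directory.
recovery_state() {
    datadir="$1"
    if [ -f "$datadir/recovery.conf" ]; then
        echo "recovery.conf present: still configured as a standby"
    elif [ -f "$datadir/recovery.done" ]; then
        echo "recovery.done present: promotion completed"
    else
        echo "no recovery file found"
    fi
}

# Paths taken from this setup; adjust for other installations.
recovery_state /var/lib/pgsql/primary/data
recovery_state /var/lib/pgsql/standby/data
```

On a server that accepts connections, SELECT pg_is_in_recovery(); answers the same question from SQL.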

I hope this information is useful.

On 11 January 2013 at 11:48, mitani <mitani ＠ sraw.co.jp> wrote:
> Hello, Ikeda-san.
> This is Mitani in Hiroshima.
>
> Just to be sure, could you share the recovery.conf files
> for both primary and standby?
>
>
> On Fri, 11 Jan 2013 03:10:13 +0900
> Wataru Ikeda <ikeda.wataru ＠ gmail.com> wrote:
>
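>> My name is Ikeda. I have a question.
>> I am testing synchronous replication. After a failover has completed, I try to fail back
>> using the same procedure, but the standby is not promoted to primary even when I issue
>> pg_ctl promote to it. The startup process stays at "recovering xxxx".
>> Is failback not possible with the procedure below?
>> Please let me know if I am missing a step or if there is an alternative procedure.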
>>
>>
>> * Test environment
>> CentOS release 6.3 x86_64
>> PostgreSQL 9.2
>>  primary (-p 5432) and standby (-p 5433) both run on the same host.
>>
>>  - postgresql.conf
>>    --
>>    wal_level = hot_standby
>>    archive_mode = off
>>    max_wal_senders = 2
>>    wal_keep_segments = 32
>>    synchronous_standby_names = '*'
>>    hot_standby = on
>>
>> - recovery.conf
>>    --
>>    standby_mode = 'on'
>>    primary_conninfo = 'host=localhost port=xxxx user=replicator application_name=xxxx'
>>    restore_command = 'cp -a "/var/lib/pgsql/xxxx/data/pg_xlog/%f" "%p"'
>>    recovery_target_timeline = 'latest'
>>
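>> * Failover procedure
>>  0. Synchronous replication between primary (master) and standby.
>>  1. Simulate a crash of primary (stop it in immediate mode).
>>  2. Issue promote to standby. Confirm that the startup process shows waiting and that
>> recovery.conf has been replaced by recovery.done. The log at this point is as follows.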
>>      --
>>      LOG:  received promote request
>>      LOG:  redo done at 0/4000058
>>      LOG:  selected new timeline ID: 2
>>      LOG:  archive recovery complete
>>  3. Place recovery.conf on primary and restart it.
>>      --
>>      LOG:  streaming replication successfully connected to primary
>>      LOG:  standby "primary" is now the synchronous standby with priority 1
>>  4. Confirm that standby has been promoted to master and that synchronous replication continues.
>>
>> * Failback procedure
>>  0. Synchronous replication between standby (master) and primary.
>>  1. Stop standby.
>>  2. Issue promote to primary. The startup process stays in recovering, and
>> recovery.conf is not replaced by recovery.done. The primary log repeats the following.
>>      --
>>      LOG:  received promote request
>>      LOG:  restored log file "000000020000000000000004" from archive
>>      LOG:  record with zero length at 0/4000118
>>      FATAL:  could not connect to the primary server: could not connect to server: Connection refused
>>  3. Place recovery.conf on standby and restart it. Both servers end up in standby mode.
>>      --
>>      [standby]
>>      LOG:  entering standby mode
>>      LOG:  restored log file "000000020000000000000004" from archive
>>      LOG:  consistent recovery state reached at 0/4000118
>>      LOG:  record with zero length at 0/4000118
>>      LOG:  record with zero length at 0/4000118
>>      LOG:  database system is ready to accept read only connections
>>      LOG:  streaming replication successfully connected to primary
>>      [primary]
>>      LOG:  streaming replication successfully connected to primary
>>  4. Confirm that both servers are acting as master (?) and are also in standby mode.
>>
>
>
> --
> mitani <mitani ＠ sraw.co.jp>