[hackers-jp: 197] [Fwd: Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes]

Hiroki Kataoka kataoka @ interwiz.jp
Mon Jul 25 16:27:21 JST 2005


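This is Kataoka.

 I wonder what the replies to the mail below will look like. I'm looking
 forward to them.

 By the way, will this O_DIRECT patch end up being specific to Linux kernel 2.6?
 I ask because it did not work when I tried the original patch on Linux kernel 2.4.
 Does gather-write also help when O_DIRECT is not used?
 Sorry, this has turned into nothing but questions.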

-------- Original Message --------
Subject: Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes
Date: Sat, 23 Jul 2005 13:32:30 -0400 (EDT)
From: Bruce Momjian <pgman @ candle.pha.pa.us>
To: ITAGAKI Takahiro <itagaki.takahiro @ lab.ntt.co.jp>
CC: Tom Lane <tgl @ sss.pgh.pa.us>,        PostgreSQL-patches
<pgsql-patches @ postgresql.org>


I have modified and attached your patch for your review.  I didn't see
any value to adding new fsync_method values because, to me, O_DIRECT is
basically just like O_SYNC except it doesn't keep a copy of the buffer
in the kernel cache.  If you are doing fsync(), I don't see how O_DIRECT
makes any sense, because O_DIRECT is writing to disk on every write, and
then what is the fsync() actually doing?  This might explain why your
fsync/direct and open/direct performance numbers are almost identical.
Basically, if you are going to use O_DIRECT, why not use open_sync?

What I did was to add O_DIRECT unconditionally for all uses of O_SYNC
and O_DSYNC, so it is automatically used in those cases.  And of course,
if your operating system doesn't support O_DIRECT, it isn't used.
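
The "use O_DIRECT only where the platform has it" idea described above
can be sketched roughly as follows. This is a minimal illustration, not
the code from the attached patch: the macro WAL_O_DIRECT and the helper
wal_open_flags() are hypothetical names, and the sketch assumes that
O_DIRECT, when available, is exposed by <fcntl.h> as a macro.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#include <fcntl.h>

/* Hypothetical macro: O_DIRECT where the platform defines it, else 0. */
#ifdef O_DIRECT
#define WAL_O_DIRECT O_DIRECT
#else
#define WAL_O_DIRECT 0
#endif

/*
 * Hypothetical helper: add O_DIRECT only when the caller already asked
 * for synchronous writes via O_SYNC or O_DSYNC; leave other flags alone.
 */
static int
wal_open_flags(int base_flags)
{
    int         sync_bits = O_SYNC;

#ifdef O_DSYNC
    sync_bits |= O_DSYNC;
#endif

    if (base_flags & sync_bits)
        return base_flags | WAL_O_DIRECT;
    return base_flags;
}
```

On a platform without O_DIRECT the macro expands to 0, so the flags
pass through unchanged and no runtime check is needed.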

With your posted performance numbers, perhaps we should favor
fsync_method O_SYNC on platforms that have O_DIRECT even if we don't
support OPEN_DATASYNC, but I bet most platforms that have O_DIRECT also
have O_DSYNC.  Perhaps some folks can run tests once the patch is
applied.

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> Tom Lane <tgl @ sss.pgh.pa.us> wrote:
> 
> > Yeah, this is about what I was afraid of: if you're actually fsyncing
> > then you get at best one commit per disk revolution, and the negotiation
> > with the OS is down in the noise.
> 
> If we disable the writeback cache and use open_sync, the per-page write
> behavior in the WAL module will show up as poor performance. O_DIRECT is
> similar to O_DSYNC (at least on Linux), so its benefit will disappear
> behind the slow disk revolution.
> 
> In the current source, WAL is written as:
>     for (i = 0; i < N; i++) { write(fd, &buffers[i], BLCKSZ); }
> Is this intentional? Can we rewrite it as follows?
>     write(fd, &buffers[0], N * BLCKSZ);
> 
> In order to achieve it, I wrote a 'gather-write' patch (xlog.gw.diff).
> Aside from this, I'll also send the fixed direct io patch (xlog.dio.diff).
> These two patches are independent, so either or both can be applied.
> 
> 
> I tested them on my machine and the results are as follows. They show that
> direct-io plus gather-write is the best choice when writeback-cache is off.
> Are these two patches worth trying if they are used together?
> 
> 
>             | writeback | fsync= | fdata | open_ | fsync_ | open_ 
> patch       | cache     |  false |  sync |  sync | direct | direct
> ------------+-----------+--------+-------+-------+--------+---------
> direct io   | off       |  124.2 | 105.7 |  48.3 |   48.3 |  48.2 
> direct io   | on        |  129.1 | 112.3 | 114.1 |  142.9 | 144.5 
> gather-write| off       |  124.3 | 108.7 | 105.4 |  (N/A) | (N/A) 
> both        | off       |  131.5 | 115.5 | 114.4 |  145.4 | 145.2 
> 
> - 20runs * pgbench -s 100 -c 50 -t 200
>    - with tuning (wal_buffers=64, commit_delay=500, checkpoint_segments=8)
> - using 2 ATA disks:
>    - hda(reiserfs) includes system and wal.
>    - hdc(jfs) includes database files. writeback-cache is always on.
> 
> ---
> ITAGAKI Takahiro
> NTT Cyber Space Laboratories
> 

[ Attachment, skipping... ]

[ Attachment, skipping... ]


-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman @ candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


-- 
Hiroki Kataoka <kataoka @ interwiz.jp>


