[hackers-jp: 115] Fw: [HACKERS] Tuple sampling

TANIDA Yutaka tanida @ sra.co.jp
Fri, 23 Apr 2004 17:31:35 JST


This is Tanida.  I haven't had time to read mail since around the time the
working group meeting ended, so I'm now working through the mountain of
unread messages.

What I'm forwarding is a proposal for a new sampling method.  From a quick
look at the test results, it seems slightly slower than before, but the
estimates appear to have improved considerably.  I'm looking forward to
this.


Forwarded by TANIDA Yutaka <tanida @ sra.co.jp>
----------------------- Original Message -----------------------
 From:    Manfred Koizar <mkoi-pg @ aon.at>
 To:      pgsql-hackers @ postgresql.org
 Date:    Mon, 19 Apr 2004 12:30:45 +0200
 Subject: [HACKERS] Tuple sampling
----

The proposed new sampling method
(http://archives.postgresql.org/pgsql-hackers/2004-04/msg00036.php and
http://archives.postgresql.org/pgsql-patches/2004-04/msg00045.php)
basically incorporates two independent changes:

(1) Two-stage sampling:  Stage one collects a random sample of pages,
stage two collects a random sample of tuples out of these pages.

(2) Improved row count estimation:  Once a page is read in, all tuples
on this page are looked at to see whether they are active (visible,
live) or dead.  The estimated row count is calculated from the number of
active tuples seen, the number of visited pages, and the total number of
pages.
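
As a rough illustration of how the two changes fit together, here is a
minimal sketch in plain Python (the page representation and names are
mine, not the patch's, and the page selection here is a naive one):

    import random

    def two_stage_sample(pages, sample_size):
        """pages: list of pages, each a list of tuples with a 'live' flag.
        Returns (sample of live tuples, estimated live row count)."""
        total_pages = len(pages)

        # Stage one: a random sample of pages (naive selection without
        # replacement; not necessarily how the patch selects them).
        n_pages = min(sample_size, total_pages)
        chosen = random.sample(range(total_pages), n_pages)

        live_seen = 0
        candidates = []
        for pageno in chosen:
            # Change (2): look at *all* tuples on each visited page and
            # count the active (visible, live) ones.
            live = [t for t in pages[pageno] if t["live"]]
            live_seen += len(live)
            candidates.extend(live)

        # Stage two: a random sample of tuples out of the visited pages.
        if len(candidates) > sample_size:
            sample = random.sample(candidates, sample_size)
        else:
            sample = candidates

        # Estimated row count: live tuples seen per visited page, scaled
        # up to the total number of pages.
        estimated_rows = live_seen / n_pages * total_pages if n_pages else 0
        return sample, estimated_rows

Method 2 of the preview patch then corresponds to doing the two sampling
stages without the per-page counting of live tuples.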

The preview patch implements three sampling methods:  Method 0 is the
original sampling method;  method 1 uses both changes;  and method 2 is
sort of change (1) without change (2).

Theoretically we could apply only the second change, if we come to the
conclusion that we do not want two-stage sampling.  But I don't have
test results for this.

I'd like to present my *interpretation* of the results I got from
testing various sampling methods.  Whoever is interested in looking at
the raw test results can get them from
http://www.pivot.at/pg/SamplePerf2.sxc or (for those still without OOo)
http://www.pivot.at/pg/SamplePerf2.html.

This set of tests measures only the sampling times; actually analysing
the sample and storing the results in pg_statistic has been disabled.
To eliminate as many sources of noise as possible, the following steps
have been taken prior to each test run (a sketch of a harness automating
them follows the list):

  . SELECT count(*) FROM <testtable>;  
  to make sure that all visibility bits in the tuple headers are set
  correctly,

  . tar cf /dev/null <something>
  to fill the OS cache with unrelated stuff,

  . SELECT count(*) FROM <othertable>;
  to fill the shared buffers with unrelated data,

  . CHECKPOINT;
  to avoid a checkpoint happening during the test run.
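
A hypothetical harness for these steps could look like the following
(database, table and directory names are placeholders, not taken from
the original test setup):

    import subprocess

    def prepare_run(testtable="x3", othertable="y3",
                    junk_dir="/usr/share/doc", dbname="regression"):
        """Run the noise-elimination steps before one timed sampling run."""
        def run_sql(sql):
            subprocess.run(["psql", "-c", sql, dbname], check=True)

        # Make sure all visibility bits in the tuple headers are set.
        run_sql("SELECT count(*) FROM %s;" % testtable)
        # Fill the OS cache with unrelated stuff.
        subprocess.run(["tar", "cf", "/dev/null", junk_dir], check=True)
        # Fill the shared buffers with unrelated data.
        run_sql("SELECT count(*) FROM %s;" % othertable)
        # Avoid a checkpoint happening during the test run.
        run_sql("CHECKPOINT;")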

The tests have been performed with two types of tables.  The tables
named x, x2, x3, and x3d were initially copied from tenk1 in the
regression database.  These tables have between 20 and 30 tuples per
page.  Tables of the other type -- named y, y3, y3d -- have much smaller
tuples, ca. 150 per page.


Method 1 vs. method 2:

With large tuples, method 2 is sometimes faster and sometimes slower
than method 1, but never by enough to worry about.  This has been tested
for statistics targets 10 and 100.

Small tuples, statistics target 10 to 100:  No reproducible difference
up to 490 pages (80000 tuples).  Starting at 1900 pages (320000 tuples)
method 2 is consistently faster.  The difference can easily be explained
by the hundreds of thousands or even millions of heap_fetch() calls
needed to examine every tuple on each visited page.

OTOH with method 2 we get row count estimation errors of up to 20%
compared to less than 1% for method 1.  So I conclude that change (2) is
well worth the CPU cycles it costs.

Method 0 produced estimation errors of up to 60%.


What about two-stage sampling?

Comparing the speeds of method 0 and method 2 we get the following
pattern which is basically the same for all statistics targets I tried.

Sample size 3000 (30000):
For tables smaller than 1000 (5000) pages all sampling methods access
all or almost all pages and there is hardly any runtime difference.  For
table sizes between 1000 (5000) and 3000 (30000) the new method reads
the whole table while the old method starts to skip pages.  This doesn't
result in a significant runtime difference, however.  If the table size
grows beyond 3000 (30000) pages, the number of page reads continues to
grow only for the old method, but up to ca. 30000 (300000) pages the new
method is not much faster (if at all).  For tables larger than 30000
(300000) pages two-stage sampling is reliably faster.


Other pros and cons of two-stage sampling

Pro and con:  Cache pollution.  Two-stage sampling reads the same number
of pages as one-stage sampling for small tables, slightly more for
medium sized tables (worst case is ca. 15%), and a little less to far
less for large to very large tables (3000 vs. 20000 for 430000 pages
with statistics target 10,  30000 vs. 98000 for the same table with
statistics target 100).

Con:  In Tom's words "There's still a risk of not being able to collect
N rows out of N blocks, if you are unfortunate enough to select a lot of
wholly-empty pages.  But that seems like a low-probability scenario;
besides such a table would be so desperately in need of VACUUM FULL that
the possible low quality of the stats hardly matters."

Con:  The sample generated is not perfectly random (cf. discussion
starting at
http://archives.postgresql.org/pgsql-performance/2004-04/msg00196.php).
If somebody has an idea how we could counteract that effect of
collecting tuples from too few different pages, I'd be happy to
implement it.  If nothing pops up, I think that the current consensus is
that we don't care.

Anything else?

Servus
 Manfred

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster


--------------------- Original Message Ends --------------------

-- 
TANIDA Yutaka <tanida @ sra.co.jp>



