sampling in update statistics

Discussion:

(too old to reply)

unknown

2010-01-28 13:45:46 UTC

Hi,
I have some hundreds millions tables and update statistics
takes long time (days). I would like to implement 'update
statistics with sampling = x percent' but I have some
questions:
1) what is the difference between x=100% and normal 'update
statistics' (is it the same?)
2) what does it mean if x=0% (no statistics?)
3) after the command is executed where I can find the
information about the value x used in the command?
4) giving the number of records what is a possible rule to
decide the value of x?

mpeppler@peppler.org [Team Sybase]

2010-01-28 16:14:47 UTC

Permalink

Sampling only applies to the "update index statistics" command, and
only applies to the non-leading columns of indexes.

The leading column is already sorted, so the distribution can be
computed quickly by a single traversal of the index. However, for non-
leading columns you need to gather the data and sort it to generate
the histograms - this is where the sampling will reduce run time and
reduce the amount of tempdb space that is needed.

For your point 4) - the answer is "it depends". I've used various
rules that I've mostly made up of thin air, and just thought made kind
of sense with respect to the size of the databases/servers involved.

Michael

Post by unknown
Hi,
I have some hundreds millions tables and update statistics
takes long time (days). I would like to implement 'update
statistics with sampling = x percent' but I have some
1) what is the difference between x=100% and normal 'update
statistics' (is it the same?)
2) what does it mean if x=0% (no statistics?)
3) after the command is executed where I can find the
information about the value x used in the command?
4) giving the number of records what is a possible rule to
decide the value of x?

Sherlock, Kevin [TeamSybase]

2010-01-28 22:23:42 UTC

Permalink

Actually,
Sampling only applies to update stat commands that use column level syntax
like:

update statistics mytable (mycol) with sampling = 20 percent

Even if "mycol" is a column anywhere (leading or otherwise) in any index.
Otherwise, the "with sampling" clause is ignored. Such as:

-- only one index mytable.myindex has "mycol" only
-- sampling ignored here
update index statistics mytable with sampling = 20 percent

-- also ignored here
update statistics mytable with sampling = 20 percent

there is probably all kinds of other exceptions with regards to
partitioning, but this is a general truth.

Post by unknown
1) what is the difference between x=100% and normal 'update

statistics' (is it the same?)

Yes, the same effect

Post by unknown
2) what does it mean if x=0% (no statistics?)

No, it's the same as sampling = 100 percent

Post by unknown
3) after the command is executed where I can find the
information about the value x used in the command?

optdiag output will display this for the column as:
"Sampling Percent: nn"

Post by unknown
4) giving the number of records what is a possible rule to
decide the value of x?

As Michael mentioned, there is no such rule. It's the lowest number
possible such that you get adequate query plans generated to meet your
performance goals. Note, that (especially for non-all-pages locked tables)
sampling rates requested don't always directly relate to have many
rows/pages are actually sampled during the execution of update stats. I've
posted about this topic before a while back and there have been some
adjustements made in the algorithms inside ASE since, so this depends on
your version. I'll try to find that post and repost here.

"***@peppler.org [Team Sybase]" <***@gmail.com> wrote in
message news:9cbcd5c8-5ceb-4060-835e-***@m25g2000yqc.googlegroups.com...
Sampling only applies to the "update index statistics" command, and
only applies to the non-leading columns of indexes.

The leading column is already sorted, so the distribution can be
computed quickly by a single traversal of the index. However, for non-
leading columns you need to gather the data and sort it to generate
the histograms - this is where the sampling will reduce run time and
reduce the amount of tempdb space that is needed.

For your point 4) - the answer is "it depends". I've used various
rules that I've mostly made up of thin air, and just thought made kind
of sense with respect to the size of the databases/servers involved.

Michael