
I'm creating benchmark tables to measure hourly load (query latency, query duration) on our Netezza box. I have a fair idea of the number of rows I want in the tables, and I'm trying to decide on the number of columns. There will be no indexes; I'll be running full table scans.

I'm trying to decide how many columns I need in the benchmark tables and had the following question: how would the number of columns (and their types) affect the performance of a count(*) query? My initial thought is that a table with more columns will be spread over more disk blocks; consequently the system will have to make more disk seeks, leading to a longer query.

While I'm working on Netezza, I welcome answers relevant to other systems as well (MySQL, Postgres, Vertica, etc.) to help my general understanding.

There have been several discussions already (Q1, Q2, Q3) on the impact of column count on query performance, but those questions deal with generic queries rather than a full table scan without an index. Hence the separate question.


1 Answer


Yes, the number of columns will indirectly influence the performance. The amount of data in those columns will also affect the speed.

Why is that?

Every DBMS stores rows in blocks, typically 8k blocks, but not necessarily; data warehouse systems in particular tend to use larger block sizes. If a table has many columns with a lot of data in them (think varchar columns), fewer rows fit on a single database block.

For a transactional system that supports proper isolation, a count(*) query must visit all rows in a table (and check for each row whether it's currently visible to your transaction). The smallest unit a DBMS reads from the hard disk is a block, so the more rows fit onto a block, the less I/O needs to be done.

If a row takes 100 bytes on average, a single 8k block will contain approximately 80 rows. To count all rows in a table with 80 rows, the database only needs to do a single I/O operation (in reality it's a bit more, to look up the table itself).

Now if each row needs 1000 bytes, a single block contains approximately 8 rows, which in turn means that counting the same 80 rows requires reading 10 blocks, i.e. 10 I/O operations.

Even if the data is cached, it's still 1 "logical" I/O vs. 10 "logical" I/O operations.
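If you want to sanity-check this back-of-the-envelope math on Postgres, one way (just a sketch; the table name t is a placeholder) is to look at the configured block size and at how many rows actually ended up on a single block. Every row's ctid encodes its (block number, tuple index):

show block_size;                           -- typically 8192 bytes

select count(*) as rows_on_block_0
from t                                     -- placeholder table name
where (ctid::text::point)[0] = 0;          -- rows stored on block 0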

The above only holds true if no indexes are involved.

This effect can be observed with any DBMS that supports a detailed execution plan. Take this sample script for Postgres:

create table data5 (c1 text, c2 text, c3 text, c4 text, c5 text);
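-- note: only four of the five columns get a value below, so c5 stays NULL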
insert into data5 
select rpad('X',50,'X'),
       rpad('X',50,'X'),
       rpad('X',50,'X'),
       rpad('X',50,'X')
from generate_series(1,100000);

create table data10 (c1 text, c2 text, c3 text, c4 text, c5 text, c6 text, c7 text, c8 text, c9 text, c10 text);
insert into data10 
select rpad('X',50,'X'),
       rpad('X',50,'X'),
       rpad('X',50,'X'),
       rpad('X',50,'X'),
       rpad('X',50,'X'),
       rpad('X',50,'X'),
       rpad('X',50,'X'),
       rpad('X',50,'X'),
       rpad('X',50,'X'),
       rpad('X',50,'X')
from generate_series(1,100000);

The above creates two tables with 100,000 rows each. One with 5 columns, the other with 10.
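As a cross-check (a sketch, not part of the original script), you can also ask Postgres directly how many blocks each table occupies on disk:

analyze data5;
analyze data10;

select relname,
       relpages,                                                       -- block count from the planner statistics
       pg_relation_size(oid) / current_setting('block_size')::int as blocks_on_disk
from pg_class
where relname in ('data5', 'data10');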

When doing an explain analyze, the following will be returned (Postgres 9.3):

explain (analyze, buffers, verbose)
select count(*)
from data5;

Aggregate  (cost=4192.00..4192.01 rows=1 width=0) (actual time=27.539..27.539 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=2942
  ->  Seq Scan on stuff.data5  (cost=0.00..3942.00 rows=100000 width=0) (actual time=0.005..16.158 rows=100000 loops=1)
        Output: c1, c2, c3, c4, c5
        Buffers: shared hit=2942
Total runtime: 27.595 ms

The line Buffers: shared hit=2942 tells us that Postgres had to look at 2942 blocks in order to read through the entire table.

Now for the table with 10 columns:

explain (analyze, buffers, verbose)
select count(*)
from data10;

Aggregate  (cost=7917.00..7917.01 rows=1 width=0) (actual time=34.964..34.965 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=6667
  ->  Seq Scan on stuff.data10  (cost=0.00..7667.00 rows=100000 width=0) (actual time=0.010..22.187 rows=100000 loops=1)
        Output: c1, c2, c3, c4, c5, c6, c7, c8, c9, c10
        Buffers: shared hit=6667
Total runtime: 35.025 ms

We can see that Postgres had to look at 6667 blocks to get the count.
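Dividing the row count by those buffer counts makes the effect concrete: roughly 100,000 / 2942 ≈ 34 rows per block for data5 versus 100,000 / 6667 ≈ 15 rows per block for data10, so the table with twice as many columns needs a bit more than twice the I/O for the exact same count(*).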

A similar thing can be observed using Oracle's SQL*Plus and the set autotrace statistics option, which will also show the amount of (logical) I/O that was performed.
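A minimal sketch of how that looks in SQL*Plus (assuming the same data5 and data10 tables also exist in the Oracle schema):

rem show only the execution statistics, not the result rows
set autotrace traceonly statistics
select count(*) from data5;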

The statistics for the data5 table would look something like this:

VALUE | STATISTIC                             
------+---------------------------------------
  140 | bytes received via SQL*Net from client
  755 | bytes sent via SQL*Net to client      
 2977 | consistent gets                       
    0 | db block gets                         
    0 | physical reads                        
    0 | recursive calls                       
    0 | redo size                             
 2977 | session logical reads                 
    1 | sorts (memory)                        
    2 | SQL*Net roundtrips to/from client     
    1 | rows processed                        

The "consistent gets" denote the amount of logical I/O:

For the data10 table the output is as follows:

VALUE | STATISTIC                             
------+---------------------------------------
  141 | bytes received via SQL*Net from client
  615 | bytes sent via SQL*Net to client      
 7184 | consistent gets                       
    0 | db block gets                         
    0 | physical reads                        
    0 | recursive calls                       
    0 | redo size                             
 7184 | session logical reads                 
    1 | sorts (memory)                        
    2 | SQL*Net roundtrips to/from client     
    1 | rows processed                        

Again we clearly see the increase in (logical) I/O.
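To illustrate the "no indexes involved" caveat from above: once an index exists and the visibility map is up to date, Postgres may answer count(*) from the much narrower index via an index-only scan, and the width of the table then matters far less. A rough sketch (the index name is made up for this example):

create index data10_c1_idx on data10 (c1);
vacuum analyze data10;                     -- updates the visibility map so an index-only scan is possible

explain (analyze, buffers)
select count(*)
from data10;                               -- the plan may now show an Index Only Scan with far fewer buffers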

  • Just as a footnote: this answer's explanation applies when using either `count(1)` or `count(*)`. It's the columns in the _table_ that matter, not the parameter provided to the aggregate function. – Allan Oct 24 '14 at 20:56
  • @Allan There is no difference between `count(1)` and `count(*)`. The answer is also true for `count(some_column)` because when no index is present the DBMS will still have to read and inspect each and every row in the table (`count(some_column)` is equivalent to `count(*) ... where some_column is not null`) –  Oct 24 '14 at 21:11
  • Yeah, that's exactly what I was saying. Given how widespread the myth about `count(*)` is, I thought it would be useful to add a quick note that changing the function would not result in an improvement in this case. – Allan Oct 24 '14 at 21:16