7

Imagine a table with the following structure on PostgreSQL 9.0:

create table raw_fact_table (text varchar(1000));

For the sake of simplicity I only mention one text column; in reality there are a dozen. This table has 10 billion rows and each column has lots of duplicates. The table is created from a flat file (CSV) using COPY FROM.

To increase performance I want to convert to the following star schema structure:

create table dimension_table (id int, text varchar(1000));

The raw fact table would then be replaced with a fact table like the following:

create table fact_table (dimension_table_id int);

My current method is to essentially run the following query to create the dimension table:

Create table dimension_table (id int, text varchar(1000), primary key(id));

then to fill the dimension table I use:

insert into dimension_table (select null, text from raw_fact_table group by text);

Afterwards I need to run the following query:

select id into fact_table from dimension_table inner join raw_fact_table on (dimension_table.text = raw_fact_table.text);

Just imagine the horrible performance I get by comparing all strings to all other strings several times.

On MySQL I could run a stored procedure during the COPY FROM. This could create a hash of a string, and all subsequent string comparisons would be done on the hash instead of the long raw string. This does not seem to be possible on PostgreSQL; what do I do then?
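Roughly, the kind of thing I mean on the MySQL side is a trigger that stores a hash next to the raw string during the load (just a sketch; the text_hash column is made up):

create trigger hash_text before insert on raw_fact_table
for each row set new.text_hash = md5(new.text);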

Sample data would be a CSV file containing something like this (I also use quotes around integers and doubles):

"lots and lots of text";"3";"1";"2.4";"lots of text";"blabla"
"sometext";"30";"10";"1.0";"lots of text";"blabla"
"somemoretext";"30";"10";"1.0";"lots of text";"fooooooo"
David
  • How long does this seem to take? How long did you expect it to take? – Mike Sherrill 'Cat Recall' Jan 24 '11 at 12:12
  • I never finished it using the mentioned amount of data. But on 15 million rows it took several hours. I have already looked into all the standard server optimization stuff (work_mem etc.) so I am after a different method to achieve the same result. – David Jan 24 '11 at 13:35
  • I have edited the question to include sample data. – David Jan 25 '11 at 14:22

5 Answers

7

Just imagine the horrible performance I get by comparing all strings to all other strings several times.

When you've been doing this a while, you stop imagining performance, and you start measuring it. "Premature optimization is the root of all evil."

What does "billion" mean to you? To me, in the USA, it means 1,000,000,000 (or 1e9). If that's also true for you, you're probably looking at between 1 and 7 terabytes of data.

My current method is to essentially run the following query to create the dimension table:

Create table dimension_table (id int, text varchar(1000), primary key(id));

How are you gonna fit 10 billion rows into a table that uses an integer for a primary key? Let's even say that half the rows are duplicates. How does that arithmetic work when you do it?

Don't imagine. Read first. Then test.

Read Data Warehousing with PostgreSQL. I suspect these presentation slides will give you some ideas.

Also read Populating a Database, and consider which suggestions to implement.

Test with a million (1e6) rows, following a "divide and conquer" process. That is, don't try to load a million at a time; write a procedure that breaks it up into smaller chunks. Run

EXPLAIN <sql statement>
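For example, against the table from the question that might look like this (just a sketch):

EXPLAIN SELECT text FROM raw_fact_table GROUP BY text;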

You've said you estimate at least 99% duplicate rows. Broadly speaking, there are two ways to get rid of the dupes:

  1. Inside a database, not necessarily the same platform you use for production.
  2. Outside a database, in the filesystem, not necessarily the same filesystem you use for production.

If you still have the text files that you loaded, I'd consider first trying outside the database. This awk one-liner will output unique lines from each file. It's relatively economical, in that it makes only one pass over the data.

awk '!arr[$0]++' file_with_dupes > file_without_dupes

If you really have 99% dupes, by the end of this process you should have reduced your 1 to 7 terabytes down to about 50 gigs. And, having done that, you can also number each unique line and create a tab-delimited file before copying it into the data warehouse. That's another one-liner:

awk '{printf("%d\t%s\n", NR, $0);}' file_without_dupes > tab_delimited_file
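Loading that into the question's simplified one-column dimension table could then be as simple as this (a sketch; the path is a placeholder, COPY's default text format already expects tab-delimited input, and psql's \copy is the alternative if the file lives on the client):

COPY dimension_table (id, text) FROM '/path/to/tab_delimited_file';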

If you have to do this under Windows, I'd use Cygwin.

If you have to do this in a database, I'd try to avoid using your production database or your production server. But maybe I'm being too cautious. Moving several terabytes around is an expensive thing to do.

But I'd test

SELECT DISTINCT ...

before using GROUP BY. I might be able to do some tests on a large data set for you, but probably not this week. (I don't usually work with terabyte-sized files. It's kind of interesting. If you can wait.)
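That is, compare plans and timings for the two forms, using the column from the question (a sketch):

SELECT DISTINCT text FROM raw_fact_table;
SELECT text FROM raw_fact_table GROUP BY text;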

Mike Sherrill 'Cat Recall'
  • I AM getting horrible performance and I am asking for specific advice on solving this specific problem. The raw_fact_table does not have an integer primary key; only the dimension tables do, as there are 99.XX% duplicates in the fact_table. I have already implemented all the advice from the links you sent me. – David Jan 23 '11 at 23:58
  • The "T" in "ETL" is killing you. 99% duplicates means you're aiming for something around 100,000,000 rows. I'll edit my answer. – Mike Sherrill 'Cat Recall' Jan 24 '11 at 02:51
  • To simplify the example I only mentioned that the raw_fact_table has one text column. In fact it has a dozen, so your method for removing duplicates would not work, thanks for pointing this out, though. I will update the question. The raw_fact_table also has integer values and double values. – David Jan 24 '11 at 13:31
  • @David: The awk one-liner I wrote will still work. It treats the whole line as a key to an associative array. So it will output only unique lines, regardless of the number of columns. – Mike Sherrill 'Cat Recall' Jan 24 '11 at 13:37
  • But there will be no duplicate lines at all. There are only duplicates within each column; it will never happen that whole lines are duplicates. – David Jan 24 '11 at 14:42
  • Then we're back to "How are you gonna fit 10 billion rows into a table that uses an integer for a primary key?" – Mike Sherrill 'Cat Recall' Jan 24 '11 at 15:08
  • I do not have an integer primary key there, just in the dimension tables. Please read about the star schema to understand the concept of dimension tables. – David Jan 24 '11 at 22:31
2

You are omitting some details there at the end, but I don't see that there necessarily is a problem. It is not in evidence that all strings are actually compared to all other strings. If you do a join, PostgreSQL could very well pick a smarter join algorithm, such as a hash join, which might give you the same hashing that you are implementing yourself in your MySQL solution. (Again, your details are hazy on that.)
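For example, running EXPLAIN on the join from the question will show which join algorithm the planner picks; a hash join shows up as a "Hash Join" node (just a sketch):

EXPLAIN SELECT d.id FROM dimension_table d INNER JOIN raw_fact_table r ON (d.text = r.text);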

Peter Eisentraut
2
-- add an index on the text column
-- (note: hash indexes cannot be declared UNIQUE in PostgreSQL)
CREATE INDEX uidx ON dimension_table USING hash (text);
-- for case-insensitive matching, index hash(upper(text)) instead

Try both hash(text) and btree(text) to see which one is faster.
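For comparison, the btree counterpart would look like this (index name made up; btree is the default index type):

CREATE INDEX bidx ON dimension_table (text);
-- case-insensitive variant: CREATE INDEX bidx_upper ON dimension_table (upper(text));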

Bonshington
2

Just two questions: Is it necessary to convert your data in 1 or 2 steps? May we modify the table while converting?

Running more but simpler queries may improve your performance (and reduce the server load while doing it).

One approach would be:

  1. generate dimension_table (if I understand it correctly, you don't have performance problems with this) (maybe with an additional temporary boolean field...)
  2. repeat: choose one previously unselected entry from dimension_table, select every row from raw_fact_table containing it, and insert them into fact_table. Mark the dimension_table record as done, and move on to the next... You can write this as a stored procedure, and it can convert your data in the background, eating minimal resources (a rough sketch follows below)...
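A rough sketch of such a procedure, assuming the temporary boolean field is called done and using the table names from the question (the function name is made up):

CREATE OR REPLACE FUNCTION convert_in_background() RETURNS void AS $$
DECLARE
    dim RECORD;
BEGIN
    LOOP
        -- pick one dimension entry that has not been processed yet
        SELECT id, text INTO dim FROM dimension_table WHERE NOT done LIMIT 1;
        EXIT WHEN NOT FOUND;

        -- copy every matching raw row into the fact table
        INSERT INTO fact_table (dimension_table_id)
        SELECT dim.id FROM raw_fact_table WHERE raw_fact_table.text = dim.text;

        -- mark the dimension entry as done
        UPDATE dimension_table SET done = true WHERE id = dim.id;
    END LOOP;
END;
$$ LANGUAGE plpgsql;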

Or another (probably better):

  1. create fact_table with EVERY record from raw_fact_table AND one dimension_id (so including both a dimension_text and a dimension_id column)
  2. create dimension_table
  3. create an after insert trigger for fact_table (a rough sketch follows after this list) which:
    • searches for dimension_text in dimension_table
    • if not found, creates a new record in dimension_table
    • updates dimension_id to this id
  4. in a simple loop, insert every record from raw_fact_table into fact_table
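A rough sketch of the trigger from step 3; I would write it as a BEFORE INSERT trigger so the new row can be rewritten before it is stored, and it assumes dimension_table.id has a serial default (function and trigger names are made up):

CREATE OR REPLACE FUNCTION fill_dimension_id() RETURNS trigger AS $$
DECLARE
    dim_id int;
BEGIN
    -- look up the text of the incoming row in the dimension table
    SELECT id INTO dim_id FROM dimension_table WHERE text = NEW.dimension_text;
    IF NOT FOUND THEN
        -- not there yet: create the dimension record
        INSERT INTO dimension_table (text)
        VALUES (NEW.dimension_text)
        RETURNING id INTO dim_id;
    END IF;
    NEW.dimension_id := dim_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER fact_table_fill_dimension_id
BEFORE INSERT ON fact_table
FOR EACH ROW EXECUTE PROCEDURE fill_dimension_id();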
Dutow
  • Thank you for your suggestions. I did not specify this, but my only concern is to process all the data as fast as possible, so running something in the background does not make sense in my setup (I know it is very wise in other situations). The problem with your second approach is that triggers are not triggered on COPY FROM, so I believe there is no reason for the trigger. Your approach is still very much valid using a cursor, though. I am, however, uncertain about the performance: http://stackoverflow.com/questions/4776127/postgres-surprising-performance-on-updates-using-cursor – David Jan 25 '11 at 13:00
  • With regards to your first approach, I have not timed the performance of generating the dimension table (I should do this). I like your approach of thinking the other way around. I will create a new comment with an approach based on yours. – David Jan 25 '11 at 13:31
  • create table dimension_table (id serial, text varchar(1000), raw_fact_table_id bigint[], primary key(id)); insert into dimension_table (text, raw_fact_table_id) (select text, array_agg(raw_fact_table.id) from raw_fact_table group by text); Afterwards one would need to find a way to update raw_fact_table based on the ids in raw_fact_table_id. What do you think? – David Jan 25 '11 at 13:40
  • It looks good. After this you can easily alter the original table (add columns, and update after that), or you can create a join table (you don't need to alter the original), which solves the problem. (using unnest) – Dutow Jan 25 '11 at 19:43
  • I tried this method now and it took around 7 times longer than the original method. The way I run the update is to fetch one row from the dimension table containing the array and then run update raw_fact_table set text = '3' where id in(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17); The array is resolved to the numbers. Especially if there are few duplicates (say only 70%) in a column, then A LOT of update queries need to run. – David Jan 26 '11 at 10:26
  • But why don't you just create a join table with (dimension_id, raw_fact_table_id) pairs? Joins will be fast with that, and generating the table should take just one SQL statement from your current dimension table. – Dutow Jan 28 '11 at 11:50
1

I can see several ways of solving your problem. There is an md5 function in PostgreSQL: md5(string) calculates the MD5 hash of string, returning the result in hexadecimal.

insert into dimension_table (select null, md5(text), text from raw_fact_table group by text)

Add an md5 field to raw_fact_table as well, then:

select id into fact_table from dimension_table inner join raw_fact_table on (dimension_table.md5 = raw_fact_table.md5);

Indexes on the MD5 field might help as well.
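A sketch of the schema change this implies, using the question's tables (the column type and index name are my assumptions; md5() returns 32 hex characters):

alter table raw_fact_table add column md5 char(32);
update raw_fact_table set md5 = md5(text);
create index raw_fact_md5_idx on raw_fact_table (md5);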

Or you can calculate the MD5 on the fly while loading the data. For example, our ETL tool Advanced ETL Processor can do it for you. Plus, it can load data into multiple tables at the same time.

There are a number of online tutorials available on our web site. For example, this one demonstrates loading a slowly changing dimension:

http://www.dbsoftlab.com/online-tutorials/advanced-etl-processor/advanced-etl-processor-working-with-slow-changing-dimension-part-2.html

ETL Man
  • I do not believe one can calculate MD5 while running COPY FROM (which is the recommended way to load data). If this means that your tool does not use COPY FROM, then I believe it is useless, as loading without it would take ages. I have to say that I am very sceptical about a code-less ETL solution. It is fine as long as I just need to do the standard stuff, but if I ever run into a special-case problem, I have no code to fall back on. – David Jan 24 '11 at 22:35
  • Totally agree with you, COPY FROM is the fastest way to load data into PostgreSQL. This is why we use it internally in Advanced ETL Processor. From the PostgreSQL documentation: COPY TABLE_NAME FROM STDIN (STDIN specifies that input comes from the client application). – ETL Man Jan 25 '11 at 23:30
  • We did our best to make it as fast as possible. For every database we use the fastest way of loading the data (direct/conventional path for Oracle, bcp for SQL Server, COPY FROM for PostgreSQL, etc.). We actually printed the critical code and marked and eliminated all ineffective parts. Then we used a profiler and optimised performance further. We are constantly making improvements. (Have a look at our support forum and note how long it takes to resolve an issue or introduce a new feature, then compare it to the big players.) – ETL Man Jan 25 '11 at 23:31
  • Our software is not bloated (only 28M for the enterprise version). It works directly with 16 data sources and 17 data targets. It does not use a lot of memory, and memory usage is not related to the file/table size (except for the OLE DB source). Unlike SSIS/DTS/ODBC/OLEDB, it works correctly with Excel all the time. It can use the file system as a data source and load files into blob fields. It can save blob fields into separate files. It can process emails and use a POP3 server as a data source. It can send email and use an SMTP server as a data target. I could carry on, but I think you get the picture. – ETL Man Jan 25 '11 at 23:32
  • ETL software is incredibly difficult to write. I mean, do you know anybody with deep knowledge of the low-level APIs for MS SQL Server, Oracle, MySQL, PostgreSQL, Firebird, Interbase, SQLite, POP3, SMTP, XML, Excel, Access, DBF, FoxPro, ODBC, OLE DB, FTP, SFTP, FTPS, HTTP, SSL? I do. – ETL Man Jan 25 '11 at 23:33
  • Good to know that you use copy from. What about my second concern? – David Jan 26 '11 at 07:14
  • Yes, you can write your own transformation scripts if necessary – ETL Man Jan 26 '11 at 22:13