
I've created a program that parses data from a file and imports it into a relational PostgreSQL database. The program has been running for 2 weeks and looks like it has a few more days left. It is averaging ~150 imports a second. How can I find the limiting factor and make it go faster? The CPU for my program does not go above 10% and its memory does not go above 7%; the PostgreSQL server's CPU does not go above 10%, and its memory stays around 25%.

I'm guessing that the limiting factor is the hard-disk write speed, but how can I verify that, and if it is the case, how can I improve it (short of buying a faster hard drive)?

This is the output of `iostat -d -x`:

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.59     3.23    0.61    1.55    23.15    38.37    28.50     0.01    5.76   1.04   0.22
sdb               0.02   308.37   21.72  214.53   706.41  4183.68    20.70     0.56    2.38   2.24  52.89

As you can likely guess, the database is on sdb.

EDIT: The file I am parsing is ~7GB. For most (but not all) of the data in the file I go line by line; here is an example:

  1. Return the ID of partA in tableA.
    • If the ID does not exist, insert partA into tableA, returning the ID.
  2. Return the ID of partB in tableB.
    • If the ID does not exist, insert partB into tableB, returning the ID.
  3. Return the ID of the many-to-many relationship of partA and partB.
    • If the ID of the relationship does not exist, create it.
    • Otherwise, update the relationship (with a date ID).
  4. Move on to the next line.

To save many queries, I save the IDs of inserted PartA and PartB items in memory to reduce lookups.
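
In code, each iteration is roughly the following. This is a minimal sketch, not the actual script: it assumes psycopg2 and a made-up schema (tablea(id, name), tableb(id, name), ab_link(a_id, b_id, date_id)), and parse() stands in for the real parser.

    # Sketch of the per-line flow above, NOT the real import script.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")
    cur = conn.cursor()

    id_cache = {"tablea": {}, "tableb": {}}     # in-memory cache of known IDs

    def get_or_create(table, value):
        """Return the ID for value in table, inserting it if it is missing."""
        cache = id_cache[table]
        if value in cache:
            return cache[value]
        # table comes from a fixed set of names above, so %-formatting it is safe here
        cur.execute("SELECT id FROM %s WHERE name = %%s" % table, (value,))
        row = cur.fetchone()
        if row is None:
            cur.execute("INSERT INTO %s (name) VALUES (%%s) RETURNING id" % table,
                        (value,))
            row = cur.fetchone()
        cache[value] = row[0]
        return row[0]

    for line in open("datafile"):
        part_a, part_b, date_id = parse(line)            # hypothetical parser
        a_id = get_or_create("tablea", part_a)
        b_id = get_or_create("tableb", part_b)
        cur.execute("UPDATE ab_link SET date_id = %s "
                    "WHERE a_id = %s AND b_id = %s", (date_id, a_id, b_id))
        if cur.rowcount == 0:                            # relationship is new
            cur.execute("INSERT INTO ab_link (a_id, b_id, date_id) "
                        "VALUES (%s, %s, %s)", (a_id, b_id, date_id))
        conn.commit()   # NOTE: a commit per line forces a WAL flush per row;
                        # whether the real script does this is an assumption

If the script commits per line (or runs in autocommit mode), each row costs several round trips plus a disk flush, which would line up with low throughput despite idle CPUs.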

Here is a copy of my postgresql.conf file: http://pastebin.com/y9Ld2EGz The only things I changed were the default data directory and the memory limits.
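
For reference, since the comments below focus on the write/flush settings, here is a sketch of checking their current values from the import script itself, and of relaxing `synchronous_commit` for just that session (no postgresql.conf edit or restart needed). It assumes the psycopg2 cursor from the sketch above.

    # Read the write-related settings from the running server.
    for name in ("fsync", "synchronous_commit", "wal_buffers",
                 "checkpoint_segments", "shared_buffers", "work_mem"):
        cur.execute("SHOW " + name)
        print("%s = %s" % (name, cur.fetchone()[0]))

    # Per-session and reversible; a crash can lose the last few transactions
    # but does not corrupt the database (unlike turning fsync itself off).
    cur.execute("SET synchronous_commit TO OFF")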

lanrat
  • Watch your hard disk LED light. Is it almost always on? If so, that's your bottleneck. If it only turns on occasionally, then it's not. Simple as that. :-) – user541686 Jun 03 '11 at 04:28
  • It's on almost non-stop, but that just indicates writing to the hard disk, not that the writing is going as fast as it can. – lanrat Jun 03 '11 at 05:06
  • @mrlanrat: The program is going as fast as it can, the disk is the bottleneck. :) – user541686 Jun 03 '11 at 05:16
  • @Mehrdad ok, so short of getting a faster disk is there anything else I can do to speed it up? – lanrat Jun 03 '11 at 05:19
  • @mrlanrat: If you have enough RAM, you could put the file into a RAM disk. How much RAM do you have? – user541686 Jun 03 '11 at 05:22
  • @Mehrdad I have 5GB of RAM, the file is ~7.2GB. But I don't think reading the file (line by line) is the problem, its writing the data to the database. – lanrat Jun 03 '11 at 05:26
  • @mrlanrat: Hm... I'm not sure about Linux, but if you were using Windows and your file was compressible (e.g. text), you could have used an NTFS RAM disk and then compressed the file, fitting it into your RAM. (I've done this before.) But I don't think you can do that on Linux, sorry... can't think of any better way right now. :\ – user541686 Jun 03 '11 at 05:32
  • @Mehrdad Actually there is a way to do that in Linux, using either swap or /dev/shm. But again, Python (the language I am using for this) is already reading the file into memory in large chunks; I really think it is writing to the database that is slowing this down. – lanrat Jun 03 '11 at 05:36
  • Also, if I have the program run through the file and parse it, but skip importing it into the database, it takes just over an hour. – lanrat Jun 03 '11 at 05:37
  • @mrlanrat: It's quite possible, I have no idea *which* file is the bottleneck, only that at least one of them is. :) – user541686 Jun 03 '11 at 05:48
  • @mrlanrat, can you post bits from your postgresql.conf? Chances are that's where you want to spend your effort. 300 writes per second at 20 bytes per write suggests that basic tuning would solve your problem. – Sean Jun 03 '11 at 15:55
  • Posted a link to my postgresql.conf file. – lanrat Jun 03 '11 at 19:44

1 Answer

You should have killed this process many, many days ago and asked not how to find the limiting factor of the app but rather:

Is there a faster way to import data into pgSQL?

Take a look at this question and then the pgSQL COPY documentation. It's possible that the import process you're running could be achieved in hours rather than weeks with the proper tools.
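
For example, since you're already in Python, the COPY path could be driven with psycopg2's copy_from. This is only a sketch (the file, table, and column names are invented): the parser writes a tab-separated file once, then COPY streams it in with a single statement.

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")
    cur = conn.cursor()

    # One flat table to receive the parsed rows (name and columns are made up).
    cur.execute("CREATE TABLE staging (parta text, partb text, date_id integer)")

    with open("parsed_rows.tsv") as f:      # tab-separated, one record per line
        cur.copy_from(f, "staging", sep="\t",
                      columns=("parta", "partb", "date_id"))

    conn.commit()

COPY moves every row through one statement and one commit, instead of one round trip and (potentially) one flush per row.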

By the way, regardless of what RDBMS you're using, programmatic insertions of data are never as performant as the native bulk tools provided by the RDBMS vendor for operations like these. For example, SQL Server has bcp, DTS/SSIS, and a few other options for bulk data import/export; Oracle has its own, etc.

Paul Sasik
  • That looks like a great idea, but the problem is that the file I am parsing is not in a nice CSV or similar format. It is in its own unique format, with duplicate data. Also, the data in the file is being put into multiple tables, with many-to-many relationships. – lanrat Jun 03 '11 at 05:09
  • I added more information to my question about the process I am using to import the data; if I can still use the COPY command with a method like this, please let me know. – lanrat Jun 03 '11 at 05:20
  • I see your issue. But for the performance reasons already mentioned, I would recommend that you import the data into a temporary table in your db first and then perform your inserts into the proper tables from there. You could/should then drop the initial table afterwards. This solution is very different from the one you've worked out and would not involve a separate app; you would do all your work within pgSQL. (See the sketch after these comments.) – Paul Sasik Jun 03 '11 at 12:25
  • PostgreSQL 9.1 is going to support `CREATE TABLE WITHOUT LOGGING` and is going to be a dream come true for people who need to bulk import data, then process and dispatch it. – Sean Jun 03 '11 at 15:54
  • @Sean: Great to know. Thanks for sharing. I'm also wondering whether a different db engine might help in this case. For example, if I were in mrlanrat's predicament I would use SQL Server bcp or SSIS/DTS to import the data, massage it with T-SQL procs and then export it to Postgres either via SSIS or a set of properly-formatted files for easier use with COPY. – Paul Sasik Jun 03 '11 at 16:07
  • Naw, shouldn't be necessary with this tiny amount of data. I think @mrlanrat is getting killed by `fsync(2)` calls and a tiny amount of tuning will solve his problem. TBH, `TRIGGER`s on a `VIEW` would also be a good way to go, but that's another 9.1 feature that's also going to be fantastic.... yeah, I'm using 9.1-beta1 right now and it's smexy. pl/pgsql triggers should be fast enough for that data, if not, a C trigger would do it as optimally as possible. More food for thought, but that's a big hammer to whip out over something trivial (I think). – Sean Jun 03 '11 at 16:13
  • I hope you're being facetious. 7GB in a single, denormalized file that is to be imported into multiple tables in a db is no small amount of data. Especially since the OP's current implementation has been running for weeks. – Paul Sasik Jun 03 '11 at 16:20
  • You can usually create CSV files instead of SQL inserts and use COPY, even for complex tables. However, if that's not really an option, there are several other ways of speeding things up. One is grouping many statements in a single transaction; this saves *a lot* of disk writes. Another is to batch the SQL statements if your API supports it (e.g. Statement.addBatch() in JDBC); this saves several round trips to the database. (Batching with JDBC sped up one of our imports by a factor of 6; we batched 1000 statements. The gain would be much less if the import runs on the same host as the db.) – nos Jun 03 '11 at 23:24
  • @Paul, agreed, but it's not insurmountable. 7GB at an IO-bound 30-70 MB/s is a ~3 min operation. I'd use `Boost.Spirit` to convert the bulk of the import into a `COPY`-friendly format and then throw that against a series of stored procedures. If I'm CPU bound in PostgreSQL, then I'd swap out PL/pgSQL in favor of C-based functions. 7GB per import isn't too scary without knowing more details. Right now I suspect the OP is `fsync(2)` bound due to tuning. – Sean Jun 03 '11 at 23:50
  • LOL. OK. Can we agree that there's 101 ways to perform this import and that perhaps the OP didn't pick the right one? – Paul Sasik Jun 04 '11 at 01:46
  • @nos How can I group insert statements together as a single transaction? – lanrat Jun 04 '11 at 18:06
  • Also, how can I use the copy statement to import data that will have foreign keys pointing to data in the data I am importing? – lanrat Jun 04 '11 at 18:07
  • I am now using a copy statement and things are going much faster! :) – lanrat Jun 13 '11 at 23:54
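
Putting the comments together, a sketch of the staging-table route might look like the following: COPY into a temporary table, then set-based SQL that resolves the foreign keys, all inside one transaction. The schema names are hypothetical, it assumes each new (partA, partB) pair appears once per batch, and the SQL sticks to what 2011-era PostgreSQL supports (no ON CONFLICT).

    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")
    cur = conn.cursor()

    cur.execute("""
        CREATE TEMP TABLE staging (parta text, partb text, date_id integer)
    """)

    # Bulk-load the parsed file with one COPY instead of one INSERT per row.
    with open("parsed_rows.tsv") as f:
        cur.copy_from(f, "staging", sep="\t",
                      columns=("parta", "partb", "date_id"))

    # Insert any partA / partB values not seen before (NOT EXISTS does the
    # "insert if missing" step).
    cur.execute("""
        INSERT INTO tablea (name)
        SELECT DISTINCT s.parta FROM staging s
        WHERE NOT EXISTS (SELECT 1 FROM tablea t WHERE t.name = s.parta)
    """)
    cur.execute("""
        INSERT INTO tableb (name)
        SELECT DISTINCT s.partb FROM staging s
        WHERE NOT EXISTS (SELECT 1 FROM tableb t WHERE t.name = s.partb)
    """)

    # Resolve the foreign keys by joining staging back to tablea/tableb, then
    # update existing relationships and insert the new ones.
    cur.execute("""
        UPDATE ab_link l SET date_id = s.date_id
        FROM staging s
        JOIN tablea a ON a.name = s.parta
        JOIN tableb b ON b.name = s.partb
        WHERE l.a_id = a.id AND l.b_id = b.id
    """)
    cur.execute("""
        INSERT INTO ab_link (a_id, b_id, date_id)
        SELECT a.id, b.id, s.date_id
        FROM staging s
        JOIN tablea a ON a.name = s.parta
        JOIN tableb b ON b.name = s.partb
        WHERE NOT EXISTS (SELECT 1 FROM ab_link l
                          WHERE l.a_id = a.id AND l.b_id = b.id)
    """)

    conn.commit()   # a single commit, so a single WAL flush for the whole batch

Because psycopg2 opens a transaction implicitly and only commits when conn.commit() is called, everything above hits the disk once rather than once per row.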