
I have the following code in my while loop and it is significantly slow. Any suggestions on how to improve it?

open IN, "<$FileDir/$file" || Err( "Failed to open $file at location: $FileDir" );
my $linenum = 0;

while ( $line = <IN> ) {
    if ( $linenum == 0 ) {
        Log(" This is header line : $line");
        $linenum++;
    } else {
        $linenum++;
        my $csv    = Text::CSV_XS->new();
        my $status = $csv->parse($line);
        my @val    = $csv->fields();

        $index = 0;
        Log("number of parameters for this file is: $sth->{NUM_OF_PARAMS}");
        for ( $index = 0; $index <= $#val; $index++ ) {
            if ( $index < $sth->{NUM_OF_PARAMS} ) {
                $sth->bind_param( $index + 1, $val[$index] );
            }
        }

        if ( $sth->execute() ) {
            $ifa_dbh->commit();
        } else {
            Log("line $linenum insert failed");
            $ifa_dbh->rollback();
            exit(1);
        }
    }
}
  • DB operations will be your bottleneck. If those really are "inserts", then you could batch your inserts instead of using a db transaction for each insert. – Miller Aug 20 '21 at 23:40
  • Don't create a new `Text::CSV_XS` object for each iteration of the loop. Instead create it and loop over the file using its [`getline`](https://metacpan.org/pod/Text::CSV_XS#getline) method. – Shawn Aug 21 '21 at 00:18
  • Please add a tag for the DBMS you're using. Additionally, you can profile your code using [this](https://metacpan.org/pod/Devel::NYTProf). It's not perfect but it's probably more than adequate for the code you showed. – Jeff Holt Aug 21 '21 at 02:05
  • Could anyone give an example of how to do a batch insert, as it is the bottleneck and performance is still slow. – Jeg Aug 24 '21 at 18:42
  • @Jeg The code `bind_params` up to the `NUM_OF_PARAMS` from fields of the CSV file. Can the number of fields change between rows of the CSV file? Do you just always take the first `NUM_OF_PARAMS` fields of the CSV file to bind, as it seems? – zdim Aug 24 '21 at 23:08
  • Hi @zdim Num of params will be same for single csv file, I also tried without bind_params as you suggested with row arrayref. Problem is with number of inserts, so looking to bulk/batch them. – Jeg Aug 24 '21 at 23:49

1 Answer


By far the most expensive operation there is accessing the database server; it's a network round trip, hundreds of milliseconds or some such, each time.

Are those DB operations inserts, as they appear? If so, instead of inserting row by row, construct a string for one insert statement with multiple rows, in principle as many as there are, in that loop. Then run that single statement as one transaction.

If that adds up to too many rows, test and scale down as needed: keep adding rows to the string for the insert statement up to a decided maximum number, insert that, then keep going.

A few more readily seen inefficiencies:

  • Don't construct an object every time through the loop. Build it once before the loop, and then use/repopulate it as needed inside the loop. Then there is no need for parse+fields here, and getline is also a bit faster

  • There's no need for that if statement on every read. First read one line of data, and that's your header. Then enter the loop, without the if

Altogether, without placeholders (which may not be needed now), something like

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

# There's a $table earlier, with its @fields to populate
my $qry = "INSERT into $table (" . join(',', @fields) . ") VALUES ";

open my $IN, '<', "$FileDir/$file" 
    or Err( "Failed to open $file at location: $FileDir" );

my $header_arrayref = $csv->getline($IN);
Log( "This is header line : @$header_arrayref" );

my @sql_values;
while ( my $row = $csv->getline($IN) ) {       
    # Use as many elements in the row (@$row) as there are @fields
    push @sql_values, '(' . 
        join(',', map { $dbh->quote($_) } @$row[0..$#fields]) . ')';

    # May want to do more to sanitize input further
}

$qry .= join ', ', @sql_values;

# Now $qry is ready. It is
# INSERT into table_name (f1,f2,...) VALUES (v11,v12...), (v21,v22...),...
$dbh->do($qry) or die $DBI::errstr;

I've also corrected the error handling when opening the file, since the `||` in the question binds too tightly in this case: it effectively runs `open IN, ("<$FileDir/$file" || Err(...))`, and since that string is true `Err` never runs. We need `or` instead of `||` there. Then, the three-argument `open` is better. See perlopentut.

If you do need the placeholders, perhaps because you can't have a single insert but it must be broken into many, or for security reasons, then you need to generate the exact `?`-tuples for each row to be inserted, and later supply the right number of values for them.

You can assemble the data first and then build the `?`-tuples based on it

my $qry = "INSERT into $table (" . join(',', @fields) . ") VALUES ";

...

my @data;
while ( my $row = $csv->getline($IN) ) {    
    push @data, [ @$row[0..$#fields] ];
}

# Append the right number of (?,?...),... with the right number of ? in each
$qry .=  join ', ', map { '(' . join(',', ('?')x@$_) . ')' } @data;

# Now $qry is ready to bind and execute
# INSERT into table_name (f1,f2,...) VALUES (?,?,...), (?,?,...), ...
$dbh->do($qry, undef, map { @$_ } @data) or die $DBI::errstr;

This may generate a very large string, which may push the limits of your RDBMS or of some other resource. In that case break @data into smaller batches: prepare the statement with the right number of (?,?,...) row-values for a batch, and execute it in a loop over the batches.
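
For illustration, here is a rough sketch of such batching, inserting as it reads instead of collecting everything into @data first; it reuses $dbh, $table, @fields, and $csv from above, and the batch size of 1,000 is an arbitrary placeholder to tune for your data and server.

my $batch_size = 1_000;    # arbitrary; tune for your data and server

# Build and run one multi-row insert for a batch of row arrayrefs
my $insert_batch = sub {
    my @rows = @_;
    return if not @rows;
    my $sql = "INSERT into $table (" . join(',', @fields) . ") VALUES "
        . join ', ', map { '(' . join(',', ('?') x @$_) . ')' } @rows;
    $dbh->do($sql, undef, map { @$_ } @rows) or die $DBI::errstr;
};

my @batch;
while ( my $row = $csv->getline($IN) ) {
    push @batch, [ @$row[0..$#fields] ];
    if (@batch >= $batch_size) {
        $insert_batch->(@batch);
        @batch = ();
    }
}
$insert_batch->(@batch);    # the last, possibly shorter, batch

Since every full batch has the same number of rows, the statement for it could also be prepared once and re-executed, with only the last (shorter) batch needing its own query.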

Finally, another way altogether is to load the data directly from a file, using the database's tool for that particular purpose. This will be far faster than going through DBI, probably even when you include the need to process your input CSV into another one that holds only the needed data.

Since you don't need all data from your input CSV file, first read and process the file as above and write out a file with only the needed data (@data above). Then there are two possible ways:

  • Either use an SQL command for this – COPY in PostgreSQL, LOAD DATA [LOCAL] INFILE in MySQL and Oracle (etc.), for which a sketch for MySQL follows after this list; or,

  • Use a dedicated tool for importing/loading files from your RDBMS – mysqlimport (MySQL), SQL*Loader/sqlldr (Oracle), etc. I'd expect this to be the fastest way
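
For the first option, here is a sketch for MySQL only (Oracle is covered by the next point and further below); the file name in $datafile and the DSN are hypothetical, and with DBD::mysql the mysql_local_infile=1 option has to be enabled in the DSN (and LOCAL INFILE allowed on the server) for this to work.

# Write out only the needed data (@data from above) as a plain CSV
my $datafile = "$FileDir/data_only.csv";                 # hypothetical name
my $out_csv  = Text::CSV_XS->new({ binary => 1, eol => "\n" });
open my $OUT, '>', $datafile or die "Can't open $datafile: $!";
$out_csv->print($OUT, $_) for @data;
close $OUT or die "Can't close $datafile: $!";

# Load it with one statement (MySQL). Needs mysql_local_infile=1 in the DSN,
# e.g. dbi:mysql:database=mydb;host=dbhost;mysql_local_infile=1
my $load_sql = sprintf(
    q{LOAD DATA LOCAL INFILE '%s'
      INTO TABLE %s
      FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
      LINES TERMINATED BY '\n'
      (%s)},
    $datafile, $table, join(',', @fields)
);
$dbh->do($load_sql) or die $DBI::errstr;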

The second of these options can also be done from within the program, by running the appropriate tool as an external command via `system` (or, better yet, via suitable libraries).
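
For instance, a rough sketch of driving Oracle's sqlldr that way; the control file, its contents, and the connection variables are all hypothetical here.

# loader.ctl (hypothetical) might contain something like:
#   LOAD DATA
#   INFILE 'data_only.csv'
#   APPEND INTO TABLE my_table
#   FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
#   (field1, field2, field3)

my @cmd = (
    'sqlldr',
    "userid=$db_user/$db_pass\@$db_sid",    # hypothetical credentials
    'control=loader.ctl',
    'log=loader.log',
);

# The list form of system avoids the shell; modules like IPC::System::Simple
# or Capture::Tiny give nicer error handling and output capture
system(@cmd) == 0
    or die "sqlldr failed (exit status " . ($? >> 8) . ")";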


In one application I've put together as many as millions of rows in the initial insert (the string itself for that statement was in the high tens of MB), and that keeps running with ~100k rows inserted in a single statement daily, for a few years by now. This is PostgreSQL on good servers, and of course YMMV.

Some RDBMSs do not support a multi-row (batch) insert query like the one used here; in particular, Oracle seems not to. (We were informed in the end that that's the database used here.) But there are other ways to do it in Oracle; please see the links in the comments, and search for more. The script will then need to construct a different query, but the principle of operation is the same.
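
If staying with Perl and DBI, one more option worth mentioning (not from the links in the comments) is DBI's bulk binding via execute_array, which DBD::Oracle supports efficiently, so the rows go over in batches while the statement stays a plain single-row insert. A sketch, assuming the $dbh, $table, @fields, and @data from earlier, with AutoCommit off as in the question:

# One placeholder per column, single-row statement
my $sql = "INSERT into $table (" . join(',', @fields) . ") VALUES ("
    . join(',', ('?') x @fields) . ")";
my $sth = $dbh->prepare($sql);

# execute_array binds one array per column, so pivot the row-wise @data
my @columns = map { my $i = $_; [ map { $_->[$i] } @data ] } 0 .. $#fields;

my @tuple_status;
$sth->execute_array({ ArrayTupleStatus => \@tuple_status }, @columns)
    or die "Bulk insert failed: $DBI::errstr";
$dbh->commit();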

  • Let me know if specific code examples are needed for the three points made here – zdim Aug 21 '21 at 06:36
  • I believe the `getline` method of `Text::CSV` is more efficient than using `parse`. – TLP Aug 21 '21 at 10:24
  • Thank you all for your response. Zdim, could you provide code example if you can? Thanks – Jeg Aug 22 '21 at 18:18
  • @zdim, can you provide code example for the points you mentioned here as i am not familiar with perl? – Jeg Aug 23 '21 at 16:57
  • @Jeg Yes, I will as I get a minute – zdim Aug 23 '21 at 17:13
  • Thanks for the response @zdim, do you have any example on how to batch the database inserts? – Jeg Aug 24 '21 at 15:09
  • Could anyone give an example of how to do a batch insert, as it is the bottleneck and performance is still slow after modifying as above – Jeg Aug 24 '21 at 18:42
  • @Jeg I will, when I get to it. Hopefully later in the day – zdim Aug 24 '21 at 19:54
  • @Jeg I added SQL. Still need to check the details but it should be fine – zdim Aug 25 '21 at 03:36
  • Thanks @zdim, i was able to generate single insert statement with multiple rows but it seems like not supported in Oracle. Have you done this in Oracle or Postgres? – Jeg Aug 25 '21 at 16:27
  • @Jeg I do this in Postgresql, and have tested it (for this post) in SQLite. There's been a bunch of other posts (on batch insert) based on other RDBMS. The multiple rows in a single insert is just SQL, I would think ...? I'd be surprised if any major system didn't support it (but I don't actually know). Try by hand something simple, like `insert into table-name (f1 f2) (1,2), (10,20)` or some such. Manual and simple. – zdim Aug 25 '21 at 17:38
  • @Jeg Well, it can be done but indeed not like this; there is apparently another way to do it. Here's a very active stackoverflow [page](https://stackoverflow.com/questions/39576/best-way-to-do-multi-row-insert-in-oracle) just for that (the first link I opened in Google on "oracle batch insert" search) – zdim Aug 25 '21 at 17:41
  • @Jeg And here are some documentation links, [on livesql.oracle](https://livesql.oracle.com/apex/livesql/file/content_BM1LJQ87M5CNIOKPOWPV6ZGR3.html), from [blog.oracle](https://blogs.oracle.com/oraclemagazine/bulk-processing-with-bulk-collect-and-forall), from [asktom.oracle](https://asktom.oracle.com/pls/apex/f?p=100:11:::::P11_QUESTION_ID:1583402705463). Note that you can also use SQL to directly import from a file (`COPY`), and that would probably be fastest. For that you can first process your input CSV and write out a file with only data to import, in the most suitable format – zdim Aug 25 '21 at 17:49
  • @Jeg Seems that the sql's `COPY` in Oracle only copies between tables, not from files. But there is `LOAD DATA INFILE filename INTO TABLE tablename FIELDS ( ... )` or some such (I don't know Oracle, just looked it up). So then you can process CSV as shown here and write it out, then `LOAD` it. Should be way faster than inserting it line by line! Here is some code on [blog.oracle](https://blogs.oracle.com/searchtech/loading-documents-and-other-file-data-into-the-oracle-database). That is, if links above on how to do multi-row (batch) insert on Oracle don't work out for some reason. – zdim Aug 26 '21 at 01:49
  • @Jeg It seems that `LOAD DATA INFILE` is the same as (or close to) what is in `MySQL`, which greatly expands your resources. For example, check out [this page](https://www.perlmonks.org/?node_id=263124). There is _a lot_ more out there – zdim Aug 26 '21 at 02:35
  • @Jeg I've articulated much of what's in the comments above into a few paragraphs added at the end of the answer (without the links). Now once you said that this is Oracle things change a little so please see the added comments. – zdim Aug 26 '21 at 08:15
  • Thanks so much @zdim for all your inputs, I used sqlldr and it seems to be working fine for me and loading the data much faster. – Jeg Aug 26 '21 at 14:25
  • @Jeg Yay :) I'd like to mention, even as you probably know, that using an external tool (like `sqlldr`) can also be done out of a program. The basic Perl facility for that is `system` but there are also nice libraries for managing external commands, like `IPC::System::Simple`, `Capture::Tiny`, ... and the "nuclear" option `IPC::Run`. – zdim Aug 26 '21 at 18:48