
I have a .csv file that is about 5 MB (~45,000 rows). I need to run through each row of the file and check whether the ID in that row already exists in a table in my database. If it does, that row can be deleted from the file.

I did a good amount of research on the most memory-efficient way to do this, so I've been using the approach of writing the lines that don't need to be deleted to a temporary file and then renaming that file to the original. Code below:

$file= fopen($filename, 'r');
$temp = fopen($tempFilename, 'w');

while(($row = fgetcsv($file)) != FALSE){
    // id is the 7th value in the row
    $id = $row[6];
    
    // check table to see if id exists
    $sql = "SELECT id FROM table WHERE id = $id";
    $result = mysqli_query($conn, $sql);
    
    // if id is in the database, skip to next row
    if(mysqli_num_rows($result) > 0){
        continue;
    }
    
    // else write line to temp file
    fputcsv($temp, $row);
}

fclose($file);
fclose($temp);

// overwrite original file
rename($tempFilename, $filename);

Problem is, I'm running into a timeout while executing this bit of code. Anything I can do to make the code more efficient?

sverdon
  • What are your memory requirements? Is it out of the question to load the whole file at once, saving disk access? – Torbjörn Stabo Oct 22 '22 at 18:57
  • It's not the disk access, it's those forty-five thousand queries to the database... – Honk der Hase Oct 22 '22 at 18:59
  • They matter too, no doubt about that. But the question was about whether the code could be made more efficient in general. It's possible that both could be improved. – Torbjörn Stabo Oct 22 '22 at 19:08
  • @TorbjörnStabo Code is running on a web server with a memory limit of 768M and max execution time of 120s. These values can't be changed. – sverdon Oct 22 '22 at 20:03
  • Not the fastest way (probably) but possibly fast enough: Prepare your query once and execute in a loop with different ids. This will at least remove the overhead from "compiling" the query again and again. You might also need to wrap it into a single transaction. – Paul Spiegel Oct 23 '22 at 04:37
  • @sverdon Fair enough. For the record, I didn't mean "improved" by changing the config though, but by rewriting the code. – Torbjörn Stabo Oct 23 '22 at 13:37
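
A minimal sketch of the prepared-statement idea from Paul Spiegel's comment above, assuming the $conn, $file and $temp handles from the question; the statement is compiled once and only the bound id changes per row:

// prepare once, outside the loop
$stmt = mysqli_prepare($conn, 'SELECT id FROM table WHERE id = ?');
mysqli_stmt_bind_param($stmt, 'i', $id);   // $id is bound by reference

while (($row = fgetcsv($file)) !== false) {
    $id = $row[6];
    mysqli_stmt_execute($stmt);
    mysqli_stmt_store_result($stmt);

    // if the id is already in the database, skip the row
    if (mysqli_stmt_num_rows($stmt) > 0) {
        continue;
    }

    fputcsv($temp, $row);
}

mysqli_stmt_close($stmt);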

2 Answers


You fire one database query per line, i.e. 45,000 queries... that takes too much time.

Better to run a single query before the loop and read the existing ids into a lookup array, then only check this array inside the loop.

Pseudo code:

$st = query('SELECT id FROM table');
while ($row = $st->fetch()) {
    $lookup[ $row['id'] ] = $row['id'];
}

// now read CSV
while($row = fgetcsv($h)) {
    $id = $row[6];

    if (isset($lookup[ $id ])) {
        // exist...
        continue;
    }

    // write the non-existing id to different file...
}
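
For completeness, a more concrete version of that sketch, assuming the mysqli connection and file names from the question (still with the placeholder table and column names):

// read all existing ids once and index them as array keys
$lookup = [];
$result = mysqli_query($conn, 'SELECT id FROM table');
while ($dbRow = mysqli_fetch_assoc($result)) {
    $lookup[$dbRow['id']] = true;
}

$file = fopen($filename, 'r');
$temp = fopen($tempFilename, 'w');

while (($row = fgetcsv($file)) !== false) {
    // keep only rows whose id is not already in the database
    if (!isset($lookup[$row[6]])) {
        fputcsv($temp, $row);
    }
}

fclose($file);
fclose($temp);
rename($tempFilename, $filename);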

Edit: assume memory isn't sufficient to hold 1 million integers from the database. How can it still be done efficiently?

Collect the ids from the CSV into an array. Run a single query to find all of those ids in the database and collect the matches (there can be at most as many as there are in the CSV). Now array_diff() the ids from the file against the ids from the database; the remaining ids exist in the CSV but not in the database.

Pseudo code:

$ids_csv = [];
while($row = fgetcsv($h)) {
    $id = $row[6];
    $ids_csv[] = intval($id);
}

$sql = sprintf('SELECT id FROM table WHERE id IN(%s)', implode(',', $ids_csv));

$ids_db = [];
$st = query($sql);
while ($row = $st->fetch()) {
    $ids_db[] = $row['id'];
}

$missing_in_db = array_diff($ids_csv, $ids_db);
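
Again for completeness, a more concrete take on that pseudo code, assuming the question's mysqli connection; here the ids are looked up in chunks of 10,000 (an arbitrary batch size) so the IN() list never gets excessively long:

// collect every id that appears in the CSV
$ids_csv = [];
$h = fopen($filename, 'r');
while (($row = fgetcsv($h)) !== false) {
    $ids_csv[] = (int) $row[6];
}
fclose($h);

// look the ids up in batches
$ids_db = [];
foreach (array_chunk($ids_csv, 10000) as $chunk) {
    $sql = sprintf('SELECT id FROM table WHERE id IN (%s)', implode(',', $chunk));
    $result = mysqli_query($conn, $sql);
    while ($dbRow = mysqli_fetch_assoc($result)) {
        $ids_db[] = (int) $dbRow['id'];
    }
}

// ids that are in the CSV but not in the database
$missing_in_db = array_diff($ids_csv, $ids_db);
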
Honk der Hase
  • Yep, this is how I'd do it. You can't prioritize memory efficiency _and_ time efficiency. Pick one. Time is almost always going to be more valuable. – Alex Howansky Oct 22 '22 at 19:11
  • @AlexHowansky My worry here is a memory issue. The table I'm querying has over 1 million rows. Adding each of these ids to an array is more than likely going to run into a memory issue (I was doing something similar to this before, and I switched methods because of a memory problem). – sverdon Oct 22 '22 at 20:06
  • With 768MB (as you say in the comment above) you will not run into a memory issue... 1 million 4-byte integers are 4 MB... an array has overhead, but even if it were 100 bytes of overhead per entry, you'd still end up with 100 * 1 million = 100 million bytes (which is 100 MB). But ok... I'll edit the post and show you another way to do it efficiently :) – Honk der Hase Oct 22 '22 at 20:39
  • @HonkderHase I appreciate you going above and beyond! I tried your first solution and it seems to be working perfectly, no timeout or memory issues so far. Thank you. – sverdon Oct 22 '22 at 21:24
  • You can't rely on an "infinite" number of parameters passed to the `IN` clause because there are several limitations: https://stackoverflow.com/questions/12666502/mysql-in-clause-max-number-of-arguments. I use my `LOAD DATA INFILE` answer with text files containing millions of rows on a daily basis on production sites (although for a different goal than asked here). – Piemol Oct 23 '22 at 08:30
  • That is true, but the database can't handle "infinite" either... it's tedious to discuss which system will break first :) – Honk der Hase Oct 23 '22 at 09:30
  1. I would use LOAD DATA INFILE (https://dev.mysql.com/doc/refman/8.0/en/load-data.html) to read the csv file into a separate table. Your database user needs the FILE privilege on the database to use it.
  2. Then you can run one query to delete the ids that already exist (DELETE ... JOIN ...).
  3. And export the rows that were left intact (a rough sketch of all three steps follows below).
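
A rough sketch of those three steps, run through the question's mysqli connection; the staging table csv_import, the file paths, and the target table name are placeholders, and both LOAD DATA INFILE and SELECT ... INTO OUTFILE are subject to server-side settings such as secure_file_priv:

// 1. bulk-load the CSV into a staging table (requires the FILE privilege)
mysqli_query($conn, "
    LOAD DATA INFILE '/path/to/file.csv'
    INTO TABLE csv_import
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
");

// 2. delete staged rows whose id already exists in the main table
mysqli_query($conn, "
    DELETE csv_import
    FROM csv_import
    JOIN `table` ON `table`.id = csv_import.id
");

// 3. export the remaining rows back to a file
mysqli_query($conn, "
    SELECT * FROM csv_import
    INTO OUTFILE '/path/to/cleaned.csv'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
");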

Another option is to use your loop to insert your csv file into a separate table, and then proceed with step 2.

Update: I use LOAD DATA INFILE with csv files of up to 2 million rows (at the moment) and do some bulk data manipulation with big queries. It's blazingly fast, and I would recommend this route for files containing more than 100k lines.

Piemol