
I am trying to make a CSV upload page for an application that I'm building. It needs to be able to upload thousands of rows of data in seconds, each row including a first name, last name, and phone number. The data is being uploaded to a VM running Ubuntu Server. When I run the script to upload the data, it takes almost 2 minutes to upload 1500 rows. The script uses PDO; I also made a test script in Python to see if it was a PHP issue, and the Python script is just as slow. I have made CSV upload scripts in the past that are exactly the same and would upload thousands of rows in seconds. We have narrowed the issue down to the script, as we have tested it on other VMs hosted closer to us and the issue still persists. Is there an obvious issue with the script or with PDO that could be slowing it down? Below is the code for the script.

<?php

$servername = [Redacted];
$username = [Redacted];
$password = [Redacted];

try {
    $conn = new PDO("mysql:host=$servername;dbname=test", $username, $password);
    // set the PDO error mode to exception
    $conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    echo "Connected successfully";
} catch(PDOException $e) {
    echo "Connection failed: " . $e->getMessage();
}

echo print_r($_FILES);

$fileTmpPath = $_FILES['fileToUpload']['tmp_name'];
$fileName = $_FILES['fileToUpload']['name'];
$fileSize = $_FILES['fileToUpload']['size'];
$fileType = $_FILES['fileToUpload']['type'];

$CSVfp = fopen($fileTmpPath, "r");

$final_array = [];
while (($data = fgetcsv($CSVfp)) !== false) {

    $recived = [];
    foreach ($data as $i) {
        array_push($recived, $i );
    }
    array_push($final_array, $recived);
}
echo print_r($final_array);

fclose($CSVfp);

$non_compliant_rows = [];

foreach ($final_array as $key => $row){
    $fname = preg_replace('/[^A-Za-z0-9]/', "", $row[0]);
    $lname = preg_replace('/[^A-Za-z0-9]/', "", $row[1]);
    $mobileNumber = preg_replace( '/[^0-9]/i', '', $row[2]);
    $sanatized_row = array($fname, $lname, $mobileNumber);
    $recived[$key] = $sanatized_row;
    if (strlen($mobileNumber) > 10 or strlen($mobileNumber) < 9){
        array_push($non_compliant_rows, $final_array[$key]);
        unset($final_array[$key]);
    }
}
$final_array = array_values($final_array);
echo print_r($final_array);

foreach($final_array as $item){
    try{

        $stmt = $conn->prepare("INSERT INTO bulk_sms_list(fname, lname, pn, message, send) VALUES (?, ?, ?, 'EMPTY', 1) ;");
        $stmt->execute($item);

    }catch(PDOException $e){
        echo $e;
    }

}
echo "done";

The phone number column has a UNIQUE constraint to prevent us from having duplicate phone numbers. We have tried batch inserting, but if one row doesn't comply with the constraints then all of the insertions fail.
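For illustration, a minimal sketch of a chunked multi-row INSERT IGNORE (the option suggested in the comments below): IGNORE makes MySQL skip rows that violate the UNIQUE constraint on pn instead of failing the whole statement. It assumes the $conn connection and the cleaned rows in $final_array from the script above; the chunk size of 500 is an arbitrary choice.

<?php

foreach (array_chunk($final_array, 500) as $chunk) {
    // one "(?, ?, ?, 'EMPTY', 1)" group per row in this chunk
    $placeholders = implode(', ', array_fill(0, count($chunk), "(?, ?, ?, 'EMPTY', 1)"));

    $stmt = $conn->prepare(
        "INSERT IGNORE INTO bulk_sms_list (fname, lname, pn, message, send) VALUES $placeholders"
    );

    // flatten [[fname, lname, pn], ...] into a single flat parameter list
    $stmt->execute(array_merge(...$chunk));
}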

Below is the schema of the table:

+---------+------+------+-----+---------+----------------+
| Field   | Type | Null | Key | Default | Extra          |
+---------+------+------+-----+---------+----------------+
| id      | int  | NO   | PRI | NULL    | auto_increment |
| fname   | text | NO   |     | NULL    |                |
| lname   | text | NO   |     | NULL    |                |
| pn      | text | NO   | UNI | NULL    |                |
| message | text | YES  |     | NULL    |                |
| send    | int  | NO   |     | 1       |                |
+---------+------+------+-----+---------+----------------+

EDIT: I have timed the cleanup portion of the script at the request of @aynber. The cleanup took 0.24208784103394 seconds. The SQL portion took 108.2597219944 seconds.

  • If you don't care about errors, you can use `INSERT IGNORE` instead for a bulk insert – aynber Oct 18 '22 at 14:46
  • Move the prepare outside the loop (before the loop); why would you compile the same query for each iteration of that final loop? (A sketch combining this with a single transaction follows the comment thread.) – RiggsFolly Oct 18 '22 at 14:46
  • would be interesting to know how long is spent in the cleanup process and how long it takes to do the final database update, so you know where the changes need to be made – RiggsFolly Oct 18 '22 at 14:47
  • You could also try doing the tidy on the row as you read it from the file rather than loading everything into an array and then cleaning that array and then loading the database from that array – RiggsFolly Oct 18 '22 at 14:49
  • In response to moving the prepare statement outside of the for loop: I have tried this and it doesn't impact the time that the program takes to run – Weston Simon Oct 18 '22 at 14:50
  • The cleanup process is almost instant. It gets stuck on the database upload. – Weston Simon Oct 18 '22 at 14:51
  • _Trivial I know, but it shows a lack of attention:_ you load 4 variables out of `$_FILES` but only ever use one of them. – RiggsFolly Oct 18 '22 at 14:51
  • I assume the table you are loading has a lot of existing data? How much out of interest – RiggsFolly Oct 18 '22 at 14:52
  • What are the benefits of `echo print_r($_FILES);` and the two `echo print_r($final_array);` ? – Ken Lee Oct 18 '22 at 14:53
  • Currently it is empty. In the future it will have between 1k and 100k rows. It depends on the client. – Weston Simon Oct 18 '22 at 14:53
  • ___In response to moving the prepare statement to outside of the for loop. I have tried this and it doesn't impact the time that the program takes to run___ Regardless, why would you add wasted time in a situation where you are time constrained? – RiggsFolly Oct 18 '22 at 14:54
  • ___2 minutes to upload 1500 rows into an empty table___ This, I think, means the issue is not specifically in the coding. Show us the schema please, and the issue may well be configuration – RiggsFolly Oct 18 '22 at 14:55
  • You're also relying on autocommit, which means overhead of starting and committing a transaction for each row. You might like my presentation [Load Data Fast!](https://www.slideshare.net/billkarwin/load-data-fast) which compares the overhead of different styles of inserting data. – Bill Karwin Oct 18 '22 at 14:56
  • @KenLee I am doing that for testing, to make sure everything is running correctly and to find out what is taking all of the process time. – Weston Simon Oct 18 '22 at 14:56
  • But you have not segmented the code in any way in order to see how long each step is taking. I wonder how you know that the cleanup takes no time, etc., as you seem to have no way of knowing for sure – RiggsFolly Oct 18 '22 at 14:57
  • I added the schema to the original question – Weston Simon Oct 18 '22 at 14:58
  • I know how long it is taking because it displays the cleaned data after it is cleaned. – Weston Simon Oct 18 '22 at 14:59
  • What do you want to do with a row if it has a duplicate phone number? Skip the row? Set the `pn` to NULL? Something else? – Bill Karwin Oct 18 '22 at 15:01
  • You know how long the whole script is taking, but not how long each part is taking. You can test it by taking advantage of [microtime](https://www.php.net/manual/en/function.microtime.php). In my scripts, I add `$start_time = microtime(TRUE);` at the top, then something like `$time_end = microtime(TRUE); $execution_time = ($time_end - $start_time); echo __LINE__." - $execution_time\n";` at various places to check the timing. – aynber Oct 18 '22 at 15:02
  • @BillKarwin I want it to fail the row, because the application is for sending out SMS messages and you must have a phone number for that. The names are not required at all but are nice to have for later development – Weston Simon Oct 18 '22 at 15:04
  • @aynber I added the code you suggested and am timing it right now. I will update this comment once it is done with the time that it takes to clean up the data. It took 0.24208784103394 seconds, I guess, to complete the cleanup. – Weston Simon Oct 18 '22 at 15:08
  • [side note] I think I understand why you set `message` as text, but is it true that the fields fname, lname and pn have to be text and not char(n) /varchar(n)? – Ken Lee Oct 18 '22 at 15:08
  • Yes @KenLee. They are all text. Would it be better, process-wise, if they weren't? – Weston Simon Oct 18 '22 at 15:09
  • The performance difference between TEXT and VARCHAR should be insignificant in this case. Though I wonder why you allow a person's name to be 64KB. :-) – Bill Karwin Oct 18 '22 at 15:32
  • This question has been answered 1000 times. InnoDB default behavior is to blame. Either set innodb-flush-log-at-trx-commit to 0 or 2, or wrap the inserts in a transaction and skip duplicates with a catch, or use a multi-insert with ON DUPLICATE. – Your Common Sense Oct 18 '22 at 15:58
  • Does this answer your question? [Why is MySQL InnoDB insert so slow?](https://stackoverflow.com/questions/9819271/why-is-mysql-innodb-insert-so-slow) – Ken Lee Oct 18 '22 at 16:39
  • I tried setting innodb-flush-log-at-trx-commit to 0 and it didn't help. I then tried transactions; that cut the time down by 2 seconds. – Weston Simon Oct 18 '22 at 16:54
  • With innodb-flush-log-at-trx-commit, you probably didn't set it to 0. Did you check whether it was actually changed? What do you mean, "with transactions"? You need only one transaction, not many. Bear in mind, these 2 solutions have helped thousands, so you have to be extremely careful claiming "it didn't help". Provide some proof at least. – Your Common Sense Oct 18 '22 at 17:34
  • @YourCommonSense How do I check that it is set to 0 after the MySQL server has started? I accidentally typed transactions instead of transaction. – Weston Simon Oct 18 '22 at 18:05
  • I managed to run `SET GLOBAL innodb_flush_log_at_trx_commit=0;` at runtime and the issue still persists. I have timed how long it takes to send one query and it is 0.06 seconds, which is exactly in line with the more than 80 seconds it takes to send 1400 queries. – Weston Simon Oct 18 '22 at 19:18

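Putting the prepare-once and single-transaction suggestions from the comments together, here is a minimal sketch of what the insert loop could look like. It assumes the $conn connection and the cleaned rows in $final_array from the script above, and skips duplicate-key errors per row so the rest of the batch still commits; with one transaction, InnoDB only flushes its log at the final COMMIT instead of after every row.

<?php

// Prepare the statement once, outside the loop.
$stmt = $conn->prepare(
    "INSERT INTO bulk_sms_list (fname, lname, pn, message, send) VALUES (?, ?, ?, 'EMPTY', 1)"
);

$conn->beginTransaction();
try {
    foreach ($final_array as $item) {
        try {
            $stmt->execute($item);
        } catch (PDOException $e) {
            // A duplicate pn (error 1062) fails only this statement;
            // the transaction stays open and the loop continues.
        }
    }
    $conn->commit(); // one log flush for the whole batch
} catch (PDOException $e) {
    $conn->rollBack();
    throw $e;
}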
1 Answer


The fastest solution should be to use LOAD DATA LOCAL INFILE. Since you answered in a comment that duplicate phone numbers should result in skipping a row, you can use the IGNORE option.

Load directly from the file, instead of processing it with PHP. You can do some of your transformations in the LOAD DATA statement.

For example:

LOAD DATA INFILE ? IGNORE INTO TABLE bulk_sms_list
FIELDS TERMINATED BY ','
(@fname, @lname, @mobile)
SET fname = REGEXP_REPLACE(@fname, '[^A-Za-z0-9]', ''),
    lname = REGEXP_REPLACE(@lname, '[^A-Za-z0-9]', ''),
    pn = IF(LENGTH(@mobile) BETWEEN 9 AND 10, @mobile, NULL),
    message = 'EMPTY', 
    send = 1;

Then follow the import with some cleanup to get rid of any rows with invalid phone numbers:

DELETE FROM bulk_sms_list WHERE pn IS NULL;

Read https://dev.mysql.com/doc/refman/8.0/en/load-data.html for more information.

Bill Karwin
  • The file is located on a separate machine from the VM. Will it still work then? – Weston Simon Oct 18 '22 at 15:27
  • @WestonSimon, read about the LOCAL option of the statement. It allows the client to read the file on the machine where, e.g., the PHP code runs. But you need to enable local file import when you connect; see https://stackoverflow.com/questions/13454370/pdo-and-load-data-local-infile-not-working for example. (A PDO sketch of this follows these comments.) – Bill Karwin Oct 18 '22 at 15:29
  • Once you have converted to `LOAD DATA`, you may not need any Python or PHP code. (Certainly not for reading the rows.) – Rick James Oct 18 '22 at 18:46
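A minimal PDO sketch of the LOAD DATA LOCAL INFILE approach from the answer, assuming the uploaded CSV sits on the machine running PHP and that local_infile is permitted on the MySQL server; the connection details and $_FILES handling mirror the original script, and the statement mirrors the answer's example.

<?php

// MYSQL_ATTR_LOCAL_INFILE lets the client send a local file to the server;
// local_infile must also be enabled on the MySQL server itself.
$conn = new PDO(
    "mysql:host=$servername;dbname=test",
    $username,
    $password,
    [PDO::MYSQL_ATTR_LOCAL_INFILE => true]
);
$conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$path = $_FILES['fileToUpload']['tmp_name'];

// LOAD DATA is not supported by server-side prepared statements,
// so the quoted file name is interpolated rather than bound as a parameter.
$sql = "LOAD DATA LOCAL INFILE " . $conn->quote($path) . " IGNORE INTO TABLE bulk_sms_list
    FIELDS TERMINATED BY ','
    (@fname, @lname, @mobile)
    SET fname = REGEXP_REPLACE(@fname, '[^A-Za-z0-9]', ''),
        lname = REGEXP_REPLACE(@lname, '[^A-Za-z0-9]', ''),
        pn = IF(LENGTH(@mobile) BETWEEN 9 AND 10, @mobile, NULL),
        message = 'EMPTY',
        send = 1";
$conn->exec($sql);

// pn is declared NOT NULL, so with IGNORE a NULL from the IF() is stored as
// an empty string; clean up both cases.
$conn->exec("DELETE FROM bulk_sms_list WHERE pn IS NULL OR pn = ''");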