I have a very large (2.7 MB) XML file with the following structure:
<?xml version="1.0"?>
<Destinations>
<Destination>
<DestinationId>W4R1FG</DestinationId>
<Country>Pakistan</Country>
<City>Karachi</City>
<State>Sindh</State>
</Destination>
<Destination>
<DestinationId>D2C2FV</DestinationId>
<Country>Turkey</Country>
<City>Istanbul</City>
<State>Istanbul</State>
</Destination>
<Destination>
<DestinationId>5TFV3E</DestinationId>
<Country>Canada</Country>
<City>Toronto</City>
<State>Ontario</State>
</Destination>
... ... ...
</Destinations>
And a MySQL table "destinations" like this:
+---+--------------+----------+---------+----------+
|id |DestinationId |Country |City |State |
+---+--------------+----------+---------+----------+
|1 |W4R1FG |Pakistan |Karachi |Sindh |
+---+--------------+----------+---------+----------+
|2 |D2C2FV |Turkey |Istanbul |Istanbul |
+---+--------------+----------+---------+----------+
|3 |5TFV3E |Canada |Toronto |Ontario |
+---+--------------+----------+---------+----------+
|. |...... |...... |....... |....... |
+---+--------------+----------+---------+----------+
Now I want to process my XML and check each destination record against the MySQL table. I only have to compare the DestinationId of each record and check whether it already exists in my DB table. If it does exist, skip that record and move on; if it doesn't, execute an INSERT query to insert that record into the table.
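For any duplicate-skipping insert (including the INSERT IGNORE used below) to detect duplicates by DestinationId, that column needs a UNIQUE index. A minimal sketch, assuming the table doesn't already have one (the index name here is made up):
-- Without a UNIQUE index on DestinationId, MySQL cannot detect the duplicate
-- and INSERT IGNORE would insert the same DestinationId twice.
ALTER TABLE destinations
    ADD UNIQUE INDEX ux_destinations_destination_id (DestinationId);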
I first tried to accomplish this with a PHP foreach loop, but because the data is so large it caused me serious performance and speed issues. I then came up with a MySQL stored procedure approach like this:
DELIMITER $$
USE `destinations`$$
DROP PROCEDURE IF EXISTS `p_import_destinations`$$
CREATE DEFINER=`root`@`localhost` PROCEDURE `p_import_destinations`(
    p_xml TEXT
)
BEGIN
    DECLARE v_row_index INT UNSIGNED DEFAULT 0;
    DECLARE v_row_count INT UNSIGNED;
    DECLARE v_xpath_row VARCHAR(255);

    -- calculate the number of row elements.
    SET v_row_count := extractValue(p_xml, 'count(/Destinations/Destination)');

    -- loop through all the row elements
    WHILE v_row_index < v_row_count DO
        SET v_row_index := v_row_index + 1;
        SET v_xpath_row := CONCAT('/Destinations/Destination[', v_row_index, ']');
        INSERT IGNORE INTO destinations VALUES (
            NULL, -- let the auto-increment id assign itself
            extractValue(p_xml, CONCAT(v_xpath_row, '/child::DestinationId')),
            extractValue(p_xml, CONCAT(v_xpath_row, '/child::Country')),
            extractValue(p_xml, CONCAT(v_xpath_row, '/child::City')),
            extractValue(p_xml, CONCAT(v_xpath_row, '/child::State'))
        );
    END WHILE;
END$$
DELIMITER ;
Query to call this procedure:
SET @xml := LOAD_FILE('C:/Users/Muhammad Ali/Desktop/dest.xml');
CALL p_import_destinations(@xml);
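One caveat worth checking (this is an assumption about your server settings, not something shown above): LOAD_FILE() returns NULL instead of raising an error when the file is not readable by the MySQL server, is larger than max_allowed_packet, or lies outside the secure_file_priv directory. A quick sanity-check sketch before calling the procedure:
SHOW VARIABLES LIKE 'secure_file_priv';   -- if set, the XML file must live under this path
SHOW VARIABLES LIKE 'max_allowed_packet'; -- must be larger than the XML file (2.7 MB here)
SELECT @xml IS NULL;                      -- 1 means LOAD_FILE() silently failed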
This worked perfectly, but I am still not sure about this approach's scalability, performance and speed. Also, the IGNORE clause used in this procedure skips duplicate records but still consumes the auto-increment value. For example, if it is checking a row while the id counter stands at 3306 and that record is a duplicate, it will not insert it into the table (which is a good thing) but will still take the auto-increment key 3307, and the next NON-DUPLICATE record it inserts will get id 3308. This doesn't seem good.
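One way to avoid those gaps (a sketch only, assuming the UNIQUE index on DestinationId mentioned above, and introducing a new variable v_destination_id that would need its own DECLARE at the top of the procedure) is to replace the INSERT IGNORE inside the loop with an INSERT ... SELECT that only produces a row when the id is not already present, so no auto-increment value is allocated for duplicates:
-- inside the WHILE loop, in place of the INSERT IGNORE statement
SET v_destination_id := extractValue(p_xml, CONCAT(v_xpath_row, '/child::DestinationId'));
INSERT INTO destinations (DestinationId, Country, City, State)
SELECT
    v_destination_id,
    extractValue(p_xml, CONCAT(v_xpath_row, '/child::Country')),
    extractValue(p_xml, CONCAT(v_xpath_row, '/child::City')),
    extractValue(p_xml, CONCAT(v_xpath_row, '/child::State'))
FROM DUAL
WHERE NOT EXISTS (
    SELECT 1 FROM destinations WHERE DestinationId = v_destination_id
);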
Any other approach to meet this requirement would be much appreciated. And please guide me on whether I am OK to go on with this solution; if not, why not?
Just remember, I am dealing with a very large amount of data.