0

I'm trying to find the most efficient code to catch duplicate file names in a parent folder. I'm storing file tree in a MySQL database using parent-child model with lineage:

CREATE TABLE `filetable` (
    `id` int(11) NOT NULL auto_increment,
    `name` varchar(50) NOT NULL,
    `parent_id` int(11) NOT NULL,
    `path_num` varchar(255) NOT NULL,
    `path_string` text() NOT NULL,  
    PRIMARY KEY  (`id`)
) ENGINE=InnoDB;

I am considering two methods:
1. Run a query:

INSERT INTO filetable (name, parent_id, path_num, path_string) 
    SELECT '$name','$pid','$path_num','$path_string' FROM DUAL
        WHERE NOT EXISTS
            (SELECT name FROM filetable WHERE name='$name');

and in PHP:

if (mysql_affected_rows() === 0) takeAction($name);

PROS: simple sql, no need to add rows to table
CONS: double query that can get snow on large table

2.Creat unique index on a field and use:

INSERT INTO filetable (name, parent_id, path_num, path_string) 
    VALUES ('$name','$pid','$path_num','$path_string')
    ON DUPLICATE KEY UPDATE 0=0;

and in PHP:

if (mysql_affected_rows() === 2) takeAction($name); // Yes 2 for updates

PROS: UPDATE triggered on a small fraction of queries when name is duplicated
CONS: an extra row in a table

The field that is unique is a path_string, a full path to the file. In my file system it can get very long, therefore a TEXT type of field. In MySQL one can not create unique index on a TEXT field. So what I am thinking to do is to create column

`problemsolver` varchar(62) = $parent_id . '_'.$name 

and crete unique index on it.

I am expecting those queries to constitute 10% of all the queries to database.

My question is which of those two methods would you use and why? Or any better options available?

Peeech
  • 639
  • 1
  • 7
  • 15
  • How about a UNIQUE index s.t. your INSERT will simply fail and then you check the result? – Ken Cheung Sep 06 '13 at 04:26
  • I'd say add a UNIQUE key to the unique fields, then use `INSERT IGNORE` - saves the fraction of a second `INSERT ... ON DUPLICATE` needs and you're not really looking for error catching anyways it seems (not for the sake of re-doing the query, at least). Also relevant: http://stackoverflow.com/questions/548541/insert-ignore-vs-insert-on-duplicate-key-update – DACrosby Sep 06 '13 at 04:31

1 Answers1

1

Both methods would work. Method 1 is not ideal because your query is more intensive and it's performed every time which is, as you said, not needed everytime.

Method two is preferable because you only do extra work when required, and that is more efficient.

Furthermore, to lessen the likely hood of clashing values, you can integrate a timestamp or some random data in the value stored.

Tucker
  • 7,017
  • 9
  • 37
  • 55