
I am crawling a website and storing the results in SQLite. Since my crawling may repeat every month, I don't want to store duplicate results in my database. What should I do?

For example, the first time I will run:

insert into tableName(A, B, C, D, E) values(a,b,c,d,e) 

Then my database will have this record. The next time I crawl the website I will do the same thing, but I do not want to store duplicate results. What is the best way to do this?

Also, what should I do if I am using MySQL?

1a1a11a

3 Answers

You can use an insert-ignore statement:

SQLite

insert or ignore into tableName(A, B, C, D, E) values(a,b,c,d,e) 

MySQL

insert ignore into tableName(A, B, C, D, E) values(a,b,c,d,e) 

Be careful: the only difference between the two statements is the or keyword, but they are not interchangeable between SQLite and MySQL.

This will only work if the table has some kind of primary key or unique constraint.
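
For example, a minimal SQLite sketch, assuming (purely for illustration) that column A alone identifies a crawled record -- adjust the key to whatever really identifies a row in your data:

-- assumption: A alone identifies a crawled record
CREATE TABLE tableName (A INTEGER PRIMARY KEY, B INTEGER, C INTEGER, D INTEGER, E INTEGER);

INSERT OR IGNORE INTO tableName (A, B, C, D, E) VALUES (1, 2, 3, 4, 5);
-- the second insert is silently skipped because a row with A = 1 already exists
INSERT OR IGNORE INTO tableName (A, B, C, D, E) VALUES (1, 2, 3, 4, 5);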

Using sqlite, create your table with a UNIQUE constraint:

CREATE TABLE tableName (A INTEGER, B INTEGER, C INTEGER, D INTEGER, E INTEGER,
UNIQUE (A, B, C, D, E));

then

INSERT OR IGNORE INTO tableName (A, B, C, D, E) values (a,b,c,d,e);

will only insert a new record if (a,b,c,d,e) is unique -- i.e. not already present in the table tableName.
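
For instance, a quick sketch against the table above, with literal values standing in for a..e:

INSERT OR IGNORE INTO tableName (A, B, C, D, E) VALUES (1, 2, 3, 4, 5);
-- ignored: an identical row is already present
INSERT OR IGNORE INTO tableName (A, B, C, D, E) VALUES (1, 2, 3, 4, 5);
-- returns 1, not 2
SELECT COUNT(*) FROM tableName;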

To add a UNIQUE index to an already existent table:

CREATE UNIQUE INDEX unique_idx on tableName(A, B, C, D, E);

or, using MySQL, create a table with a unique index:

CREATE TABLE tableName (A INT, B INT, C INT, D INT, E INT,
UNIQUE KEY unique_idx (A, B, C, D, E))

then use INSERT IGNORE:

INSERT IGNORE INTO tableName (A, B, C, D, E) values (a,b,c,d,e) 

To add a UNIQUE KEY index to an already existent table:

ALTER TABLE tableName ADD UNIQUE KEY unique_idx (A, B, C, D, E); 
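
As a rough sketch of what you would see, MySQL skips the duplicate and reports it as a warning rather than an error (the exact message text may vary by version):

INSERT IGNORE INTO tableName (A, B, C, D, E) VALUES (1, 2, 3, 4, 5);
INSERT IGNORE INTO tableName (A, B, C, D, E) VALUES (1, 2, 3, 4, 5);
-- shows something like: Duplicate entry '1-2-3-4-5' for key 'unique_idx'
SHOW WARNINGS;
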
unutbu

You can create a temporary/auxiliary table, say temp_tableName, having the same schema as tableName. Then, while crawling, perform something like:

insert into temp_tableName select * from tableName;

delete from tableName;

insert into tableName(A, B, C, D, E) values(a,b,c,d,e)  

The auxiliary table is needed because, if the crawling insert fails, you have a backup to restore from. Also, consider doing these DELETE/INSERT steps in a transaction block for atomicity, as sketched below.
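
A rough sketch of that, reusing the same statements wrapped in a transaction (BEGIN/COMMIT works in SQLite; MySQL also accepts START TRANSACTION):

BEGIN;
insert into temp_tableName select * from tableName;
delete from tableName;
insert into tableName(A, B, C, D, E) values(a,b,c,d,e);
COMMIT;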

Rahul
  • This helps, but is there any way that I can just drop the duplicate if I find it is already in the table? My initial thought was to do select * from tableName where A=a AND B=b AND C=c AND D=d AND E=e and check whether such a record exists (a sketch of this check appears below the comments). Is that a good way to do it? – 1a1a11a Jun 07 '15 at 18:53
  • Depends. If you are performing row-by-row inserts then probably yes, but it would be a bit costly. Since the insert happens from a crawler, I believe it would be a bulk insert; I answered keeping that in mind. – Rahul Jun 07 '15 at 18:54
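
For reference, a minimal sketch of the check-then-insert idea from the first comment, with literal values standing in for a..e (SQLite syntax; MySQL would need FROM DUAL after the SELECT). Note that with a UNIQUE constraint in place, INSERT OR IGNORE does the same thing in one statement:

INSERT INTO tableName (A, B, C, D, E)
SELECT 1, 2, 3, 4, 5
WHERE NOT EXISTS (
    SELECT 1 FROM tableName
    WHERE A = 1 AND B = 2 AND C = 3 AND D = 4 AND E = 5
);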