
I need to transfer data (the columns I need are n.inchi, n.info, n.accessions, n.chebiid) from one table to another. Both tables have over 20,000 rows (heroku_chemical_chemical, the table I'm transferring data to, has 50,000).

I tried this query:

UPDATE heroku_chemical_chemical AS h, new_compounds_filtered AS n
SET 
h.inchi_identifier=n.inchi,
h.info=n.info,
h.accessions=n.accessions,
h.chebi_id=n.chebiid
WHERE h.name = n.name 
AND (h.inchi_identifier = '' OR h.inchi_identifier IS NULL); 

So basically, a chemical name that appears in table 1 (heroku_chemical_chemical) might also be present in table 2, and if it is I need to pull the data from table 2 for that chemical. If h.inchi_identifier already has data in it, then I know that chemical is done. The problem is that the query takes incredibly long to execute: I left it running overnight, but the next morning it still hadn't completed, so I had to cancel it. Since it's a single transaction, nothing was transferred. If it instead committed the transfer bit by bit, it would work.

I can add this to the query:

AND n.id BETWEEN 1 AND 500

to limit the query to 500 rows (from the second table, the one I'm transferring data from) and do it in small pieces, but then I have to keep manually rerunning the query with different values in the BETWEEN clause. It's also really slow. I'd prefer to do this in pure SQL rather than setting up a PHP script.

Is there a way to do this where it inserts data as it goes, rather than waiting until the whole transaction is complete? Also, is there a way to do this more rapidly?
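(For reference, the kind of pure-SQL batching I'm imagining would be a stored procedure along these lines — a sketch only; the procedure name is made up and I haven't verified it against my data. With autocommit on, each chunked UPDATE commits on its own, so finished chunks survive even if a later one fails.)

```sql
DELIMITER $$
CREATE PROCEDURE batch_transfer()
BEGIN
  DECLARE max_id INT DEFAULT 0;
  DECLARE batch_start INT DEFAULT 1;
  SELECT MAX(id) INTO max_id FROM new_compounds_filtered;
  WHILE batch_start <= max_id DO
    -- each UPDATE is its own transaction under autocommit,
    -- so completed chunks stay committed
    UPDATE heroku_chemical_chemical AS h
      JOIN new_compounds_filtered AS n ON h.name = n.name
       SET h.inchi_identifier = n.inchi,
           h.info             = n.info,
           h.accessions       = n.accessions,
           h.chebi_id         = n.chebiid
     WHERE (h.inchi_identifier = '' OR h.inchi_identifier IS NULL)
       AND n.id BETWEEN batch_start AND batch_start + 499;
    SET batch_start = batch_start + 500;
  END WHILE;
END$$
DELIMITER ;

CALL batch_transfer();
```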

Here are the table definitions:

heroku_chemical_chemical

CREATE TABLE `heroku_chemical_chemical` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` text COLLATE utf8mb4_unicode_ci,
  `synonyms` text COLLATE utf8mb4_unicode_ci,
  `associated_from` text COLLATE utf8mb4_unicode_ci,
  `category_associated_from` text COLLATE utf8mb4_unicode_ci,
  `chemical_number` text COLLATE utf8mb4_unicode_ci,
  `parent_chemical_numbers` text COLLATE utf8mb4_unicode_ci,
  `category_id` text COLLATE utf8mb4_unicode_ci,
  `slug` text COLLATE utf8mb4_unicode_ci,
  `cas_rn` text COLLATE utf8mb4_unicode_ci,
  `definition` text COLLATE utf8mb4_unicode_ci,
  `drug_bank_ids` text COLLATE utf8mb4_unicode_ci,
  `foodb_id` text COLLATE utf8mb4_unicode_ci,
  `itis_id` text COLLATE utf8mb4_unicode_ci,
  `name_scientific` text COLLATE utf8mb4_unicode_ci,
  `picture_content_type` text COLLATE utf8mb4_unicode_ci,
  `picture_file_name` text COLLATE utf8mb4_unicode_ci,
  `picture_file_size` text COLLATE utf8mb4_unicode_ci,
  `wikipedia_id` text COLLATE utf8mb4_unicode_ci,
  `actor_id` text COLLATE utf8mb4_unicode_ci,
  `bio_cyc_id` text COLLATE utf8mb4_unicode_ci,
  `chebi_id` text COLLATE utf8mb4_unicode_ci,
  `chem_spider_id` text COLLATE utf8mb4_unicode_ci,
  `chembl_id` text COLLATE utf8mb4_unicode_ci,
  `ctd_id` text COLLATE utf8mb4_unicode_ci,
  `hmdb_id` text COLLATE utf8mb4_unicode_ci,
  `inchi_identifier` text COLLATE utf8mb4_unicode_ci,
  `inchi_key` text COLLATE utf8mb4_unicode_ci,
  `kegg_compound_id` text COLLATE utf8mb4_unicode_ci,
  `omim_id` text COLLATE utf8mb4_unicode_ci,
  `pdb_id` text COLLATE utf8mb4_unicode_ci,
  `pubchem_compound_id` text COLLATE utf8mb4_unicode_ci,
  `stitch_di` text COLLATE utf8mb4_unicode_ci,
  `t3db_id` text COLLATE utf8mb4_unicode_ci,
  `uni_prot_id` text COLLATE utf8mb4_unicode_ci,
  `iupac_name` text COLLATE utf8mb4_unicode_ci,
  `formula` text COLLATE utf8mb4_unicode_ci,
  `smiles` text COLLATE utf8mb4_unicode_ci,
  `chemspider_id` text COLLATE utf8mb4_unicode_ci,
  `molecular_weight` text COLLATE utf8mb4_unicode_ci,
  `accessions` text COLLATE utf8mb4_unicode_ci,
  `chebi_info` text COLLATE utf8mb4_unicode_ci,
  `chebi_name` text COLLATE utf8mb4_unicode_ci,
  `compound_type` text COLLATE utf8mb4_unicode_ci,
  PRIMARY KEY (`id`)
) 
ENGINE=InnoDB 
AUTO_INCREMENT=379336 
DEFAULT CHARSET=utf8mb4 
COLLATE=utf8mb4_unicode_ci;

new_compounds_filtered

CREATE TABLE `new_compounds_filtered` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` text COLLATE utf8mb4_unicode_ci,
  `chebiid` text COLLATE utf8mb4_unicode_ci,
  `info` text COLLATE utf8mb4_unicode_ci,
  `smiles` text COLLATE utf8mb4_unicode_ci,
  `inchi` text COLLATE utf8mb4_unicode_ci,
  `inchikey` text COLLATE utf8mb4_unicode_ci,
  `parent_id` text COLLATE utf8mb4_unicode_ci,
  `accessions` text COLLATE utf8mb4_unicode_ci,
  `synonyms` text COLLATE utf8mb4_unicode_ci,
  PRIMARY KEY (`id`)
) 
ENGINE=InnoDB 
AUTO_INCREMENT=85432 
DEFAULT CHARSET=utf8mb4 
COLLATE=utf8mb4_unicode_ci;

UPDATE:

So I'm running a new query which needs to load the ID (which is indexed) into a column in a relationship table.

UPDATE chemical_organism_relations AS O2, heroku_chemical_chemical AS H1 
SET O2.compound_id = H1.id, O2.substance_type = 'compound'
WHERE O2.foodb_compound_id = H1.foodb_id;

Again the same problem: the query doesn't seem to complete. I can't put a unique index on H1.foodb_id because only about 5% of the chemicals in the table have a FooDB ID. O2.compound_id is unindexed because all of its values are blank right now; I can't put a unique index on it unless I insert a temporary unique number into each row.

Here's what's in the performance_schema.events_statements_current table (screenshot omitted).

I've been looking for a way to track the query's progress as it runs. One thing I'm wondering is what the columns in that table mean (screenshot omitted).

Is there a way to see what's going on behind the scenes? If I could see what the server is doing and how much progress it has made on the query, I would know whether it will ever complete; right now I don't know whether it has crashed or what's happening.

I just ran a really simple query:

mysql> UPDATE chemical_organism_relations SET substance_type = 'compound';
Query OK, 740672 rows affected (1 min 2.95 sec)
Rows matched: 740672  Changed: 740672  Warnings: 0

Over 740,000 rows, but a minute to process a simple update query involving no other tables.

Big Problem With My Query

Someone commented on an answer, before I merged it with the main thread, saying that because:

O2.compound_id is unindexed because all of its fields are blank

I have a big problem with my query. I accidentally posted the wrong query, but I've posted the right one now. Is there still a big problem with it?

Horse O'Houlihan
  • Only 50,000 rows? This shouldn't take that long. Seconds, or maybe minutes, max. It sounds like h.inchi_identifier needs an index and n.name needs an index. Are those present? If not, your server is wasting a *lot* of time on rows it will not need to modify and wasting even more time doing the match-up between tables. `SHOW CREATE TABLE ...` both tables, please. – Michael - sqlbot Dec 03 '16 at 21:56
  • No, they have no indexes. I can't create one because there are going to be duplicates. Would an index speed things up greatly? I can add an index to the output table (new_compounds_filtered) but not the input table. I'll add the table definitions now. – Horse O'Houlihan Dec 04 '16 at 15:09
  • 50K rows is really nothing; you shouldn't have any problem before hitting 1M rows. Yes, an index will definitely speed up your query. – Blag Dec 04 '16 at 15:35
  • True, 50K rows is nothing, but matching 50K rows against 20K rows without indexes means 100,000,000 combinations must potentially be considered to satisfy this query, because you're effectively asking the server to do this: `for x (1 .. 50000) { for y (1 .. 20000) { compare x == y } }`. In reality it could be less, but this is illustrative of the reason the query is so slow. – Michael - sqlbot Dec 04 '16 at 16:44
  • Question, is data in `heroku_chemical_chemical`.`name` unique ? (as you make an update without any other field in where, I hope it is, but just in case) – Blag Dec 04 '16 at 17:27
  • Ahh, now I see the importance of indexes. I always have a primary unique index of course, but I need indexes on all of these unique fields that I use to query. – Horse O'Houlihan Dec 05 '16 at 16:23
  • "can't put unique@foodb_id only 5% have a FooDB ID" => [ALTER TABLE column allow null](http://stackoverflow.com/questions/10597982/alter-sql-table-allow-null-column-value); update empty to null; add a Unique as null is allowed /;/ it's now ok for the compound_id part, adding an index now will just slow it (as you'll have to update every line, better add it after) /;/ 62s for 700'000 row (not 70k: 0.7M, 10x) full update is not bad with InnoDB – Blag Dec 05 '16 at 18:30

1 Answer


As said, 50k rows is really nothing: things get serious after 1M, and you need to think a bit if you want to go over 10M without problems...

That apart, as you're doing a single big transaction, you'll need to be a bit careful: InnoDB is not really configured by default for this and will hit some RAM and disk I/O bottlenecks (take a look at "MySQL transaction size - how big is too big?"; if you have to do this regularly, some DB tuning could be a good idea).
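For illustration, the knobs that matter most for big InnoDB transactions can be sketched in my.cnf like this (the values are placeholders, not recommendations; size them to your machine):

```ini
# Illustrative my.cnf fragment -- values are placeholders
[mysqld]
innodb_buffer_pool_size = 2G     # cache for data and index pages
innodb_log_file_size    = 512M   # a bigger redo log smooths large writes
```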


Anyway, let's start with some basics:

WHERE 
    h.name = n.name 
    AND (
        h.inchi_identifier = '' 
        OR h.inchi_identifier IS NULL
    ); 

Every field after a WHERE in a big query SHOULD have an index. If you can't put one there, then your DB schema is bad (like probably 99.99% of the time; if you're in the 0.01%, you already know why).

What index, you ask? Well, you have some choices; I'll only explain the most useful ones:

PRIMARY KEY

Yes, this is the first and most important. You can only have one per table, and it's also a UNIQUE constraint. If you have an identifier field, use it as your PK; there's no need to add a dedicated id int auto_increment one.

UNIQUE INDEX

If you already have a PK on your table but have a unique id in another field, a UNIQUE INDEX can give you quick reads/updates by that field (but it slows down inserts, as uniqueness has to be checked).

INDEX, the multi-pass

"But I already have a PK, and my field is not unique, so no index?"

It's usually the case in most DBs: you have a field for a foreign key (1-N), you'll do many JOINs and WHEREs on it, but you can't use either of the previous index types. Then comes the magic:

A plain INDEX lets you speed up queries even if there are duplicates or NULLs in the field.

It'll slow INSERTs a bit, but you'll gain on SELECT, UPDATE, DELETE, WHERE, and JOIN with this field.

Composite index

It's more a way of building your index. This is more complex than the previous ones, as you really need to understand how your query works if you want it to be used.

[in writing]
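As a quick sketch of what a composite index could look like for this query (prefix lengths are assumptions, needed because both columns are TEXT; the index name is made up):

```sql
-- Sketch only: lets the server filter on the inchi_identifier prefix
-- and match names from the same index. Prefix lengths are assumptions.
CREATE INDEX idx_hcc_inchi_name
    ON heroku_chemical_chemical (inchi_identifier(1), name(100));
```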


Back to your query, this is probably what you should do (keep in mind it can slow some operations, and will probably take some time to build):

    -- speed up the NULL / '' detection
CREATE INDEX idx_hcc_inchi_id_1char 
    ON heroku_chemical_chemical (inchi_identifier(1));

    -- uniqueness of name (TEXT columns need a prefix length,
    -- and 191 utf8mb4 chars fits the 767-byte key limit)
CREATE UNIQUE INDEX idx_hcc_name_u 
    ON heroku_chemical_chemical (name(191));
CREATE UNIQUE INDEX idx_ncf_name_u 
    ON new_compounds_filtered (name(191));
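Once the indexes exist, you can sanity-check that MySQL actually uses them with EXPLAIN (supported for UPDATE since MySQL 5.6); for example:

```sql
-- If the indexes are picked up, the "rows" column should be far below
-- the 50,000 x 20,000 cross product of an unindexed join.
EXPLAIN UPDATE heroku_chemical_chemical AS h
   JOIN new_compounds_filtered AS n ON h.name = n.name
    SET h.inchi_identifier = n.inchi
  WHERE h.inchi_identifier = '' OR h.inchi_identifier IS NULL;
```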

For the second part of your question:

"I can't put a UNIQUE INDEX on foodb_id as only 5% of the rows have a FooDB_ID"

Not a problem:

  • ALTER TABLE to allow NULL in the column (already OK if I look at your schema)
  • update empty strings '' to NULL: UPDATE table SET column=NULL WHERE column='';
  • add a UNIQUE INDEX on the column, since NULL is allowed in it
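Concretely, for foodb_id that could look like the following sketch (the prefix length of 64 is an assumption, required because foodb_id is a TEXT column; MySQL allows multiple NULLs under a UNIQUE index, so the 95% of rows without a FooDB ID are fine):

```sql
-- Turn blanks into NULLs, then add a unique index over a prefix.
UPDATE heroku_chemical_chemical
   SET foodb_id = NULL
 WHERE foodb_id = '';

CREATE UNIQUE INDEX idx_hcc_foodb_id_u
    ON heroku_chemical_chemical (foodb_id(64));
```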
Blag
  • Thanks! I understand the importance of indexes a whole lot better now. The job I'm doing right now is setting up a database by merging together some tables; the database is messy, full of duplicates and invalid data, so I'm fixing it up. That's why I can't put indexes on these columns I'm querying on: there are loads of duplicates. There's no way around this right now, unfortunately. I'll post an update in an answer. – Horse O'Houlihan Dec 05 '16 at 17:24
  • @SelfDecodeSupport even if you have some duplicate, you can put a simple `index` instead of a `unique index`, it'll be a whole new world for your querying ;) ; BTW you can work to clean the old data before the insert, like [querying every duplicates](http://stackoverflow.com/questions/2594829/finding-duplicate-values-in-a-sql-table) and asking the good person in your company what id you should delete/update – Blag Dec 05 '16 at 17:40
  • @SelfDecodeSupport you can export the result of the duplicates query in a CSV file, send it to the good guy, ask to edit it with excel, like : "remove all the data apart id for removing line and update data data directly in the field for the good one". You load this CSV as a table, and make an update with full base, then delete WHERE field name IS NULL ; really easy and way more efficient than some guessing/tweaking – Blag Dec 05 '16 at 17:44
  • Probably also worth covering the benefit of a compound index on `(inchi_identifier, name)` – MatBailie Dec 05 '16 at 19:39
  • I didn't realise that the columns don't have to be unique to put an index on them. Everything's running much faster now that I've put indexes on all these identifier columns. – Horse O'Houlihan Dec 05 '16 at 19:54
  • @MatBailie yes, I was going to put something about compound, but my bed was stronger than me yesterday, and as there is no real use example of a compound key on this case I was wondering how to introduce it... but I'll take some time for it now ;) – Blag Dec 05 '16 at 19:54
  • @SelfDecodeSupport sorry, it wasn't clear enough in my post, I'll add it in bold; yes, that's the real point of indexes: you can use them nearly as you want, and for a few MB of storage they speed up everything ;) – Blag Dec 05 '16 at 19:59
  • Yeah, it's crazy I only learned this now; I would have saved so much time in the past. My god, the number of scenarios where I could have added an index to a field. – Horse O'Houlihan Dec 05 '16 at 22:38
  • @MatBailie There are loads of ways I can use compound keys here, because the table is full of unique identifiers which are mapped to each other. – Horse O'Houlihan Dec 05 '16 at 22:40