
I'm normalizing a database and need to run ~630k updates. Here is the basic structure of my tables:

domains

  • id
  • name

stats

  • domain
  • domain_id

Before, the database didn't have a domains table and the domains were stored in multiple tables, sometimes as lists (JSON text). I migrated every domain to the domains table and now I need to establish the relation to the stats table, which has the domain column. I added the domain_id column and tried to update it to match the id of the domain in the domains table. The stats table has over 23m rows with ~630k unique domains (the stats are hourly). I tried running a foreach, but it takes about 2 seconds per domain, which adds up to about 14 days to run all of them.

Here is my code so far:

First, I find all the domains from the stats table that are missing from the domains table and save them in the domains table.

$statDomains = Stat::select('domain')->groupBy('domain')->lists('domain');
$domains = [];
foreach(array_chunk($statDomains, 1000) as $domains1k){
    $domains = array_merge($domains, Domain::whereIn('name', $domains1k)->lists('name'));
}
$missingDomains = [];
foreach(array_diff($statDomains, $domains) as $missingDomain){
    $missingDomains[] = ['name' => $missingDomain];
}

if(!empty($missingDomains)){
    Domain::insert($missingDomains);
}

Next, I get all the domains from the domains table that are present in the stats table, and for each of them I update all rows in the stats table that have that domain.

$domains = [];
foreach(array_chunk($statDomains, 1000) as $domains1k){
    $domains += Domain::whereIn('name', $domains1k)->lists('name', 'id');
}
foreach($domains as $key => $domain){
    Stat::where('domain', $domain)->update(['domain_id' => $key]);
}

I would appreciate something in Eloquent, the query builder, or just raw SQL that will do the updating faster (up to two hours, ideally). I did some googling and found a similar question, but I was unable to apply it to my case.

EDIT

I'm running the suggested solutions right now. In the meantime I found that two more parts of my migration are taking about 50 minutes. In the first one I have a table domain_lists. It has a text column domains with JSON-encoded domains. I'm moving those domains to the domains table and creating records in the domain_lists_domains_map table. Here's the code:

foreach(DomainList::all() as $domainList){
    $attach = [];
    $domains = json_decode($domainList->domains, true);
    foreach($domains as $domain){
        $model = Domain::where('name', '=', $domain)->first();
        if(is_null($model) && !is_null($domain)){
            $model = new Domain();
            $model->name = $domain;
            $model->save();
        }
        if(!is_null($model)){
            $attach[] = $model->id;
        }
    }
    if(!empty($attach)){
        foreach(array_chunk(array_unique($attach), 1000) as $attach1k){
            $domainList->domains()->attach($attach1k);
        }
    }
}

I already noticed that I should probably find all the unique domains and insert them into the domains table first, but given the solution to the previous question I feel that there might be a much better way of doing all this in just raw SQL. The second part is very similar and I can probably figure out how to solve it by looking at the code of the first one. The table is categories and it also has a domains text column with JSON-encoded domains. Any help greatly appreciated.
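For reference, here is a rough raw-SQL sketch of that first part. It is untested and makes several assumptions: PostgreSQL 9.4+ (for json_array_elements_text), an id primary key on domain_lists, and domain_list_id/domain_id columns on domain_lists_domains_map.

-- 1. Insert domains from the JSON lists that are not in domains yet.
INSERT INTO domains (name)
SELECT DISTINCT x.domain
FROM   domain_lists dl
CROSS  JOIN LATERAL json_array_elements_text(dl.domains::json) AS x(domain)
LEFT   JOIN domains d ON d.name = x.domain
WHERE  d.name IS NULL
AND    x.domain IS NOT NULL;

-- 2. Create the map records, one per (list, domain) pair.
INSERT INTO domain_lists_domains_map (domain_list_id, domain_id)
SELECT DISTINCT dl.id, d.id
FROM   domain_lists dl
CROSS  JOIN LATERAL json_array_elements_text(dl.domains::json) AS x(domain)
JOIN   domains d ON d.name = x.domain;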

EDIT 2

Here's the query I ran to copy the existing table to a new one with the domain_id column populated:

CREATE TABLE "stats_new" AS SELECT
    "s"."domain",
    "d"."id" AS "domain_id"
FROM
    "stats" "s"
JOIN "domains" "d" ON ("s"."domain" = "d"."name")
Pawel Bieszczad

3 Answers


Forget PHP in favour of raw SQL: processing records in loops and executing many separate statements is what makes it slow. Instead, run the following query directly in the database:

update stats s set domain_id=(select d.id from domains d where d.name=s.domain);
Radek Postołowicz

Raw SQL should be faster by several orders of magnitude.

Step 1: INSERT

Insert all domain names into table domains unless they are already there:

INSERT INTO domains (name)
SELECT DISTINCT s.domain
FROM   stats s
LEFT   JOIN domains d ON d.name = s.domain
WHERE  d.name IS NULL;

There is a potential race condition if you have concurrent write access. The simplest solution would be to lock the table domains exclusively for the transaction. Otherwise you might run into a unique violation halfway into the operation because a concurrent transaction committed the same domain name in between, and everything would be rolled back.

BEGIN;
LOCK TABLE domains IN EXCLUSIVE MODE;

INSERT INTO domains (name)
SELECT DISTINCT s.domain
FROM   stats s
LEFT   JOIN domains d ON d.name = s.domain
WHERE  d.name IS NULL;

COMMIT;

domains.name should be UNIQUE. That constraint is implemented with an index on the column which will help performance in the next step.
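If that constraint isn't in place yet, it can be added with a single statement (the constraint name is just an example; this assumes no duplicate names remain in domains):

ALTER TABLE domains ADD CONSTRAINT domains_name_key UNIQUE (name);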

Step 2: UPDATE

Update domain_id to make it a foreign key to domains.id. But don't use correlated subqueries; use an UPDATE with a FROM clause. Much faster here. To update some rows but not all:

UPDATE stats s
SET    domain_id = d.id 
FROM   domains d
WHERE  d.name = s.domain
AND    domain_id IS NULL; -- assuming existing ids are correct.

Then you can drop the now redundant column stats.domain:

ALTER TABLE stats DROP column domain;

That is extremely cheap. The column is marked dead in the system catalog. The actual column value is not removed until the row is updated or vacuumed.
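If you want to physically reclaim that space once the migration is done (and can afford an exclusive lock on the table), you can rewrite the table afterwards, for example:

VACUUM FULL stats;  -- rewrites the table, removing dead tuples and the data of the dropped column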

To further improve performance, drop all indexes that are not needed for the operation directly and create them afterwards - all in the same transaction.
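A sketch of that pattern; the index name and definition here are purely illustrative:

BEGIN;
DROP INDEX IF EXISTS stats_domain_id_idx;              -- hypothetical index not needed for the UPDATE

-- run the UPDATE from above here

CREATE INDEX stats_domain_id_idx ON stats (domain_id); -- recreate afterwards
COMMIT;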

Or, to update in batches of n rows:
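A minimal sketch of a batched variant, assuming stats has a primary key column stats_id (as in the schema suggested at the end of this answer); run it repeatedly until it updates 0 rows:

UPDATE stats s
SET    domain_id = d.id
FROM   domains d
WHERE  d.name = s.domain
AND    s.stats_id IN (
          SELECT stats_id
          FROM   stats
          WHERE  domain_id IS NULL   -- relies on Step 1 having inserted every domain
          ORDER  BY stats_id
          LIMIT  10000               -- batch size
       );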

Or, since you clarified in your comment that you are updating all rows, it would be substantially cheaper to create a new table like @Tim3880 also suggested - if constraints and access patterns allow for that.

Either create a completely new table, drop the old one and rename the new one:
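A sketch of the new-table route; list the real columns of stats explicitly (visits stands in for whatever stats columns you actually have), and remember that CREATE TABLE ... AS does not copy indexes, defaults or constraints, so those have to be recreated on the new table:

BEGIN;

CREATE TABLE stats_new AS
SELECT d.id AS domain_id
     , s.visits              -- replace with the actual columns of stats
FROM   stats s
JOIN   domains d ON d.name = s.domain;

DROP TABLE stats;
ALTER TABLE stats_new RENAME TO stats;

COMMIT;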

Or, if you need to keep the existing table in place (due to concurrent access or other constraints), update in place as shown above, possibly in batches.

Aside: never use non-descriptive terms like name or id as column names. That's a widespread anti-pattern. The schema should really be something like:

CREATE TABLE domain (
   domain_id serial PRIMARY KEY
 , domain    text UNIQUE NOT NULL  -- guessing it should be UNIQUE
);

CREATE TABLE stats (
   stats_id  serial PRIMARY KEY
 , domain_id int REFERENCES domain
 -- , domain text  -- can be deleted after above normalization.
);
Erwin Brandstetter
  • Great answer! I'm running this query right now. I also updated my question with another part that's kinda slow. If you could help that would be great. I'll accept the answer as soon as the migration finishes without errors. – Pawel Bieszczad May 14 '15 at 14:36
  • The update is running for about 20 minutes now. Is it possible to do it in batches of 1k or 10k? The update process is taking 20% CPU, the disk usage is increasing very slowly. None of the rows are yet updated in the database. – Pawel Bieszczad May 14 '15 at 15:01
  • @PawelBieszczad: Since you are updating all rows, the table will be twice the size when it's done (unless it was bloated with dead tuples before). You might run `VACUUM FULL` or `CLUSTER` when it's through. If concurrent transactions lock rows, the update may take a very long time. I added links to faster alternatives. It's possible to update in batches, too; I added another link for that as well. – Erwin Brandstetter May 14 '15 at 15:18

Erwin's solution should be good enough and you should be able to do it in 2 hours.

If you have a really big stats table, you may want to skip the last update step. Just create a new table from the primary key of stats and the domain_id.
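For example, a sketch (assuming the primary key of stats is called stats_id; the new table name is just illustrative):

CREATE TABLE stats_domains AS
SELECT s.stats_id, d.id AS domain_id
FROM   stats s
JOIN   domains d ON d.name = s.domain;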

Tim3880