
I have a large main table (~80m rows, ~60GB) which contains company names (NAME). To be able to identify the same company by name (as well as can be done), I have done some standardisation in Python (e.g. removing punctuation) to create a lookup table of (NAME, STD_NAME) pairs.

For example, EXAMPLE (NAMEA) would get standardised to EXAMPLE NAMEA, and EXAMPLE $NAMEA$ would also get standardised to EXAMPLE NAMEA, so I have a table which looks something like:

NAME                STD_NAME
EXAMPLE (NAMEA)     EXAMPLE NAMEA
EXAMPLE $NAMEA$     EXAMPLE NAMEA
...                 ...

(this is relatively small compared to the main table)
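The standardisation itself was done in Python, but purely to make the NAME to STD_NAME mapping concrete, a minimal T-SQL equivalent of the punctuation stripping (covering only the characters in the examples above) could look like:

-- Illustrative only: strips just the punctuation seen in the examples.
-- The real cleaning was done in Python and may cover many more cases.
SELECT NAME,
       LTRIM(RTRIM(
           REPLACE(REPLACE(REPLACE(NAME, '(', ''), ')', ''), '$', '')
       )) AS STD_NAME
FROM MAIN;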

I want to join the STD_NAMEs back onto the main table, to be able to do some aggregation, using the code below:

-- Copy each row's standardised name from the lookup table onto MAIN
UPDATE A
SET A.STD_NAME = B.STD_NAME
FROM MAIN AS A
LEFT JOIN STD_NAME_TABLE AS B
    ON A.NAME = B.NAME;

However, this is extremely expensive and seems to eat up all of my disk space (200GB). I'm not sure if I'm overlooking something or missing a simple way to do this. I have also tried just selecting the NAME column from MAIN and I get the same result.

I'm not sure if I should instead be doing this in row batches (e.g. How to update large table with millions of rows in SQL Server?).
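For reference, the batching pattern from that linked question, adapted to this update, might look something like the sketch below; the batch size, and the assumption that unprocessed rows still have STD_NAME = NULL, are guesses:

DECLARE @rows INT = 1;

WHILE @rows > 0
BEGIN
    -- Update at most 100k rows per transaction so the log space can be
    -- reused between batches (works best under the SIMPLE recovery model)
    UPDATE TOP (100000) A
    SET A.STD_NAME = B.STD_NAME
    FROM MAIN AS A
    INNER JOIN STD_NAME_TABLE AS B
        ON A.NAME = B.NAME
    WHERE A.STD_NAME IS NULL;   -- only rows not yet processed

    SET @rows = @@ROWCOUNT;
END

The INNER JOIN (rather than the LEFT JOIN above) is deliberate: overwriting NULL with NULL for unmatched names would be wasted work, and with a LEFT JOIN the IS NULL filter would keep re-matching those rows forever.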

tbbarr
  • Is 200 GB the end of the world? I would fire away. This is a one-time exercise, right? – kjmerf Jul 21 '21 at 20:29
  • More or less a one-off, but it seems like it should be more space-efficient? Also, it costs more than 200GB; 200GB is just all the space I have at the moment, so I know it uses at least that much. – tbbarr Jul 21 '21 at 20:36
  • I guess you could try creating a new table with that join and then drop the old table and switch the names (see the first sketch after these comments). – kjmerf Jul 21 '21 at 20:39
  • Index both tables `(NAME) INCLUDE (STD_NAME)` (spelled out after these comments), although that itself may take a while and use up disk space. – Charlieface Jul 21 '21 at 20:45
  • Also, if table A has many indexes, this could be horribly slow. If this is a one-off maneuver, you may be able to drop the indexes on table A, do the update quickly, then start adding the indexes back. Obviously, if you had to do this in a production environment, you would have to determine what the consequences would be and how long they would last. – Robert Sievers Jul 21 '21 at 20:55
  • Is it likely to improve things if I give each `NAME` an identifier number and then match on those instead of names? E.g. in the example above, "EXAMPLE (NAMEA)" gets ID = 1 and then the next ordered name is 2 in both tables, etc. – tbbarr Jul 21 '21 at 21:03
  • @tbbarr If you check the execution plan, you're probably getting a hash join anyway. That is, the database is already collapsing the strings to shorter datatypes to facilitate the index lookup. – MatBailie Jul 21 '21 at 21:17
  • If you're intent on updating the table in place (rather than creating a new table and then renaming it), just do it in "batches": `WHERE A.NAME LIKE 'A%'` followed by `WHERE A.NAME LIKE 'B%'` and so on, probably orchestrated by a Python script (this prefix-batching pattern is sketched at the end of the post). – MatBailie Jul 21 '21 at 21:21
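A minimal sketch of the create-and-swap idea from kjmerf's comment above, using the table names from the question; the column list is illustrative, since only NAME and STD_NAME are known here. SELECT ... INTO is minimally logged under the SIMPLE or BULK_LOGGED recovery model, so it can be far cheaper than a fully logged 80m-row UPDATE:

-- Build the joined result as a brand-new table
SELECT A.NAME,          -- ...plus whatever other columns MAIN has
       B.STD_NAME
INTO MAIN_NEW
FROM MAIN AS A
LEFT JOIN STD_NAME_TABLE AS B
    ON A.NAME = B.NAME;

-- Recreate MAIN's indexes on MAIN_NEW, then swap the names
EXEC sp_rename 'MAIN', 'MAIN_OLD';
EXEC sp_rename 'MAIN_NEW', 'MAIN';
DROP TABLE MAIN_OLD;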
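And Charlieface's covering indexes, spelled out (the index names are made up):

-- Covering indexes on the join key, so both sides of the join can be
-- resolved from the indexes alone rather than scanning the 60GB table
CREATE INDEX IX_MAIN_NAME
    ON MAIN (NAME) INCLUDE (STD_NAME);
CREATE INDEX IX_STD_NAME_TABLE_NAME
    ON STD_NAME_TABLE (NAME) INCLUDE (STD_NAME);

Note that the index on MAIN includes STD_NAME, so the UPDATE itself will also have to maintain it.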

1 Answer


Try adding a WHERE clause:

WHERE A.STD_Name <> B.STD_Name 

That will limit the scope of the query and help keep the working set and transaction log more reasonable.

Joel Coehoorn
  • But since all of the `STD_NAME` values are `NULL` in `MAIN`, that WHERE clause will always be true, no? – tbbarr Jul 21 '21 at 21:19
  • Ahh.. I misunderstood you. I thought you had already done this once, and were further normalizing using an additional set of names. In that case, _still use a WHERE clause_, but use it to break this up over several jobs: `WHERE B.Std_Name LIKE 'A%'`, then `WHERE B.Std_Name LIKE 'B%'`, then `WHERE B.Std_Name LIKE 'C%'`, etc. – Joel Coehoorn Jul 21 '21 at 21:34
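A sketch of the prefix-batching pattern from the last two comments, assuming NAME values start with A-Z (any other leading characters would need batches of their own):

-- One smaller transaction per leading letter; repeat for 'B%', 'C%',
-- ... through 'Z%', by hand or from an orchestrating script
UPDATE A
SET A.STD_NAME = B.STD_NAME
FROM MAIN AS A
INNER JOIN STD_NAME_TABLE AS B
    ON A.NAME = B.NAME
WHERE A.NAME LIKE 'A%';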