I have a large main table (~80m rows, ~60GB size) which has company names (NAME
) in it. To be able to identify the same company by name (as well as can be done), I have done some standardisation in Python (e.g. removed punctuation) to create a table with (NAME
-STD_NAME
).
For example, EXAMPLE (NAMEA) would get standardised to EXAMPLE NAMEA and EXAMPLE $NAMEA$ would also get standardised to EXAMPLE NAMEA so I have a table which is something like:
NAME | STD_NAME |
---|---|
EXAMPLE (NAMEA) | EXAMPLE NAMEA |
EXAMPLE $NAMEA$ | EXAMPLE NAMEA |
... | ... |
(this is relatively small compared to the main table)
I want to join the STD_NAME
's back to be able to do some aggregation using code below:
UPDATE A
SET A.STD_NAME = B.STD_NAME
FROM MAIN AS A
LEFT JOIN STD_NAME_TABLE AS B
ON A.NAME = B.NAME
;
However, this is extremely expensive and seems to eat up all of my disk space (200GB). I'm not sure if I'm overlooking something or missing a simple way to do this. I have also tried just selecting the NAME
column from MAIN
and I get the same result.
Not sure if I should maybe be doing this in row batches (e.g. How to update large table with millions of rows in SQL Server?)