What I have done in the past when comparing data between databases is to create a per-column "MD5-like" signature. In your case you could do something similar on your "pre" table contents and your "post" table contents. This would only tell you which columns differ, but that may be all you need.
Debugging when there is a difference can be hard, but you could "save" a copy of the table to S3 for debug use. This could defeat the speed you are looking for, so you may only want to run this way when there is an issue or when turning on testing. You could also run such a process by "date" so that you can get the day and column that mismatch.
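If you do decide to keep a debug copy, an UNLOAD of the "pre" table to S3 is one way to archive it. This is only a minimal sketch; the table, bucket, prefix, and IAM role ARN are placeholders for your own values:
-- archive the "pre" version of the table for later debugging (all names below are placeholders)
unload ('select * from public.my_table')
to 's3://my-debug-bucket/my_table/pre_load_'
iam_role 'arn:aws:iam::123456789012:role/my-redshift-unload-role'
format as parquet;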
I've built such a signature several different ways, as non-Redshift databases aren't always as fast as Redshift. Since you are comparing Redshift to Redshift, the comparison process becomes easier and faster. What I'd do in this case is perform MD5(columnN::text) for every column, then convert a portion of the hex result to a BIGINT. You can then SUM() these values for every column. (SUM() is the easiest way to aggregate the column information while using only a subset of the MD5 result.) Since MD5 signatures are long, using a subset of the result is fine, as the MD5 hash spreads the "uniqueness" across the whole result. Overflow can be an issue, so adding a negative constant to each value helps with this. The resulting query will look something like:
select
sum(nvl(strtol(substring({{column.column_name}}, 17, 8), 16) - (1::bigint << 31), 0))
from <<CTE>>;
This is from a jinja2 template I use for this process, which allows me to read the table DDL and convert non-text columns to text in a CTE. Hopefully this snippet is clear enough to show how the process works.
================= UPDATE ================
I sense there is some confusion over the use of jinja2. Jinja2 is a template language that can be used to expand text based on some input. It doesn't perform any of the SQL work that creates the signatures and compares the tables; it is just a way to make repeating this work for different tables faster.
Here's an example of what creating a signature for a table looks like:
select
(
sum(nvl(strtol(substring(USAF, 17, 8), 16) - (1::bigint << 31), 0))
+ sum(nvl(strtol(substring(WBAN, 17, 8), 16) - (1::bigint << 31), 0))
+ sum(nvl(strtol(substring(STATION_NAME, 17, 8), 16) - (1::bigint << 31), 0))
+ sum(nvl(strtol(substring(CTRY, 17, 8), 16) - (1::bigint << 31), 0))
+ sum(nvl(strtol(substring(STATE, 17, 8), 16) - (1::bigint << 31), 0))
+ sum(nvl(strtol(substring(ICAO, 17, 8), 16) - (1::bigint << 31), 0))
+ sum(nvl(strtol(substring(LAT, 17, 8), 16) - (1::bigint << 31), 0))
+ sum(nvl(strtol(substring(LON, 17, 8), 16) - (1::bigint << 31), 0))
+ sum(nvl(strtol(substring(ELEV, 17, 8), 16) - (1::bigint << 31), 0))
+ sum(nvl(strtol(substring(begin_date, 17, 8), 16) - (1::bigint << 31), 0))
+ sum(nvl(strtol(substring(END_date, 17, 8), 16) - (1::bigint << 31), 0))
) as "hash"
from (
select
md5(USAF::text) as USAF
, md5(WBAN::text) as WBAN
, md5(STATION_NAME::text) as STATION_NAME
, md5(CTRY::text) as CTRY
, md5(STATE::text) as STATE
, md5(ICAO::text) as ICAO
, md5(LAT::text) as LAT
, md5(LON::text) as LON
, md5(ELEV::text) as ELEV
, md5(floor(extract(epoch from begin_date))::text) as begin_date
, md5(floor(extract(epoch from END_date))::text) as END_date
from public.gsod_stations
);
You can see that for each column some hash (md5 in this case) needs to be calculated, and how this is done is data type dependent. These hashes are then summed to create a column-level hash, and in turn these column-level hashes are summed to create a table-level hash. (The offset of 1::bigint << 31 helps prevent overflow for large tables: each 8-hex-character substring parses to a value between 0 and 2^32 - 1, so subtracting 2^31 centers the values around zero and keeps the running sum well inside the BIGINT range.) This isn't difficult to understand, but creating this SQL for every table is a pain.
This is where jinja2 comes in. By having the SQL in a template and the DDL for the table as input data, the SQL can be generated by Jinja.
Jinja SQL template (in Redshift SQL syntax):
{% for table in tables %}
{%- if table.table_name == target or target is not defined %}
{% set vars = {"first_column":true} %}
select
(
{% for column in table.col_names -%}
{%- if not column.skip_compare -%}
{%- if vars.first_column -%}
{%- if vars.update({"first_column": false}) -%} {%- endif -%}
{%- else -%}
{% raw %}+ {% endraw %}
{%- endif -%}
sum(nvl(strtol(substring({{column.column_name}}, 17, 8), 16) - (1::bigint << 31), 0))
{%- else -%}
-- skipping {{column.column_name}}
{%- endif %}
{% endfor -%}
) as "hash"
from (
select
{%- set vars = {"first_column":true} %}
{% for column in table.col_names -%}
{%- if not column.skip_compare -%}
{%- if vars.first_column -%}
{%- if vars.update({"first_column": false}) -%} {%- endif -%}
{%- else -%}
{% raw %}, {% endraw %}
{%- endif -%}
{%- if column.RS_col_type in ["date", "timestamp"] -%}
md5(floor(extract(epoch from {{column.column_name}}))::text) as {{column.column_name}}
{%- elif column.RS_col_type in ["boolean", "bool"] -%}
md5(({{column.column_name}}::int)::text) as {{column.column_name}}
{%- else -%}
md5({{column.column_name}}::text) as {{column.column_name}}
{%- endif -%}
{%- else -%}
-- skipping {{column.column_name}}
{%- endif %}
{% endfor -%}
from {{table.RS_schema}}.{{table.table_name}}
)
;
{% endif %}
{% endfor %}
And an example json DDL file (contains info for 2 tables):
{"tables":
[
{"table_name":"gsod_stations",
"RS_schema":"public",
"col_names": [
{"column_name":"USAF", "RS_col_type":"varchar(10)", "RS_col_params":"ENCODE zstd"},
{"column_name":"WBAN", "RS_col_type":"integer", "RS_col_params":"ENCODE zstd"},
{"column_name":"STATION_NAME", "RS_col_type":"varchar(80)", "RS_col_params":"ENCODE zstd"},
{"column_name":"CTRY", "RS_col_type":"varchar(30)", "RS_col_params":"ENCODE zstd"},
{"column_name":"STATE", "RS_col_type":"varchar(30)", "RS_col_params":"ENCODE zstd"},
{"column_name":"ICAO", "RS_col_type":"varchar(30)", "RS_col_params":"ENCODE zstd"},
{"column_name":"LAT", "RS_col_type":"float4", "RS_col_params":"ENCODE zstd"},
{"column_name":"LON", "RS_col_type":"float4", "RS_col_params":"ENCODE zstd"},
{"column_name":"ELEV", "RS_col_type":"float4", "RS_col_params":"ENCODE zstd"},
{"column_name":"begin_date", "RS_col_type":"date", "RS_col_params":"ENCODE zstd"},
{"column_name":"END_date", "RS_col_type":"date", "RS_col_params":"ENCODE zstd"}
],
"RS_sort_stmnt":"SORTKEY (USAF,WBAN)",
"RS_dist_stmnt":"DISTKEY (USAF)"},
{"table_name":"gsod_weather_station_data",
"RS_schema":"public",
"col_names": [
{"column_name":"station_wban", "RS_col_type":"varchar(12)", "RS_col_params":"ENCODE zstd"},
{"column_name":"station_id", "RS_col_type":"varchar(10)", "RS_col_params":"NOT NULL ENCODE zstd"},
{"column_name":"wban", "RS_col_type":"integer", "RS_col_params":"NOT NULL ENCODE zstd"},
{"column_name":"yearmoda", "RS_col_type":"date", "RS_col_params":"NOT NULL ENCODE raw"},
{"column_name":"temp", "RS_col_type":"decimal(8,2)", "RS_col_params":"NULL ENCODE zstd"},
{"column_name":"tcount", "RS_col_type":"integer", "RS_col_params":"ENCODE zstd"},
{"column_name":"dewpoint", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"dcount", "RS_col_type":"integer", "RS_col_params":"ENCODE zstd"},
{"column_name":"slp", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"slpcount", "RS_col_type":"integer", "RS_col_params":"ENCODE zstd"},
{"column_name":"stp", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"stpcount", "RS_col_type":"integer", "RS_col_params":"ENCODE zstd"},
{"column_name":"visibility", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"vcount", "RS_col_type":"integer", "RS_col_params":"ENCODE zstd"},
{"column_name":"wind_speed", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"wcount", "RS_col_type":"integer", "RS_col_params":"ENCODE zstd"},
{"column_name":"max_wind_speed", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"max_wind_gust", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"max_temp", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"maxtflags", "RS_col_type":"char(2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"min_temp", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"mintflags", "RS_col_type":"char(2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"precip", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"pflags", "RS_col_type":"char(2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"snow_depth", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"fog", "RS_col_type":"bool", "RS_col_params":"ENCODE zstd"},
{"column_name":"rain", "RS_col_type":"bool", "RS_col_params":"ENCODE zstd"},
{"column_name":"snow", "RS_col_type":"bool", "RS_col_params":"ENCODE zstd"},
{"column_name":"hail", "RS_col_type":"bool", "RS_col_params":"ENCODE zstd"},
{"column_name":"thunder", "RS_col_type":"bool", "RS_col_params":"ENCODE zstd"},
{"column_name":"tornado", "RS_col_type":"bool", "RS_col_params":"ENCODE zstd"},
{"column_name":"STATION_NAME", "RS_col_type":"varchar(80)", "RS_col_params":"ENCODE zstd"},
{"column_name":"CTRY", "RS_col_type":"char(30)", "RS_col_params":"ENCODE zstd"},
{"column_name":"STATE", "RS_col_type":"char(30)", "RS_col_params":"ENCODE zstd"},
{"column_name":"ICAO", "RS_col_type":"char(30)", "RS_col_params":"ENCODE zstd"},
{"column_name":"LAT", "RS_col_type":"decimal(8,3)", "RS_col_params":"ENCODE zstd"},
{"column_name":"LON", "RS_col_type":"decimal(8,3)", "RS_col_params":"ENCODE zstd"},
{"column_name":"ELEV", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"begin_date", "RS_col_type":"date", "RS_col_params":"ENCODE zstd", "skip_compare":true},
{"column_name":"END_date", "RS_col_type":"date", "RS_col_params":"ENCODE zstd"}
],
"RS_sort_stmnt":"SORTKEY (yearmoda, station_wban)",
"RS_dist_stmnt":"DISTKEY (station_wban)"}
]
}
The SQL produced can be run and a signature created for each table. You then want to compare these signatures between tables, and possibly between databases, which is exactly why I developed this process. Since which tables / databases you are comparing can vary, how you carry out the comparison will also vary, but it basically comes down to comparing these hash values to see if the tables contain the same content.
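As a concrete illustration, here is a minimal sketch of such a compare, assuming the "pre" copy of gsod_stations has been kept in a hypothetical staging schema, and abbreviating the signature to a single column (the generated SQL would include the full column list):
-- staging.gsod_stations_pre is a hypothetical pre-load copy of the table
with pre_sig as (
select sum(nvl(strtol(substring(md5(USAF::text), 17, 8), 16) - (1::bigint << 31), 0)) as "hash"
from staging.gsod_stations_pre
),
post_sig as (
select sum(nvl(strtol(substring(md5(USAF::text), 17, 8), 16) - (1::bigint << 31), 0)) as "hash"
from public.gsod_stations
)
select case when pre_sig."hash" = post_sig."hash" then 'match' else 'MISMATCH' end as result
from pre_sig cross join post_sig;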
Things to note:
- While Redshift is very fast at generating md5 values and summing them, other databases are not as quick for large datasets. I have had to "dumb down" the hashes to something much simpler for such databases.
- Any hash is lossy, so there is some finite possibility for 2 things to hash the same, or in this case for the sums of 2 sets of hashes to falsely match. The odds of this are extremely low but not zero.
- MD5 spreads its "uniqueness" across the whole hash, so any slice of it is also a good discriminator, just not to the same degree as the full value. By selecting a section of the md5 for the sum, the speed of the operation can be improved while maintaining very high confidence that there are no false compares.
- This process can only indicate "match" or "mismatch". Locating the difference(s) will take some hashing of subsets of the tables AND/OR direct comparison of the table values (see the per-date sketch below). In other words, having just the hash signature for the old version of the table won't help you if (when) you need to debug why things mismatched. Archiving the old version of the table for such debug purposes is advisable.
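For instance, a minimal per-date sketch against gsod_weather_station_data (abbreviated to one column; the generated SQL would hash every column) can narrow a mismatch down to the day it first appears:
-- per-date signature sketch; yearmoda is the table's date column
select
yearmoda
, sum(nvl(strtol(substring(md5(temp::text), 17, 8), 16) - (1::bigint << 31), 0)) as temp_hash
from public.gsod_weather_station_data
group by yearmoda
order by yearmoda;
Running this against both the old and new copies and diffing the output identifies which days (and which columns) disagree.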