I have two tables on which in need to apply join on table_name and file_name respectively. Problem is table_name have some extra string in comparison to file_name in table 2.
Using a regex, how can I remove extra string from table_name to make it compatible for join with file_name of table 2?
TABLE 1:
table_name audit_record_count
Immunology_COVID-19_Treatment_202006221630_01.csv 1260124
Immunology_COVID-19_Trial_Design_202006221630_01.csv 2173762
Immunology_COVID-19_Planned_Treatment_202006221630_01.csv 1350135
Immunology_COVID-19_Patient_Characteristic_202006221630_01.csv 2173762
Immunology_COVID-19_Intervention_Type_202006221630_01.csv 2173762
Immunology_COVID-19_Arm_202006221630_01.csv 4
Immunology_COVID-19_Actual_Treatment_202006221630_01.csv 2173762
Immunology_COVID-19_Publication_202006221630_01.csv 2173762
Immunology_COVID-19_Outcome_202006221630_01.csv 2173762
Immunology_COVID-19_Intervention_Type_Factor_202006221630_01.csv 2173762
Immunology_COVID-19_Inclusion_Criteria_202006221630_01.csv 2173762
Immunology_COVID-19_Curation_202006221630_01.csv 2173762
TABLE 2:
file_name csv_record_count
Treatment 1260124
Trial_Design 2173762
Planned_Treatment 1350135
Patient_Characteristic 2173762
Intervention_Type 2173762
Arm 4
Actual_Treatment 2173762
Publication 2173762
Outcome 2173762
Intervention_Type_Factor 2173762
Inclusion_Criteria 2173762
Curation 2173762
What I have tried :
audit_file_df = spark.read.csv(
f"s3://{config['raw_bucket']}/{config['landing_directory']}/{config['audit_file']}/{watermark_timestamp}*.csv",
header=False, inferSchema=True) \
.withColumnRenamed("_c0", "table_name").withColumnRenamed("_c1", "audit_record_count")\
.selectExpr("regexp_extract(table_name, '^(.(?!(\\\\d{12}_\\\\d{2,4}.csv|\\\\d{12}.csv)))*', 0) AS table_name",'audit_record_count')
print("audit_file_df :",audit_file_df)
audit_file_df.show()
validation_df = audit_file_df.join(schema_validation_df, how='inner', on=audit_file_df['table_name'] == schema_validation_df['file_name']).withColumn("count_match",
col=col(
'audit_record_count') == col(
'csv_record_count'))
print("Record validation result")
validation_df.show()
I am able to remove timestamp from table_name but not able to extract file_name to make join condition work.
Addition
Immunology_COVID-19 is not fixed it might change for another file , format for table_name is:
TA_Indication_data_timestamp_nn.csv