I have 2 datasets. In each one I have several columns. But I want to use only 2 columns from each dataset, without doing any join, merge or combination between the both of the datasets.
Example dataset 1:
column_dataset_1 <String> | column_dataset_1_normalized <String>
-----------------------------------------------------------------------
11882621-V021BRP161305-1 | 11882621V021BRP1613051
-----------------------------------------------------------------------
W-B.7120RP1605794 | WB7120RP1605794
-----------------------------------------------------------------------
D/57RP.1534421 | D57RP1534421
-----------------------------------------------------------------------
125858G_022BR/P070751 | 125858G022BRP070751
-----------------------------------------------------------------------
300B.5190C57/51507 | 300B5190C5751507
-----------------------------------------------------------------------
Example dataset 2
column_dataset_2 <String> | column_dataset_2_normalized <String>
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Por ejemplo, si W-B.7120RP1605794se trata de un archivo de texto, | PorejemplosiWB7120RP1605794setratadeunarchivodetexto
-------------------------------------------------------------------------------------------------------------------------------------------------------------
se abrirá en un programa de procesamiento de texto. | seabrirenunprogramadeprocesamientodetexto
-------------------------------------------------------------------------------------------------------------------------------------------------------------
|
-------------------------------------------------------------------------------------------------------------------------------------------------------------
utilizados 125858G_022BR/P070751 frecuentemente (por ejemplo, un texto que describe | utilizados125858G022BRP070751frecuentementeporejemplountextoquedescribe
--------------------------------------------------------------------------------------------------------------------------------------------------------------
column_dataset_1_normalized is the result of column_dataset_1 is normalized column_dataset_2_normalized is the resut of column_dataset_2 is normalized
I want to compare column_dataset_1_normalized
if is exist in column_dataset_2_normalized
.
If yes I should extract it from column_dataset_2
Example:
WB7120RP1605794
is in the second line
of column_dataset_1_normalized
, is exist in the first line of column_dataset_2_normalized
, so I should extract
it's real value [W-B.7120RP1605794]
, from column_dataset_2
and store it in a new column in dataset 2.
And the same for 125858G022BRP070751
is in forth line in column_dataset_2_normalized
, I should extract it from column_dataset_2 [125858G_022BR/P070751]
.
The comparaison should, take one by one value of column_dataset_1_normalized
and search it in all the cell of column_dataset_2_normalized
.
For normalization I used this code to kepp only number and letter:
df = df.withColumn(
"column_normalized",
F.regexp_replace(F.col("column_to_normalize"), "[^a-zA-Z0-9]+", ""))
Someone can propose me a suggestion how can I do it ? Thank you