I have a CSV file, dbname1.table1.csv:
target             | source        | source_table                               | relation_type
-------------------|---------------|--------------------------------------------|--------------
avg_ensure_sum_12m | inn_num       | custom_cib_ml_stg.p_overall_part_tend_cust | direct
avg_ensure_sum_12m | protocol_dttm | custom_cib_ml_stg.p_overall_part_tend_cust | direct
avg_ensure_sum_12m | inn_num       | custom_cib_ml_stg.p_overall_part_tend_cust | indirect
The raw CSV content of this table:
target,source,source_table,relation_type
avg_ensure_sum_12m,inn_num,custom_cib_ml_stg.p_overall_part_tend_cust,direct
avg_ensure_sum_12m,protocol_dttm,custom_cib_ml_stg.p_overall_part_tend_cust,direct
avg_ensure_sum_12m,inn_num,custom_cib_ml_stg.p_overall_part_tend_cust,indirect
Then I create a DataFrame by reading it:
val dfDL = spark.read
  .option("delimiter", ",")
  .option("header", true)
  .csv(file.getPath.toUri.getPath)
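(Since no explicit schema is given and inferSchema is off, every column comes back as a nullable string; a quick check, purely illustrative:)

dfDL.printSchema()
// root
//  |-- target: string (nullable = true)
//  |-- source: string (nullable = true)
//  |-- source_table: string (nullable = true)
//  |-- relation_type: string (nullable = true)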
Now I need to create a new DataFrame based on dfDL.
The structure of the new dataframe looks like this:
case class DataLink(schema_from: String,
                    table_from: String,
                    column_from: String,
                    link_type: String,
                    schema_to: String,
                    table_to: String,
                    column_to: String)
The values for the fields of the new DataFrame come from the CSV file:
pseudocode (note: Scala's String.split takes a regex, so the dot must be escaped):
schema_from = source_table.split("\\.")(0)         // Example: custom_cib_ml_stg
table_from  = source_table.split("\\.")(1)         // Example: p_overall_part_tend_cust
column_from = source                               // Example: inn_num
link_type   = relation_type                        // Example: direct
schema_to   = "dbname1.table1.csv".split("\\.")(0) // Example: dbname1
table_to    = "dbname1.table1.csv".split("\\.")(1) // Example: table1
column_to   = target                               // Example: avg_ensure_sum_12m
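A minimal sketch of that mapping with the DataFrame API (dfLinks is my name; schemaTo and tableTo are assumed to have been extracted from the file name as above; functions.split also takes a regex):

import org.apache.spark.sql.functions.{col, lit, split}

val schemaTo = "dbname1" // assumed: first part of the file name
val tableTo  = "table1"  // assumed: second part of the file name

val dfLinks = dfDL.select(
  split(col("source_table"), "\\.").getItem(0).as("schema_from"),
  split(col("source_table"), "\\.").getItem(1).as("table_from"),
  col("source").as("column_from"),
  col("relation_type").as("link_type"),
  lit(schemaTo).as("schema_to"),
  lit(tableTo).as("table_to"),
  col("target").as("column_to")
)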
I need to create this new DataFrame, but I can't work it out on my own.
P.S. Later I need to produce a JSON file from this DataFrame. Example JSON:
[{"schema_from":"custom_cib_ml36_stg",
"table_from":"p_overall_part_tend_cust",
"column_from":"inn_num",
"link_type":"direct",
"schema_to":"dbname1",
"table_to":"table1",
"column_to":"avg_ensure_sum_12m"
},
{"schema_from":"custom_cib_ml36_stg",
"table_from":"p_overall_part_tend_cust",
"column_from":"protocol_dttm",
"link_type":"direct","schema_to":"dbname1",
"table_to":"table1",
"column_to":"avg_ensure_sum_12m"}
I don't like my current implementation:
def readDLFromHDFS(file: LocatedFileStatus): Array[DataLink] = {
  // The file name encodes the target schema and table: "dbname1.table1.csv"
  val arrTableName = file.getPath.getName.split("\\.")
  val (schemaTo, tableTo) = (arrTableName(0), arrTableName(1))

  val dfDL = spark.read
    .option("delimiter", ",")
    .option("header", true)
    .csv(file.getPath.toUri.getPath)

  // Columns by position: 0 = target, 1 = source, 2 = source_table, 3 = relation_type
  dfDL.collect.map(row => DataLink(
    row.getString(2).split("\\.")(0), // schema_from
    row.getString(2).split("\\.")(1), // table_from
    row.getString(1),                 // column_from
    row.getString(3),                 // link_type
    schemaTo,
    tableTo,
    row.getString(0)                  // column_to
  ))
}
// json4s is assumed in scope:
//   import org.json4s._
//   implicit val formats: Formats = DefaultFormats
def toJSON(dataLinks: Array[DataLink]): Option[JValue] =
  dataLinks.map(Extraction.decompose).reduceOption(_ ++ _)
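For what it's worth, a sketch of a collect-free variant under the same assumptions (readDLFromHDFS2 is a hypothetical name; column names as in the CSV header; spark.implicits._ imported for the DataLink encoder). It returns a typed Dataset instead of a driver-side Array, so nothing is materialized until you actually write it out:

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.{col, lit, split}
import spark.implicits._ // provides the Encoder[DataLink] for .as[DataLink]

def readDLFromHDFS2(file: LocatedFileStatus): Dataset[DataLink] = {
  // "dbname1.table1.csv" -> schemaTo = "dbname1", tableTo = "table1"
  val Array(schemaTo, tableTo, _*) = file.getPath.getName.split("\\.")

  spark.read
    .option("delimiter", ",")
    .option("header", true)
    .csv(file.getPath.toUri.getPath)
    .select(
      split(col("source_table"), "\\.").getItem(0).as("schema_from"),
      split(col("source_table"), "\\.").getItem(1).as("table_from"),
      col("source").as("column_from"),
      col("relation_type").as("link_type"),
      lit(schemaTo).as("schema_to"),
      lit(tableTo).as("table_to"),
      col("target").as("column_to")
    )
    .as[DataLink] // matches the case class fields by column name
}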