I am trying to do a POC for my current project, where we want to check whether Spark can be used.
The current system has batch processes that take data from tables and modify it based on batch code.
I am new to Apache Spark. As part of the POC, I am loading a CSV file into a DataFrame using:
Dataset<Row> df = sparkSession.read()
        .format("csv")
        .option("header", true)
        .option("inferSchema", true)
        .load("C:/Users/xyz/Downloads/Apache spark installation and books/data.csv");
Now, based on the values of two columns (which are in the CSV now), I need to populate a third column.
In the earlier system, we would query another table and, based on these two values, retrieve a third column from that table.
The value of that column was then used to populate the main table.
Now I have the main table in CSV format, but I am not sure how I should store the data of that other table, from which I need to fetch a value based on two columns of the main table.
Can you help with this?
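For reference, this is how I imagine loading that other table as well, simply as a second DataFrame from its own CSV file (the file name lookup.csv below is just a placeholder I made up, since I am not sure this is the right way to store that table):

Dataset<Row> lookupDf = sparkSession.read()
        .format("csv")
        .option("header", true)
        .option("inferSchema", true)
        .load("C:/Users/xyz/Downloads/Apache spark installation and books/lookup.csv");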
EDIT
More information:
In my current system I have two tables, A and B.
Table A
col1 col2 col3
data1 data2 data3
Table B
col1 col2 col3 col4 col5 col6 ... coln
data1 data2 data3 data4 data5 data6 ... datan
Currently, what happens is:
col2 and col3 from Table A are also present in Table B.
col1 of Table A is also present in Table B, but with empty values.
So the col2 and col3 values, which appear in Table B as col8 and col9, are used to populate that empty column in Table B with the corresponding col1 values from Table A.
To do this in Spark using Java, I have created two CSV files, one for each table (is this approach correct?), and loaded them into DataFrames.
Now I am not sure how to perform the above operation and update the DataFrame containing Table B; a rough sketch of what I have been trying is at the end of this post.
I hope this clarifies things.
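For completeness, this is roughly the direction I have been experimenting with. It is only a minimal sketch: the file names tableA.csv and tableB.csv and the column names col8 and col9 are my own placeholders based on the description above, and I am not sure whether a join like this is the right approach or whether the DataFrame for Table B can really be "updated" this way:

// dfA: Table A with col1, col2, col3
// dfB: Table B with an empty col1 and the matching values in col8 and col9
Dataset<Row> dfA = sparkSession.read().format("csv")
        .option("header", true).option("inferSchema", true)
        .load("C:/Users/xyz/Downloads/Apache spark installation and books/tableA.csv");
Dataset<Row> dfB = sparkSession.read().format("csv")
        .option("header", true).option("inferSchema", true)
        .load("C:/Users/xyz/Downloads/Apache spark installation and books/tableB.csv");

// Join B to A where B.col8 = A.col2 and B.col9 = A.col3,
// then drop B's empty col1 and A's duplicate join columns,
// so the remaining col1 is the looked-up value from Table A.
Dataset<Row> updatedB = dfB
        .join(dfA,
              dfB.col("col8").equalTo(dfA.col("col2"))
                 .and(dfB.col("col9").equalTo(dfA.col("col3"))),
              "left_outer")
        .drop(dfB.col("col1"))
        .drop(dfA.col("col2"))
        .drop(dfA.col("col3"));

Since DataFrames are immutable, updatedB is a new DataFrame rather than an in-place update of dfB, which is part of what I am unsure about.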