
I am trying to do a POC for my current project, where we want to check whether Spark can be used.

The current system has batch processes that take data from tables and modify it based on batch code.

I am new to Apache Spark. As part of the POC, I am loading a CSV file into a DataFrame using:

    Dataset<Row> df = sparkSession.read()
            .format("csv")
            .option("header", true)
            .option("inferSchema", true)
            .load("C:/Users/xyz/Downloads/Apache spark installation and books/data.csv");

Now, based on the values of two columns (now in the CSV), I need to populate a third column.

In the earlier system, we would query a table and, based on these two values, retrieve a third column from that table.

The value of that column was then used to populate the main table.

Now I have the main table in CSV format, but I am not sure how I should store the data of the other table, from which I need to fetch a value based on the two columns of the main table.

Can you help with this?

EDIT

More information:

In my current system I have two tables, A and B.

Table A

    col1    col2    col3
    data1   data2   data3

Table B

    col1    col2    col3    col4    col5    col6    ...    coln
    data1   data2   data3   data4   data5   data6   ...    datan

Currently, what happens is:

col2 and col3 from table A are also present in table B.

col1 of table A is also present in table B, but with empty values.

So the col2 and col3 values, which appear in table B as col8 and col9, are used to look up the corresponding row in table A and populate the empty column in table B with table A's col1 value.

To perform this in Spark using Java, I have created two CSV files, one for each table (is this approach correct?), and loaded them into DataFrames.

Now I am not sure how to perform the above operation and update the DataFrame containing table B.

I hope this clarifies things.

Curious one
    I think that you might want to illustrate what you are actually trying to do. i.e A small data sample. The logic to create that new column – eliasah Apr 20 '18 at 06:38
  • @eliasah I tried to elaborate, but I cant share the actual data as it is of production, sorry for the ambiguity – Curious one Apr 20 '18 at 06:50
  • It's still unclear for me. What is the final value expected in which columns ? – NayoR Apr 20 '18 at 06:54
  • create dummy data, obfuscate stuff. You can't get help if you don't give us more info – eliasah Apr 20 '18 at 10:03
  • @eliasah You are right I have mentioned CSV file structure in this question : https://stackoverflow.com/questions/49936260/apache-spark-how-to-insert-data-in-a-column-with-empty-values-in-dataframe-usi – Curious one Apr 20 '18 at 10:29
  • @NayoR I have mentioned CSV file structure in this question : https://stackoverflow.com/questions/49936260/apache-spark-how-to-insert-data-in-a-column-with-empty-values-in-dataframe-usi – Curious one Apr 20 '18 at 10:49
  • Is it the same question ?? – NayoR Apr 20 '18 at 12:35
  • @NayoR Yes it is explained differently.. – Curious one Apr 23 '18 at 06:48
  • In this case, just edit one of two and delete the other, there no point to ask 2 times the same question on Stackoverflow.. – NayoR Apr 23 '18 at 08:22
  • Possible duplicate of [Apache Spark : how to insert data in a column with empty values in dataFrame using Java](https://stackoverflow.com/questions/49936260/apache-spark-how-to-insert-data-in-a-column-with-empty-values-in-dataframe-usi) – NayoR Apr 23 '18 at 08:23

1 Answer


You can use the JDBC driver to query the table data and load it into a Dataset, and then use join operations on the Datasets to manipulate the data based on your requirements.

    Dataset<Row> jdbcDF = SparkSessionBuilder.instance().getSparkSession().read()
            .format("jdbc")
            .option("url", ConnectionString)
            .option("driver", driver)
            .option("dbtable", Query)
            .option("user", username)
            .option("password", password)
            .load();
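
Once the lookup table is loaded, the lookup itself can be expressed as a join on the two key columns. A minimal sketch, where df is the main-table DataFrame from the question and key1/key2 are placeholder names for the lookup table's key columns:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Join the main table to the lookup table on the two key columns.
    // A left outer join keeps main-table rows that have no match; the
    // joined result carries the lookup table's columns alongside each
    // main-table row, including the value used to populate the third column.
    Dataset<Row> result = df.join(jdbcDF,
            df.col("key1").equalTo(jdbcDF.col("key1"))
                    .and(df.col("key2").equalTo(jdbcDF.col("key2"))),
            "left_outer");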
developer.raj
  • But what I meant is that my current project uses tables; if I want to implement the same functionality using Spark, do I still need to use tables? – Curious one Apr 20 '18 at 06:14
  • No, it's up to you. You can either use tables and load them into Datasets to perform operations, or simply load files from HDFS into DataFrames and continue your processing. – developer.raj Apr 20 '18 at 06:19
  • Currently I am doing the POC on a Windows machine, so I have created two CSV files for the two tables and loaded them into separate DataFrames. Now, if two column values in df1 match two column values in df2, I need to update a third column in df2. How can that be done, and is this approach correct? – Curious one Apr 20 '18 at 06:35
  • Maybe this helps: https://stackoverflow.com/questions/44338412/how-to-compare-two-dataframe-and-print-columns-that-are-different-in-scala (see also the sketch below) – developer.raj Apr 20 '18 at 07:54
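
For the two-CSV setup discussed in the comments above, a minimal sketch of the whole flow might look like the following, assuming the column names from the question's EDIT (col8/col9 in table B hold the same values as col2/col3 in table A, and col1 in B is empty); the file paths are placeholders. Note that DataFrames are immutable, so the "update" produces a new Dataset rather than modifying dfB in place:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Load both tables from CSV (paths are placeholders).
    Dataset<Row> dfA = sparkSession.read().format("csv")
            .option("header", true).option("inferSchema", true)
            .load("tableA.csv");
    Dataset<Row> dfB = sparkSession.read().format("csv")
            .option("header", true).option("inferSchema", true)
            .load("tableB.csv");

    // col1 in table B is empty, so drop it and take col1 from table A
    // instead, via a join on the two matching columns. The left outer
    // join keeps rows of B that have no match in A (their col1 is null).
    Dataset<Row> updatedB = dfB.drop("col1")
            .join(dfA,
                    dfB.col("col8").equalTo(dfA.col("col2"))
                            .and(dfB.col("col9").equalTo(dfA.col("col3"))),
                    "left_outer")
            .drop(dfA.col("col2"))
            .drop(dfA.col("col3"));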