I have a dataset which has a combination of simple(String) and complex data types, delimited with pipe.

1111|1234567891011|ABC11|JOSE|"linkEnrollment": {"Group": [{"action": "ADD","groupType": "ROSS","groupId": "GRP-1","isValid": "Y"},{"action": "ADD","groupType": "CROSS","groupId": "GRP-2","isValid": " "}]}
2222|9876543256827|ABC22|JACK|"linkEnrollment": {"Group": [{"action": "DEL","groupType": "ROCK","groupId": "GRP-7","isValid": "N"}]}

Corresponding columns are:

UUID(String)|PID(String)|DEVID(String)|FIRSTNAME(String)|LINK(String which is a JSON)

My requirement is that I need to load this data into a Hive table using Spark Java. I need to know:

  1. How to read the above data and convert it into a DataFrame (using a StructType schema) to insert into a Hive table.
  2. How to load the LINK column data into the Hive table, and what its data type in the table will be.
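Before reaching for Spark, it may help to see how the record layout behaves under a plain split. A minimal plain-Java sketch (class and method names are illustrative), using a limited split so that any pipe characters that might appear inside the JSON payload stay in the fifth field:

```java
public class RecordSplitter {
    // Split a pipe-delimited record into exactly 5 fields.
    // The limit argument stops splitting after the 4th pipe, so the
    // trailing JSON payload is kept in one piece as field 5.
    static String[] splitRecord(String line) {
        return line.split("\\|", 5);
    }

    public static void main(String[] args) {
        String line = "1111|1234567891011|ABC11|JOSE|"
            + "\"linkEnrollment\": {\"Group\": [{\"action\": \"ADD\"}]}";
        String[] fields = splitRecord(line);
        System.out.println(fields.length);   // 5
        System.out.println(fields[0]);       // 1111
        System.out.println(fields[3]);       // JOSE
    }
}
```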
user1326784

1 Answer

You can read your file using Spark's CSV loader:

Dataset<Row> ds = spark.read().format("csv")
    .option("sep", "|")
    .option("inferSchema", "true")
    .option("header", "false")  // the sample data has no header row
    .load("yourfile.csv");

Then you can unpack your JSON column by reading it as a JSON dataset:

Dataset<String> jsonStrings = ds.map(
    (MapFunction<Row, String>) row -> row.<String>getAs("your_column"),
    Encoders.STRING());
Dataset<Row> jsonData = spark.read().json(jsonStrings);

See also: How to query JSON data column using Spark DataFrames?

Then you can store it in Hive, and the JSON column will be a StructType that you will be able to query.
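Regarding the second question: once the LINK strings are parsed, the natural Hive representation is a struct containing an array of structs. Note that the sample LINK values ("linkEnrollment": {...}) are not complete JSON objects on their own; they may need to be wrapped in { } before parsing. A hypothetical DDL sketch matching the sample records (the table name and storage format are assumptions):

```sql
-- Hypothetical Hive table for the sample data;
-- LINK becomes a struct holding an array of group structs.
CREATE TABLE enrollment (
  uuid      STRING,
  pid       STRING,
  devid     STRING,
  firstname STRING,
  link      STRUCT<
              Group: ARRAY<STRUCT<
                action:    STRING,
                groupType: STRING,
                groupId:   STRING,
                isValid:   STRING
              >>
            >
)
STORED AS PARQUET;
```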

Vincent
  • I cannot use from_json directly because my data is a simple string type, not a dataframe. First I need to convert the string to a dataframe, and then probably I can convert it. So how do I do the first step? Note: I am using Spark Java, not Scala – user1326784 Oct 30 '19 at 20:39
  • Doesn't matter if it's a string; Spark can parse it. You can always create a new dataframe from this column: new_df = spark.read.json(your_df(your_json_string_column)), which is exactly what from_json would be doing, with the extra step of using a schema and adding it to a column in the original dataframe. The APIs for Java, Python and Scala are exactly the same in Spark – Vincent Oct 31 '19 at 13:51
  • My question is: after I read the JSON column as a dataframe, what is the code in Java to convert it using from_json? What is the code after this step: Dataset<Row> linkdf = df.select("LINK"); – user1326784 Oct 31 '19 at 18:46
  • I'm not sure I understand; you want to write this dataframe to Hive, right? If you already figured out how to get a dataframe out of your JSON string, just use linkdf.write().saveAsTable("your_hive_table") – Vincent Oct 31 '19 at 18:53
  • No Vincent. The LINK column is of String data type (which holds the JSON string). If I directly store it in Hive, it will be stored as a String. I want to store it as an array of structs or a similar data type in Hive. I am not sure what data type in Hive will support this – user1326784 Oct 31 '19 at 20:44
  • That's why I told you to use new_df = spark.read.json(your_df(your_json_string_column)) in my first comment: so you end up with a dataframe instead of a string, and then you'll be able to store it in Hive – Vincent Nov 01 '19 at 13:25