0

My Input is a file in which JSON has been condensed into one column:

Example of raw data file:

1   JWTERRELIG.HMS_834_061118.txt   "{""loopid"":""ST"",""segments"":[{""segmentid"":""ST"",""content"":""834:00015:005010X220A1""},{""segmentid"":""BGN"",""content"":""00:4736:20180611:1513::::4""}]}"   "{""loopid"":""1000A"",""segments"":[{""segmentid"":""N1"",""content"":""P5:PLUMBERS & PIPEFITTERS LOCAL 160:FI:37-6134925""}]}"    "{""loopid"":"""",""segments"":null}"   "{""loopid"":""1000C"",""segments"":[{""segmentid"":""N1"",""content"":""TV:J.W.TERRILL BENEFIT ADMINISTRATORS:FI:431834988""}]}"   "{""loopid"":""2000"",""segments"":[{""segmentid"":""INS"",""content"":""Y:18:030:XN:A:E::FT""},{""segmentid"":""REF"",""content"":""0F:999999999""},{""segmentid"":""REF"",""content"":""1L:001765""},{""segmentid"":""REF"",""content"":""23:999999999""},{""segmentid"":""REF"",""content"":""ABB:01""}]}"   "[{""loopid"":""2100A"",""segments"":[{""segmentid"":""NM1"",""content"":""IL:1:XXXX:XXX:I:::34:99999999""}]}]" "[{""loopid"":""2300"",""segments"":[{""segmentid"":""HD"",""content"":""030::HLT:PPO:EMP""},{""segmentid"":""DTP"",""content"":""348:D8:20180127""},{""segmentid"":""REF"",""content"":""1L:001765""}]}]"

enter image description here

What I am not looking for: Anyone to write code for me.

What I am looking for: A push in the right direction.

The end goal would be to loop over this input data and assign real world titles to different positions. For example, If the second "SegementId" = "REF" and the first "Content" = "OF" Then "99999999" is the social_sec_number.

I am brand new to scala and spark and have been thrust on this project as a Dev Analyst I with basically nobody else in the company doing much in big data that I can go to for questions.

I know I can achieve the outcome I need by using spark dataframes but it seems terribly inefficient. I am doing a bunch of "regexreplace" and "splits". Creating new dataframes every time to further manipulate the data. Seems like looping over an array would be a much better bet.

I have been doing a bunch of reading on udfs, exploding, from_json, read.json. I have to go down these paths and get stuck which wastes valuable time when I do not know if it is even the most efficient path to go down.

For example, I am using Spark 2.0.0 and from_json is not available until 2.1

Just basically looking for someone to tell me the tools they would use to start this (i.e. Create a UDF based on explode and explode on ___) and I will do the research on how to make it work.

Current Path I am heading down:

val mydf1 = loadTable("raw_file","\t","file:///Users/Documents/PrepayWorkspace/PrePaySolution/EightThreeFour/src/main/Resources/Copy of Raw_Data_Set_Stored_As_Array.txt")       

     //working Loop 2000

     val mydf2  =  mydf1.filter("column_1='1'").select("column_7")

val mydf3 = mydf2.withColumn("_tmp", split($"column_7", "\\:\\[")).select(
        regexp_replace(regexp_replace($"_tmp".getItem(1),"\"",""),"\\]\\}","").as("col1")          
        ).drop("_tmp").toDF()

 val mydf4 = mydf3.withColumn("_tmp", split($"col1", "\\,\\{")).select(
         $"_tmp".getItem(0).as("col0"),
         $"_tmp".getItem(1).as("col1"),
         $"_tmp".getItem(2).as("col2"),
         $"_tmp".getItem(3).as("col3"),
         $"_tmp".getItem(4).as("col4"),  
         $"_tmp".getItem(5).as("col5") 
        ).drop("_tmp").toDF()

val mydf5 = mydf4.select(regexp_replace(regexp_replace($"col0","\\{",""),"\\}","").as("col0"),regexp_replace(regexp_replace($"col1","\\{",""),"\\}","").as("col1")
                    ,regexp_replace(regexp_replace($"col2","\\{",""),"\\}","").as("col2"),regexp_replace(regexp_replace($"col3","\\{",""),"\\}","").as("col3")
                    ,regexp_replace(regexp_replace($"col4","\\{",""),"\\}","").as("col4"),regexp_replace(regexp_replace($"col5","\\{",""),"\\}","").as("col5"))

        mydf5.show(20,false)

Current Output: enter image description here

Desired Output at the end: Referencing paragraph 4 above. A dataframe with column "social_sec_num" with row "999999999"

Ramesh Maharjan
  • 41,071
  • 6
  • 69
  • 97
Bleach
  • 561
  • 4
  • 11

0 Answers0