
I have a fixed-width text file (sample) with data

2107abc2018abn2019gfh

where all the rows' data are combined into a single row. I need to read the text file, split the data according to a fixed row length of 7, generate multiple rows, and store them in an RDD.

2107abc

2018abn

2019gfh

where 2107 is one column and abc is another column.

Will this logic be applicable for a huge data file, like 1 GB or more?

  • You can try a map operation to split each row into multiple arrays, like Array(Array()). Then do a flatMap to get a one-dimensional Array() per row. – Bhima Rao Gogineni Dec 19 '18 at 06:57
  • This is a similar question to yours: https://stackoverflow.com/questions/52031127/how-to-read-a-fixed-character-length-format-file-in-spark but they assume there that the file contains multiple lines. – Mahmoud Hanafy Dec 19 '18 at 07:18
  • You can split your file into multiple lines first and then apply the suggested solution. – Mahmoud Hanafy Dec 19 '18 at 07:19

1 Answer


I'm assuming that you have an RDD[String] and you want to extract both columns from your data. First you can split the line at length 7, and then again at length 4. That separates your columns. Below is the code for the same.

//creating a sample RDD from the given string
val rdd = sc.parallelize(Seq("""2107abc2018abn2019gfh"""))

//first split at length 7 (one chunk per row), then split each chunk at
//length 4, producing an RDD of (col1, col2) pairs
val res = rdd.flatMap(_.grouped(7).map(_.grouped(4).toSeq)).map(x => (x(0), x(1)))

//print the rdd
res.foreach(println)

//output
//(2107,abc)
//(2018,abn)
//(2019,gfh)
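Regarding the 1 GB question: the transformation itself scales, since Spark applies it per record across partitions. Below is a minimal sketch reading from a file; the path /data/fixedwidth.txt is a placeholder. One caveat: sc.textFile splits records on newlines, so if the whole file really is a single line, it arrives as one record and grouped runs inside a single task for that record.

//minimal sketch for a file on disk; the path is a placeholder
val fileRdd = sc.textFile("/data/fixedwidth.txt")

//same per-record logic as above; note a file that is one giant line
//becomes a single record and is processed by one task
val fileRes = fileRdd.flatMap(_.grouped(7).map(_.grouped(4).toSeq)).map(x => (x(0), x(1)))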

If you want, you can also convert your RDD to a DataFrame for further processing.

//convert to DF (toDF needs the SparkSession implicits; in spark-shell they are pre-imported)
import spark.implicits._
val df = res.toDF("col1","col2")

//print the dataframe
df.show
//+----+----+
//|col1|col2|
//+----+----+
//|2107| abc|
//|2018| abn|
//|2019| gfh|
//+----+----+
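If the first column is numeric, you could also cast it before further processing. A small sketch; the cast to int is an assumption about your data:

//cast col1 to an integer type; assumes every value is a valid int
val typed = df.withColumn("col1", df("col1").cast("int"))
typed.printSchema
//root
// |-- col1: integer (nullable = true)
// |-- col2: string (nullable = true)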