
We receive a fixed-width file with multiple headers and multiple sections, i.e., data about subgroups of a company.
The first record is the organization, followed by N different sections for subgroups of the company operating around the world.
Below is the data:

5512345worldwidenetwork123449
6634455australiannetwok123455
8823455 austestcrickifoxvf@abc.aus.co12344444
8878557 syndenycrickifoxvf@abc.aus.co45345444444
8834888 melbourcecricoxvf@abc.aus.co57645344444
99345676455
66987455uitedkingdomnetwok123455
8893855 uktestcrickixvf@zxc.uk.co87665559
8878557 londoncrickilondon@uk.co45345444444
8834889 machestermanctiry@man.uk.co57645344444
99788834
0000000

The data is organized as follows:

  • Top row contains the organization name (row starting with 55) (only one record per file)
    (the section below repeats N number of times)
  • Second row contains details of companies operating in one country (row starting with 66)
  • Next N records contain data for franchises of companies operating in separate cities (rows starting with 88)
  • Record indicating end of the subgroup data (row starting with 99)
  • Record indicating end of file (row starting with 00) (only one record per file)
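
The record layout above can be sketched outside Spark in plain Python. The two-character prefixes (55/66/88/99/00) come from the layout description; the sample line and the helper name `record_type` are just illustrative:

```python
# Sample lines following the layout described above.
lines = [
    "5512345worldwidenetwork123449",                   # 55: organization header
    "6634455australiannetwok123455",                   # 66: country section header
    "8823455 austestcrickifoxvf@abc.aus.co12344444",   # 88: franchise record
    "99345676455",                                     # 99: end of subgroup
    "0000000",                                         # 00: end of file
]

RECORD_TYPES = {
    "55": "organization",
    "66": "country",
    "88": "franchise",
    "99": "end-of-subgroup",
    "00": "end-of-file",
}

def record_type(line):
    """Classify a record by its two-character prefix."""
    return RECORD_TYPES.get(line[:2], "unknown")

types = [record_type(l) for l in lines]
```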

I am mostly interested in the data that starts with 88. What would be the right way to get this data? I tried the option below with no luck:

rdd1=sqlContext.read.text("file:///home/mahesh/source/network.txt");

df = rdd1.filter(lambda x: "21" in x).map(lambda x: x).toDF()

How can I filter only the records starting with 88, or is there another option to load this data?

Below is the expected output

8823455 austestcrickifoxvf@abc.aus.co12344444
8878557 syndenycrickifoxvf@abc.aus.co45345444444
8834888 melbourcecricoxvf@abc.aus.co57645344444
8893855 uktestcrickixvf@zxc.uk.co87665559
8878557 londoncrickilondon@uk.co45345444444
8834889 machestermanctiry@man.uk.co57645344444

Once I can filter all rows starting with 88, I can apply fixed-width parsing to get the data below:
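
As a sketch of that fixed-width step: the offsets below (7-character ID, 14-character name, 15-character email, remainder as subscriber count) are guesses inferred from the sample rows, not a confirmed file spec, and would need to match the real layout:

```python
def parse_franchise(line):
    """Split an 88-record into fields by fixed positions.
    Offsets are assumptions inferred from the sample data:
    [0:7] franchise ID, [8:22] name, [22:37] email, [37:] subscribers."""
    return {
        "franchise_id": line[0:7],
        "franchise_name": line[8:22],
        "email": line[22:37],
        "subscribers": line[37:],
    }

row = parse_franchise("8823455 austestcrickifoxvf@abc.aus.co12344444")
```

In Spark, the same slicing can be expressed with `pyspark.sql.functions.substring` on the filtered DataFrame.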

FranchiseID  FranchiseName   Email ID            No of subscribers
8823455      austestcrickif  oxvf@abc.aus.co     12344444
8878557      syndenycrickif  oxvf@abc.aus.co     45345444444
8834888      melbourcecric   oxvf@abc.aus.co     57645344444
8893855      uktestcricki    xvf@zxc.uk.co       87665559
8878557      londoncricki    london@uk.co        45345444444
8834889      machester       manctiry@man.uk.co  57645344444
user2516739
  • just use filter: `rdd1.filter(lambda x: x.startswith('88'))` – jxc Oct 23 '20 at 16:26
  • I get the error: TypeError: condition should be string or Column – user2516739 Oct 23 '20 at 16:42
    if you want to load into dataframe, then skip rdd, do: `df = spark.read.text("file:///home/mahesh/source/network.txt").filter("value like '88%'")` – jxc Oct 23 '20 at 17:01
  • Great, this worked; please add an answer so that I can accept it. Can you please let me know why I got the error "TypeError: condition should be string or Column"? – user2516739 Oct 23 '20 at 17:12
  • I can not reproduce the TypeError you mentioned with Spark 2.4.4 and 3.0.1. if you do need using RDD, try something like: `df = rdd1.filter(lambda x: x.startswith('88')).map(lambda x: [x]).toDF(['value'])` – jxc Oct 23 '20 at 17:20
    If you want to keep every `66` line and its associated `88` lines linked together, then this is not the right way. With Spark, you can read the file in block mode and then process the associated lines all in the same Row. For example, with your sample and Spark version 2.4+, you can use spark.read.text with the option `lineSep='\n99'` http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.text. For older versions, you will need to handle this from the RDD; check one of my old posts: https://stackoverflow.com/a/57967235/9510729 – jxc Oct 23 '20 at 17:42
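
The block-mode idea in the last comment can be illustrated without Spark: grouping each `66` header together with its `88` rows, closing a block at each `99` terminator. This is a plain-Python sketch of the grouping only, not of the actual `spark.read.text(..., lineSep='\n99')` call:

```python
# A trimmed copy of the sample file: two country sections,
# each terminated by a 99-record.
raw = """6634455australiannetwok123455
8823455 austestcrickifoxvf@abc.aus.co12344444
8878557 syndenycrickifoxvf@abc.aus.co45345444444
99345676455
66987455uitedkingdomnetwok123455
8893855 uktestcrickixvf@zxc.uk.co87665559
99788834"""

blocks, current = [], []
for line in raw.splitlines():
    if line.startswith("99"):            # end of subgroup: close the block
        blocks.append(current)
        current = []
    elif line.startswith(("66", "88")):  # keep header + franchise rows together
        current.append(line)
```

Each element of `blocks` now holds one country header followed by its franchise records, which is what the `lineSep` trick achieves at the Spark level.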

0 Answers