
We receive a fixed-width file with multiple headers and multiple sections, i.e., data about subgroups of a company.
The first record is the organization, followed by N different sections for subgroups of the company operating around the world.
Below is the data:

5512345worldwidenetwork123449
6634455australiannetwok123455
8823455 austestcrickifoxvf@abc.aus.co12344444
8878557 syndenycrickifoxvf@abc.aus.co45345444444
8834888 melbourcecricoxvf@abc.aus.co57645344444
99345676455
66987455uitedkingdomnetwok123455
8893855 uktestcrickixvf@zxc.uk.co87665559
8878557 londoncrickilondon@uk.co45345444444
8834889 machestermanctiry@man.uk.co57645344444
99788834
0000000

The data is organized as follows:

  • Top row contains the organization name (row starting with 55) (only one record per file)
    (the section below repeats N number of times)
  • Second row contains details of companies operating in one country (row starting with 66)
  • Next N records contain data for franchises of companies operating in separate cities (rows starting with 88)
  • Record indicating end of the subgroup data (row starting with 99)
  • Record indicating end of file (row starting with 00) (only one record per file)
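
The record layout above can be sketched outside Spark in plain Python. The two-character prefixes (55/66/88/99/00) come from the layout description; the sample line and the helper name `record_type` are just illustrative:

```python
# Sample lines following the layout described above.
lines = [
    "5512345worldwidenetwork123449",                   # 55: organization header
    "6634455australiannetwok123455",                   # 66: country section header
    "8823455 austestcrickifoxvf@abc.aus.co12344444",   # 88: franchise record
    "99345676455",                                     # 99: end of subgroup
    "0000000",                                         # 00: end of file
]

RECORD_TYPES = {
    "55": "organization",
    "66": "country",
    "88": "franchise",
    "99": "end-of-subgroup",
    "00": "end-of-file",
}

def record_type(line):
    """Classify a record by its two-character prefix."""
    return RECORD_TYPES.get(line[:2], "unknown")

types = [record_type(l) for l in lines]
```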

I am mostly interested in the data that starts with 88. What would be the right way to get this data? I tried the option below with no luck:

rdd1=sqlContext.read.text("file:///home/mahesh/source/network.txt");

df = rdd1.filter(lambda x: "21" in x).map(lambda x: x).toDF()

How can I filter only the records starting with 88, or is there another option to load this data?

Below is the expected output

8823455 austestcrickifoxvf@abc.aus.co12344444
8878557 syndenycrickifoxvf@abc.aus.co45345444444
8834888 melbourcecricoxvf@abc.aus.co57645344444
8893855 uktestcrickixvf@zxc.uk.co87665559
8878557 londoncrickilondon@uk.co45345444444
8834889 machestermanctiry@man.uk.co57645344444

Once I can filter all rows starting with 88, I can apply fixed-width parsing to get the data below:
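
As a sketch of that fixed-width step: the offsets below (7-character ID, 14-character name, 15-character email, remainder as subscriber count) are guesses inferred from the sample rows, not a confirmed file spec, and would need to match the real layout:

```python
def parse_franchise(line):
    """Split an 88-record into fields by fixed positions.
    Offsets are assumptions inferred from the sample data:
    [0:7] franchise ID, [8:22] name, [22:37] email, [37:] subscribers."""
    return {
        "franchise_id": line[0:7],
        "franchise_name": line[8:22],
        "email": line[22:37],
        "subscribers": line[37:],
    }

row = parse_franchise("8823455 austestcrickifoxvf@abc.aus.co12344444")
```

In Spark, the same slicing can be expressed with `pyspark.sql.functions.substring` on the filtered DataFrame.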

FranchiseID  FranchiseName   Email ID            No of subscribers
8823455      austestcrickif  oxvf@abc.aus.co     12344444
8878557      syndenycrickif  oxvf@abc.aus.co     45345444444
8834888      melbourcecric   oxvf@abc.aus.co     57645344444
8893855      uktestcricki    xvf@zxc.uk.co       87665559
8878557      londoncricki    london@uk.co        45345444444
8834889      machester       manctiry@man.uk.co  57645344444
user2516739
  • just use filter: `rdd1.filter(lambda x: x.startswith('88'))` – jxc Oct 23 '20 at 16:26
  • I get the error: TypeError: condition should be string or Column – user2516739 Oct 23 '20 at 16:42
    if you want to load into dataframe, then skip rdd, do: `df = spark.read.text("file:///home/mahesh/source/network.txt").filter("value like '88%'")` – jxc Oct 23 '20 at 17:01
  • Great, this worked; please add an answer so that I can accept it. Can you please let me know why I got the error "TypeError: condition should be string or Column"? – user2516739 Oct 23 '20 at 17:12
  • I can not reproduce the TypeError you mentioned with Spark 2.4.4 and 3.0.1. if you do need using RDD, try something like: `df = rdd1.filter(lambda x: x.startswith('88')).map(lambda x: [x]).toDF(['value'])` – jxc Oct 23 '20 at 17:20
    If you want to keep every `66` line and its associated `88` lines linked together, then this is not the right way. With Spark, you can read the file in block mode and then process the associated lines all in the same Row. For example, with your sample and Spark version 2.4+, you can use spark.read.text with the option `lineSep='\n99'` http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.text. For older versions, you will need to handle this from the RDD; check one of my old posts: https://stackoverflow.com/a/57967235/9510729 – jxc Oct 23 '20 at 17:42
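
The block-mode idea in the last comment can be illustrated without Spark: grouping each `66` header together with its `88` rows, closing a block at each `99` terminator. This is a plain-Python sketch of the grouping only, not of the actual `spark.read.text(..., lineSep='\n99')` call:

```python
# A trimmed copy of the sample file: two country sections,
# each terminated by a 99-record.
raw = """6634455australiannetwok123455
8823455 austestcrickifoxvf@abc.aus.co12344444
8878557 syndenycrickifoxvf@abc.aus.co45345444444
99345676455
66987455uitedkingdomnetwok123455
8893855 uktestcrickixvf@zxc.uk.co87665559
99788834"""

blocks, current = [], []
for line in raw.splitlines():
    if line.startswith("99"):            # end of subgroup: close the block
        blocks.append(current)
        current = []
    elif line.startswith(("66", "88")):  # keep header + franchise rows together
        current.append(line)
```

Each element of `blocks` now holds one country header followed by its franchise records, which is what the `lineSep` trick achieves at the Spark level.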

0 Answers