0

I've to read a file which has inconsistent spaces as column delimiters. Any suggestions how I can read this using Python. Eventually I need this data to be in a pyspark dataframe.

File content is as follows:

AutoID               AutoGUID                             ServerID         ReceivedUTC
244021856            B22AD225-1373-4F13-9ADE-38963BA67835 GOEQXPWEPO020    2019-11-02 13:57:25.973
Harshal Parekh
  • 5,918
  • 4
  • 21
  • 43
2713
  • 185
  • 1
  • 10
  • 1
    https://stackoverflow.com/questions/15026698/how-to-make-separator-in-pandas-read-csv-more-flexible-wrt-whitespace-for-irreg does this work for you? – E.ZY. Mar 13 '20 at 16:31

3 Answers3

0

As mentioned in this link How to change tab delimited in to comma delimited in pandas You can change the delimiters to None or to a specific character in pandas like:

pd.read_csv(filename,sep=None)  

or

file = pd.read_csv(filename, sep="\t")

Feel free to check the documentation as it might give you some hints https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

0

This file format is known as a fixed-width file. pandas has a function specifically for reading this type of file: read_fwf

By default, pandas will infer the width of each column. If you find that it has trouble doing so, you can investigate the colspecs optional parameter.

You can convert the resulting pandas.DataFrame to a pyspark DataFrame using:

spark.createDataFrame(pandas_df)

as documented by pyspark.

dspencer
  • 4,297
  • 4
  • 22
  • 43
0

In Python we can use regular expression split, we are splitting data based on inconsistent spaces.

import re
re.split("\\s+",'a b   c')
['a', 'b', 'c']

In Pyspark:

#sample data
$ cat i.txt
one two   three   four   five
six    seven    eight nine ten

cols=["col1","col2","col3","col4","col5"]
spark.sparkContext.textFile("<file_path>/i.txt").map(lambda x:re.split("\\s+",x)).toDF(cols).show()

#creating dataframe on the file with inconsistent spaces.
#+----+-----+-----+----+----+
#|col1| col2| col3|col4|col5|
#+----+-----+-----+----+----+
#| one|  two|three|four|five|
#| six|seven|eight|nine| ten|
#+----+-----+-----+----+----+
notNull
  • 30,258
  • 4
  • 35
  • 50