
This post does a great job of showing how to parse a fixed-width text file into a Spark dataframe with pyspark (pyspark parse text file).

I have several text files I want to parse, but they each have slightly different schemas. Rather than writing out the same procedure for each one, as the previous post suggests, I'd like to write a generic function that can parse a fixed-width text file given the widths and column names.

I'm pretty new to pyspark, so I'm not sure how to write a select statement where the number of columns and their types are variable.

Any help would be appreciated!

freddyab

1 Answer


Say we have a text file like the one in the example thread:

00101292017you1234
00201302017 me5678

in "/tmp/sample.txt". And a dictionary containing for each file name, a list of columns and a list of width:

schema_dict = {
    "sample": {
        "columns": ["id", "date", "string", "integer"],
        "width": [3, 8, 3, 4]
    }
}

We can load the dataframes and split them into columns iteratively, using:

import numpy as np

input_path = "/tmp/"
df_dict = dict()
for file in schema_dict.keys():
    # spark.read.text loads each line of the file into a single string column named "value"
    df = spark.read.text(input_path + file + ".txt")
    # a running sum of the widths gives each field's 1-based start position:
    # widths [3, 8, 3, 4] -> starts [1, 4, 12, 15] (the final sum is dropped)
    start_list = np.cumsum([1] + schema_dict[file]["width"]).tolist()[:-1]
    # slice "value" into one column per field and give it its name
    df_dict[file] = df.select(
        [
            df.value.substr(
                start_list[i],
                schema_dict[file]["width"][i]
            ).alias(schema_dict[file]["columns"][i])
            for i in range(len(start_list))
        ]
    )

Calling df_dict["sample"].show() then gives:

+---+--------+------+-------+
| id|    date|string|integer|
+---+--------+------+-------+
|001|01292017|   you|   1234|
|002|01302017|    me|   5678|
+---+--------+------+-------+
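
If you'd rather have this wrapped up as a single reusable function, and also handle the variable column types the question mentions, you can apply the same slicing logic and optionally cast each column. A minimal sketch, assuming the active spark session from above; the types argument (a list of Spark type names handed to Column.cast) is my own addition, not something from the original schema_dict:

def parse_fixed_width(spark, path, columns, widths, types=None):
    # columns: list of column names
    # widths:  list of field widths, in the same order
    # types:   optional list of Spark type names (e.g. "int");
    #          columns stay strings when omitted
    df = spark.read.text(path)
    cols, start = [], 1  # substr positions are 1-based in Spark
    for i, (name, width) in enumerate(zip(columns, widths)):
        col = df.value.substr(start, width)
        if types is not None:
            col = col.cast(types[i])
        cols.append(col.alias(name))
        start += width
    return df.select(cols)

df_sample = parse_fixed_width(
    spark, "/tmp/sample.txt",
    columns=["id", "date", "string", "integer"],
    widths=[3, 8, 3, 4],
    types=["string", "string", "string", "int"],
)

This gives the same dataframe as above, except that the integer column is cast to an actual int instead of staying a string.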
MaFF