Reading a fixed-width file into Spark is easy, and there are multiple ways to do so. However, I could not find a way to write fixed-width output from Spark (2.3.1). Would converting the DataFrame to an RDD help? I am currently using PySpark, but an answer in any language is welcome. Can someone suggest a way out?
Here is an example of what I described in the comments. You can use pyspark.sql.functions.format_string() to format each column to a fixed width and then use pyspark.sql.functions.concat() to combine them all into one string.
For example, suppose you had the following DataFrame:
data = [
(1, "one", "2016-01-01"),
(2, "two", "2016-02-01"),
(3, "three", "2016-03-01")
]
df = spark.createDataFrame(data, ["id", "value", "date"])
df.show()
#+---+-----+----------+
#| id|value| date|
#+---+-----+----------+
#| 1| one|2016-01-01|
#| 2| two|2016-02-01|
#| 3|three|2016-03-01|
#+---+-----+----------+
Let's say you wanted to write out the data left-justified with a fixed width of 10:
from pyspark.sql.functions import concat, format_string
fixed_width = 10
ljust = r"%-{width}s".format(width=fixed_width)
df.select(
concat(*[format_string(ljust,c) for c in df.columns]).alias("fixedWidth")
).show(truncate=False)
#+------------------------------+
#|fixedWidth |
#+------------------------------+
#|1 one 2016-01-01|
#|2 two 2016-02-01|
#|3 three 2016-03-01|
#+------------------------------+
Here we use the printf-style format %-10s to specify a left-justified width of 10.
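One caveat: printf-style padding does not truncate values that are longer than the specified width, so an oversized value will break the alignment. If you need a hard cap, a minimal sketch (assuming truncation is acceptable for your data) is to trim each value with pyspark.sql.functions.substring before padding:
from pyspark.sql.functions import concat, format_string, substring

df.select(
    concat(*[
        # substring is 1-based; trim each value to fixed_width before padding
        format_string(ljust, substring(c, 1, fixed_width))
        for c in df.columns
    ]).alias("fixedWidth")
).show(truncate=False)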
If instead you wanted to right-justify your strings, remove the negative sign:
rjust = r"%{width}s".format(width=fixed_width)
df.select(
concat(*[format_string(rjust,c) for c in df.columns]).alias("fixedWidth")
).show(truncate=False)
#+------------------------------+
#|fixedWidth |
#+------------------------------+
#| 1 one2016-01-01|
#| 2 two2016-02-01|
#| 3 three2016-03-01|
#+------------------------------+
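In practice, each field usually has its own width rather than one shared width. Here is a hedged sketch of the same approach with per-column widths; the widths dict below is made up for illustration:
from pyspark.sql.functions import concat, format_string

# hypothetical widths, one per column; adjust to match your file spec
widths = {"id": 5, "value": 8, "date": 12}

df.select(
    concat(*[
        format_string("%-{}s".format(widths[c]), c) for c in df.columns
    ]).alias("fixedWidth")
).show(truncate=False)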
Now you can write out only the fixedWidth column to your output file.
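For instance, since DataFrameWriter.text expects a DataFrame with exactly one string column, something along these lines should work (the output path is a placeholder):
from pyspark.sql.functions import concat, format_string

# reusing ljust and df from the example above
fixed = df.select(
    concat(*[format_string(ljust, c) for c in df.columns]).alias("fixedWidth")
)

# write.text requires a single string column, which is what we have here
fixed.write.text("/path/to/fixed_width_output")  # placeholder path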

– pault
- What would happen if the number of columns is large and the total length of the concatenated string is very long? Is there any problem with the maximum number of characters stored in a single column? – Sreenath Chothar Jan 25 '19 at 08:45
- @SreenathChothar why don't you try it and see what happens? If you run into trouble, you can post a new question. – pault Jan 25 '19 at 14:25
- How do I format a date column? I tried, after converting the string column 'date' to a date-type column 'date1': df.withColumn('new', format_string('%tc', 'date1')), but it fails with an IllegalFormatConversionException. We have a large number of date-type columns in the DataFrame that need to be written out as a fixed-width file. – Sreenath Chothar Jan 29 '19 at 06:53
- @SreenathChothar please post a new question and provide a small [reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). – pault Jan 29 '19 at 14:17