Reading a fixed-width file into Spark is easy, and there are multiple ways to do so. However, I could not find a way to write fixed-width output from Spark (2.3.1). Would converting the DataFrame to an RDD help? I am currently using PySpark, but an answer in any language is welcome. Can someone suggest a way out?
Here is an example of what I described in the comments. You can use pyspark.sql.functions.format_string() to format each column to a fixed width and then use pyspark.sql.functions.concat() to combine them all into one string.
For example, suppose you had the following DataFrame:
data = [
(1, "one", "2016-01-01"),
(2, "two", "2016-02-01"),
(3, "three", "2016-03-01")
]
df = spark.createDataFrame(data, ["id", "value", "date"])
df.show()
#+---+-----+----------+
#| id|value| date|
#+---+-----+----------+
#| 1| one|2016-01-01|
#| 2| two|2016-02-01|
#| 3|three|2016-03-01|
#+---+-----+----------+
Let's say you wanted to write out the data left-justified with a fixed width of 10:
from pyspark.sql.functions import concat, format_string
fixed_width = 10
ljust = r"%-{width}s".format(width=fixed_width)
df.select(
concat(*[format_string(ljust,c) for c in df.columns]).alias("fixedWidth")
).show(truncate=False)
#+------------------------------+
#|fixedWidth |
#+------------------------------+
#|1 one 2016-01-01|
#|2 two 2016-02-01|
#|3 three 2016-03-01|
#+------------------------------+
Here we use the printf-style format %-10s to specify a left-justified width of 10.
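One caveat: printf-style padding does not truncate values that are longer than the specified width, so an oversized value will break the alignment. If you need a hard cap, a minimal sketch (assuming truncation is acceptable for your data) is to trim each value with pyspark.sql.functions.substring before padding:
from pyspark.sql.functions import concat, format_string, substring

df.select(
    concat(*[
        # substring is 1-based; trim each value to fixed_width before padding
        format_string(ljust, substring(c, 1, fixed_width))
        for c in df.columns
    ]).alias("fixedWidth")
).show(truncate=False)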
If instead you wanted to right-justify your strings, remove the negative sign:
rjust = r"%{width}s".format(width=fixed_width)
df.select(
concat(*[format_string(rjust,c) for c in df.columns]).alias("fixedWidth")
).show(truncate=False)
#+------------------------------+
#|fixedWidth |
#+------------------------------+
#| 1 one2016-01-01|
#| 2 two2016-02-01|
#| 3 three2016-03-01|
#+------------------------------+
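In practice, each field usually has its own width rather than one shared width. Here is a hedged sketch of the same approach with per-column widths; the widths dict below is made up for illustration:
from pyspark.sql.functions import concat, format_string

# hypothetical widths, one per column; adjust to match your file spec
widths = {"id": 5, "value": 8, "date": 12}

df.select(
    concat(*[
        format_string("%-{}s".format(widths[c]), c) for c in df.columns
    ]).alias("fixedWidth")
).show(truncate=False)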
Now you can write out only the fixedWidth column to your output file.
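For instance, since DataFrameWriter.text expects a DataFrame with exactly one string column, something along these lines should work (the output path is a placeholder):
from pyspark.sql.functions import concat, format_string

# reusing ljust and df from the example above
fixed = df.select(
    concat(*[format_string(ljust, c) for c in df.columns]).alias("fixedWidth")
)

# write.text requires a single string column, which is what we have here
fixed.write.text("/path/to/fixed_width_output")  # placeholder path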

– pault
- What would happen if the number of columns is large and the total length of the concatenated string is very long? Is there any problem with the maximum number of characters stored in a single column? – Sreenath Chothar Jan 25 '19 at 08:45
- @SreenathChothar why don't you try it and see what happens? If you run into trouble, you can post a new question. – pault Jan 25 '19 at 14:25
- How do I format a date column? I tried, after converting the string column 'date' to a date-type column 'date1': df.withColumn('new', format_string('%tc', 'date1')), but it fails with an IllegalFormatConversionException. We have a large number of date-type columns in the DataFrame that need to be written out as a fixed-width file. – Sreenath Chothar Jan 29 '19 at 06:53
- @SreenathChothar please post a new question and provide a small [reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). – pault Jan 29 '19 at 14:17