
I'm creating Row objects in Spark. I do not want my fields to be ordered alphabetically. However, if I do the following, they are ordered alphabetically:

row = Row(foo=1, bar=2)

Then it creates an object like the following:

Row(bar=2, foo=1)

When I then create a DataFrame from this object, the column order will be bar first, foo second, whereas I'd prefer it the other way around.

I know I can use "_1" and "_2" (for "foo" and "bar", respectively) and then assign a schema (with appropriate "foo" and "bar" names). But is there any way to prevent the Row object from ordering them?

– rye

3 Answers


Spark >= 3.0

Field sorting was removed by SPARK-29748 (Remove sorting of fields in PySpark SQL Row creation), except in legacy mode, when the following environment variable is set:

PYSPARK_ROW_FIELD_SORTING_ENABLED=true 
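
With sorting removed, Row simply preserves the order in which you pass the keyword arguments (Python 3.6+ keeps kwargs ordered). A minimal sketch, assuming an active `spark` session:

from pyspark.sql import Row

# Spark >= 3.0: field order follows the keyword argument order
row = Row(foo=1, bar=2)
spark.createDataFrame([row]).printSchema()
# root
#  |-- foo: long (nullable = true)
#  |-- bar: long (nullable = true)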

Spark < 3.0

But is there any way to prevent the Row object from ordering them?

There isn't. If you provide kwargs, the arguments will be sorted by name. Sorting is required for deterministic behavior, because Python before 3.6 doesn't preserve the order of keyword arguments.

Just use plain tuples:

rdd = sc.parallelize([(1, 2)])

and pass the schema as an argument to RDD.toDF (not to be confused with DataFrame.toDF):

rdd.toDF(["foo", "bar"])

or createDataFrame:

from pyspark.sql.types import StructType, StructField, IntegerType

spark.createDataFrame(rdd, ["foo", "bar"])

# With full schema
schema = StructType([
    StructField("foo", IntegerType(), False),
    StructField("bar", IntegerType(), False)])

spark.createDataFrame(rdd, schema)

You can also use namedtuples:

from collections import namedtuple

FooBar = namedtuple("FooBar", ["foo", "bar"])
spark.createDataFrame([FooBar(foo=1, bar=2)])

Finally, you can reorder columns with select:

sc.parallelize([Row(foo=1, bar=2)]).toDF().select("foo", "bar")
– zero323
  • What if I have a nested Row object? Row(foo=Row(foo1=1, bar1=2)) – rye Feb 11 '16 at 15:54
  • Use tuples and provide a schema. – zero323 Feb 11 '16 at 15:56
  • Okay, thank you! That's what I'm currently doing. The only reason I ask is because I'm constructing a very complicated (nested, arrays, etc.) object and was hoping there was a way to avoid creating a schema. – rye Feb 11 '16 at 15:58
  • It is better to do it anyway. You save the time required for schema inference and avoid some categories of errors. – zero323 Feb 11 '16 at 16:01
  • Passing a list of column names to toDF as in `RDD.toDF([list:str])` does not work for me on Spark 2.4.1. I have to use `RDD.toDF().select(list:str)`, which does work. – rjurney Apr 19 '19 at 21:09
  • @rjurney Sounds like you confused the methods: `rdd.toDF` refers to the [monkey-patched `RDD.toDF`](https://stackoverflow.com/a/32788832/10465355), which [doesn't take varargs](https://github.com/apache/spark/blob/eaa88ae5237b23fb7497838f3897a64641efe383/python/pyspark/sql/session.py#L45-L58). You, on the other hand, refer to [`DataFrame.toDF`](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.toDF), which indeed uses varargs and doesn't support lists or a sampling rate. – 10465355 Apr 19 '19 at 22:27
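
For the nested case raised in the comments, the same tuples-plus-schema approach nests naturally; a minimal sketch (field names are illustrative):

from pyspark.sql.types import StructType, StructField, IntegerType

# Inner tuples map onto inner structs in the schema
data = [(1, (1, 2))]

nested_schema = StructType([
    StructField("foo", IntegerType(), False),
    StructField("bar", StructType([
        StructField("foo1", IntegerType(), False),
        StructField("bar1", IntegerType(), False)]), False)])

spark.createDataFrame(data, nested_schema)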

From the documentation:

Row also can be used to create another Row like class, then it could be used to create Row objects

In this case the order of columns is preserved:

>>> FooRow = Row('foo', 'bar')
>>> row = FooRow(1, 2)
>>> spark.createDataFrame([row]).dtypes
[('foo', 'bigint'), ('bar', 'bigint')]
– Patrick Z
    But in that case you lose the ability to pass it named arguments. `FooRow(bar=2, foo=1)` will fail. – ilir May 02 '18 at 13:38

To sort your original schema so it matches the alphabetical field order of the RDD:

from pyspark.sql.types import StructType

# Build a new schema with the fields of an existing DataFrame's
# schema (here: df) sorted alphabetically by name
schema_sorted = StructType()
structfield_list_sorted = sorted(df.schema, key=lambda x: x.name)
for item in structfield_list_sorted:
    schema_sorted.add(item)
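
The sorted schema can then be applied when rebuilding the DataFrame; a sketch, where `rows_rdd` is a hypothetical RDD whose fields are already in alphabetical order:

# rows_rdd: stands in for your alphabetically ordered rows
df_sorted = spark.createDataFrame(rows_rdd, schema_sorted)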
– bloodrootfc