
I tried to run fillna to insert NaN into a column whose name contains the special character ".":

df = spark.createDataFrame(
    [(None, None), ('U1', None), ('U3', 1.0)], 
    ['USER_ID', 'a.b']
)

I tried

df = df.fillna({"`a.b`": float("nan")})

also

df = df.fillna({"a.b": float("nan")})

Neither of them works. Does anyone have experience with this?

yi wang

2 Answers


It seems there is a limitation in pyspark.sql.DataFrame.fillna(): it does not let you target column names that contain periods when you pass the value parameter as a dictionary.

From the docs:

value – int, long, float, string, bool or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, boolean, or string.

You should be able to call fillna with the other syntax, which specifies the value and subset parameters separately.

df.fillna(value=float("nan"), subset=["a.b"]).show()
#+-------+---+
#|USER_ID|a.b|
#+-------+---+
#|   null|NaN|
#|     U1|NaN|
#|     U3|1.0|
#+-------+---+

The above worked for me in Spark 2.4, but I don't see why it should not work on older versions.
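
If the subset syntax fails on your Spark version (the comments on the other answer below show a "Cannot resolve column name" error), one alternative is to bypass fillna entirely and coalesce the nulls away, referencing the dotted column with backticks via pyspark.sql.functions.col. This is a sketch, assuming withColumn treats its name argument as a literal column name, which is how I understand it to behave:

from pyspark.sql import functions as F

# `a.b` (with backticks) refers to the single column named "a.b";
# without them, Spark would parse it as field "b" of a struct "a".
# coalesce returns its first non-null argument, so nulls become NaN.
df.withColumn("a.b", F.coalesce(F.col("`a.b`"), F.lit(float("nan")))).show()
# should print the same output as the fillna call above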

If you are still having trouble, another way to do this is to temporarily rename your columns, call fillna, and then rename the columns back to their original values:

Here I rename the columns to replace the "." with the string "_DOT_", chosen deliberately so it doesn't collide with any substring that already appears in a column name.

df.toDF(*[c.replace(".", "_DOT_") for c in df.columns])\
    .fillna({"a_DOT_b": float("nan")})\
    .toDF(*df.columns)\
    .show()
#+-------+---+
#|USER_ID|a.b|
#+-------+---+
#|   null|NaN|
#|     U1|NaN|
#|     U3|1.0|
#+-------+---+
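
If you need to do this for several dotted columns, the same rename-and-restore trick can be wrapped in a small helper. The name fillna_dotted is my own, not a Spark API; treat this as a sketch:

def fillna_dotted(df, mapping):
    # Swap "." for a placeholder in both the column names and the
    # dict keys, call fillna, then restore the original column names.
    safe_cols = [c.replace(".", "_DOT_") for c in df.columns]
    safe_mapping = {k.replace(".", "_DOT_"): v for k, v in mapping.items()}
    return df.toDF(*safe_cols).fillna(safe_mapping).toDF(*df.columns)

fillna_dotted(df, {"a.b": float("nan")}).show()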
pault

This works:

df = spark.createDataFrame([(None, None), ('U1', None), ('U3', 1.0)], ['USER_ID', 'a.b'])
df = df.fillna(float("nan"), ['`a.b`'])
df.show(10, False)

+-------+---+
|USER_ID|a.b|
+-------+---+
|null   |NaN|
|U1     |NaN|
|U3     |1.0|
+-------+---+
Lamanus
  • running in my notebook it still failed – yi wang Aug 14 '20 at 06:36
  • An error was encountered: u'Cannot resolve column name "a.b" among (USER_ID, a.b);' Traceback (most recent call last): File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1672, in fillna return DataFrame(self._jdf.na().fill(value, self._jseq(subset)), self.sql_ctx) File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in – yi wang Aug 14 '20 at 06:37
  • what version of spark and python? – Lamanus Aug 14 '20 at 06:39
  • Actually, it works once i tested it again. Thanks – yi wang Dec 21 '20 at 18:43