
In all the posted questions about this operation, I couldn't find anything that works.

I tried several versions; in all of them I have this DataFrame:

dataFrame = spark.read.format("com.mongodb.spark.sql").load()

The output of dataFrame.printSchema() is:

root
 |-- SensorId: string (nullable = true)
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- _type: string (nullable = true)
 |-- device: string (nullable = true)
 |-- deviceType: string (nullable = true)
 |-- event_id: string (nullable = true)
 |-- gen_val: string (nullable = true)
 |-- lane_id: string (nullable = true)
 |-- system_id: string (nullable = true)
 |-- time: string (nullable = true)

After the DataFrame is created, I want to cast the column 'gen_val' (which is stored in the variable results.inputColumns) from string type to double type. Different versions led to different errors.

Version #1

Code:

dataFrame = dataFrame.withColumn(results.inputColumns, dataFrame[results.inputColumns].cast('double'))

Using cast(DoubleType()) instead generates the same error.

Error:

AttributeError: 'DataFrame' object has no attribute 'cast'

Version #2

Code:

dataFrame = dataFrame.withColumn(results.inputColumns, dataFrame['gen_val'].cast('double'))

even though this option is not really relevant, because the parameter cannot be hard-coded...

Error:

dataFrame = dataFrame.withColumn(results.inputColumns, dataFrame['gen_val'].cast('double'))
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1502, in withColumn
  File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o31.withColumn. Trace:
py4j.Py4JException: Method withColumn([class java.util.ArrayList, class org.apache.spark.sql.Column]) does not exist
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
        at py4j.Gateway.invoke(Gateway.java:272)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)
• Please show dataFrame.show() or dataFrame.printSchema() – Zhang Tong Oct 25 '17 at 10:20
• And what is results.inputColumns? – Zhang Tong Oct 25 '17 at 10:21
• All of the errors here are caused by results.inputColumns being a list. In case #1, if you pass a list like dataFrame[list], it returns a new DataFrame object with the columns you specified; a DataFrame has no cast method, hence the error. If you pass a string instead, like dataFrame[str], it returns a Column object, which does have a cast method. In case #2, you got past the first issue, but now the Py4J exception says there is no withColumn method that takes a list as its first parameter – it must be a string, naming the new column. – halfnuts Mar 23 '21 at 03:45
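A minimal standalone sketch of the distinction halfnuts describes (the DataFrame here is made up for illustration, not the asker's MongoDB data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('1',), ('2',)], schema=['gen_val'])

print(type(df[['gen_val']]))  # pyspark.sql.dataframe.DataFrame -> no cast(), Version #1's error
print(type(df['gen_val']))    # pyspark.sql.column.Column -> cast() works

# withColumn's first argument must be a single string, not a list (Version #2's error)
df = df.withColumn('gen_val', df['gen_val'].cast('double'))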

2 Answers


It is not very clear what you are trying to do; the first argument of withColumn should be a column name, either an existing one (to be modified) or a new one (to be created), while (at least in your Version #1) you use results.inputColumns as if it were already a column (which it is not).

In any case, casting a string to double type is straightforward; here is a toy example:

spark.version
# u'2.2.0'

from pyspark.sql.types import DoubleType

df = spark.createDataFrame([("foo", '1'), ("bar", '2')], schema=['A', 'B'])
df
# DataFrame[A: string, B: string]
df.show()
# +---+---+ 
# |  A|  B|
# +---+---+
# |foo|  1| 
# |bar|  2|
# +---+---+

df2 = df.withColumn('B', df['B'].cast('double'))
df2.show()
# +---+---+ 
# |  A|  B|
# +---+---+
# |foo|1.0| 
# |bar|2.0|
# +---+---+
df2
# DataFrame[A: string, B: double]

In your case, this should do the job:

from pyspark.sql.types import DoubleType  # cast(DoubleType()) works the same as cast('double')
new_df = dataFrame.withColumn('gen_val', dataFrame['gen_val'].cast('double'))
• I did have a mistake in my question, regarding the variable with the column name. Ignoring this, I still had a problem extracting the column programmatically, without using a hard-coded name (dataFrame['gen_val']). Using a variable that holds the string didn't work. I actually found a solution for that - another way to get the column is dataFrame.gen_val, which means I can also get it like this - getattr(dataFrame, colNameVar) (see the sketch after these comments) – Ran P Oct 25 '17 at 13:27
• @RanP Thing is, mistakes in the question or not, answers take valuable time from the respondents, and since I have arguably addressed the error faced in your 'Version 2' attempt, upvoting as a courtesy would be most welcome... – desertnaut Oct 25 '17 at 14:17
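For reference, a minimal sketch of the equivalent ways to get a Column from a name held in a variable (colNameVar is a hypothetical stand-in for the asker's variable); any of them can feed cast():

from pyspark.sql import functions as F

colNameVar = 'gen_val'  # hypothetical: the column name held in a variable

c1 = dataFrame[colNameVar]           # bracket lookup with a single string
c2 = getattr(dataFrame, colNameVar)  # the attribute-style access from the comment above
c3 = F.col(colNameVar)               # a free-standing column reference by name

new_df = dataFrame.withColumn(colNameVar, c1.cast('double'))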
0

I tried something else and it worked: instead of altering the input column in place, I created a new, cast/transformed column. I think it's less efficient, but that's what I have for the moment.

from pyspark.ml.feature import VectorAssembler

dataFrame = spark.read.format("com.mongodb.spark.sql").load()
col = dataFrame.gen_val.cast('double')  # a Column expression, already cast here
dataFrame = dataFrame.withColumn('doubled', col)
assembler = VectorAssembler(inputCols=["doubled"], outputCol="features")
output = assembler.transform(dataFrame)
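A sketch of the in-place variant (overwriting gen_val rather than adding a separate 'doubled' column), assuming the same dataFrame as above:

from pyspark.ml.feature import VectorAssembler

# withColumn with an existing column name replaces that column
dataFrame = dataFrame.withColumn('gen_val', dataFrame.gen_val.cast('double'))
assembler = VectorAssembler(inputCols=["gen_val"], outputCol="features")
output = assembler.transform(dataFrame)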

For Zhang Tong: dataFrame.printSchema() prints exactly the schema already shown in the question above.

Anyway, this is a very basic transformation, and in the (near) future I will need to do more complex ones. If any of you know of good examples, instructions, or documentation for DataFrame transformations with Spark and Python, I will be grateful as hell.
