
I have the following summary for a dataset, using PySpark on Databricks:

OrderMonthYear                SaleAmount
2012-11-01T00:00:00.000+0000  473760.5700000001
2010-04-01T00:00:00.000+0000  490967.0900000001

I'm getting a DataFrame error with this map function, which tries to convert OrderMonthYear to an integer:

results = summary.map(lambda r: (int(r.OrderMonthYear.replace('-','')), r.SaleAmount)).toDF(["OrderMonthYear","SaleAmount"])

Any ideas? The error is:

AttributeError: 'DataFrame' object has no attribute 'map'
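As background for the error: in Spark 2.x and later, `DataFrame` has no `.map`; that method lives on the underlying RDD (`summary.rdd.map(...)`). Separately, the per-row conversion the lambda attempts can be sketched in plain Python — `order_month_to_int` is a hypothetical helper name, and the sample value is taken from the question:

```python
from datetime import datetime

def order_month_to_int(ts: str) -> int:
    """Parse an ISO-8601 timestamp string and return it as a
    yyyyMMdd-style integer (hypothetical helper for illustration)."""
    # keep only the date part; the time, fractional seconds, and
    # '+0000' offset would otherwise break a plain strptime parse
    dt = datetime.strptime(ts[:10], "%Y-%m-%d")
    return int(dt.strftime("%Y%m%d"))

print(order_month_to_int("2012-11-01T00:00:00.000+0000"))  # 20121101
```

Note this only applies if the column is a string; as the comments below establish, the column here is a timestamp, so the `.replace` call in the lambda would fail anyway.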
Thanh Nguyen Van
  • You can't convert that to an integer because there are strings that you didn't replace (T, +, :) – mck Apr 07 '21 at 16:35
  • Hey, thanks for the reply; the column is a timestamp, not a string: DataFrame[OrderMonthYear: timestamp] – Tanai Goncalves Apr 07 '21 at 17:05
  • Then why are you calling `replace`? That's a string method. – mck Apr 07 '21 at 17:24
  • Got it. Even when I try to use datetime functions it doesn't work: test = summary.select("OrderMonthYear").apply(lambda x: x.strftime('%d%m%Y')) raises 'DataFrame' object has no attribute 'apply'. I guess my SQL call is confusing the DataFrame structure? data = sqlContext.read.format("csv") – Tanai Goncalves Apr 07 '21 at 17:54
  • What's your desired output? – mck Apr 07 '21 at 18:04

1 Answer


Found a solution here: Pyspark date yyyy-mmm-dd conversion

from pyspark.sql.functions import col, date_format

# OrderMonthYear is already a timestamp column, so date_format can be applied
# to it directly; the unix_timestamp/from_unixtime round-trip from the linked
# answer (and its 'yyyy-MMM' parse pattern, which doesn't match this data) is
# only needed when the column is a string.
df = summary.withColumn("new_date_str", date_format(col("OrderMonthYear"), "yyyyMMdd"))

# cast to the integer the question asked for
df2 = df.withColumn("OrderMonthYear", col("new_date_str").cast("int"))
display(df2)

Thank you @mck for the help! Cheers.
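For readers mapping between the two format languages involved: Spark's `date_format` uses JVM-style patterns (`yyyyMMdd`), which correspond to Python's `strftime` codes (`%Y%m%d`). A minimal sketch of that correspondence, using the sample date from the question:

```python
from datetime import datetime

# JVM pattern "yyyyMMdd" (used by Spark's date_format) maps to
# Python strftime "%Y%m%d"; verify on the question's first row
sample = datetime(2012, 11, 1)
formatted = sample.strftime("%Y%m%d")
print(formatted)       # '20121101'
print(int(formatted))  # 20121101 -- the integer form the question wanted
```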