I have data like this:
+---+------+
| id| col|
+---+------+
| 1|210927|
| 2|210928|
| 3|210929|
| 4|210930|
| 5|211001|
+---+------+
I want the output like below:
+---+------+----------+
| id| col| t_date1|
+---+------+----------+
| 1|210927|27-09-2021|
| 2|210928|28-09-2021|
| 3|210929|29-09-2021|
| 4|210930|30-09-2021|
| 5|211001|01-10-2021|
+---+------+----------+
I was able to get this using pandas and strptime. Below is my code:
from datetime import datetime

pDF = df.toPandas()
valuesList = pDF['col'].to_list()
modifiedList = list()
for i in valuesList:
    # parse the yymmdd string, then reformat it as dd-mm-YYYY
    modifiedList.append(datetime.strptime(i, "%y%m%d").strftime('%d-%m-%Y'))
pDF['t_date1'] = modifiedList
df = spark.createDataFrame(pDF)
Now, the main problem is that I want to avoid using pandas and Python lists, since I will be dealing with millions or even billions of rows, and pandas slows the process down badly at that scale.
I tried various Spark functions like unix_timestamp, to_date, and to_timestamp with the format I need, but no luck, and since strptime only works on strings I can't use it directly on a column. I am not willing to create a UDF since UDFs are slow too.
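For context, one of those attempts looked roughly like this (a minimal sketch, assuming df is the Spark DataFrame shown above and col is a string column):

from pyspark.sql import functions as F

# attempt: parse the yymmdd string into a date, then reformat it as dd-MM-yyyy
df = df.withColumn(
    "t_date1",
    F.date_format(F.to_date(F.col("col"), "yyMMdd"), "dd-MM-yyyy")
)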
The main problem is identifying the exact year, which I wasn't able to do in Spark, but I want to implement this using Spark only. What needs to be changed? Where am I going wrong?
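To illustrate the year part: Python's strptime resolves a two-digit %y by pivoting (values 00-68 become 20xx, 69-99 become 19xx), and that is the resolution I need to reproduce on the Spark side. A quick sketch of that behaviour:

from datetime import datetime

# %y pivots two-digit years: 00-68 -> 2000-2068, 69-99 -> 1969-1999
print(datetime.strptime("210927", "%y%m%d").year)  # 2021
print(datetime.strptime("690927", "%y%m%d").year)  # 1969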