I have two dataframes with 10 rows.
df1.show()
+-------------------+------------------+--------+-------+
|                lat|               lon|duration|stop_id|
+-------------------+------------------+--------+-------+
|  -6.23748779296875| 106.6937255859375|     247|      0|
|  -6.23748779296875| 106.6937255859375|    2206|      1|
|  -6.23748779296875| 106.6937255859375|     609|      2|
| 0.5733972787857056|101.45503234863281|   16879|      3|
| 0.5733972787857056|101.45503234863281|    4680|      4|
| -6.851855278015137|108.64261627197266|     164|      5|
| -6.851855278015137|108.64261627197266|     220|      6|
| -6.851855278015137|108.64261627197266|    1669|      7|
|-0.9033176600933075|100.41548919677734|   30811|      8|
|-0.9033176600933075|100.41548919677734|   23404|      9|
+-------------------+------------------+--------+-------+
I would like to add the column bank_and_post from df2 to df1. df2 comes from a function:
import numpy as np
from pyspark.sql.functions import col, lit, monotonically_increasing_id, pandas_udf
from pyspark.sql.types import DoubleType

def assignPtime(x, mu, std):
    mu = mu.values[0]
    std = std.values[0]
    x1 = np.random.normal(mu, std, 100000)
    a1, b1 = np.histogram(x1, density=True)
    val = x / 60
    for k, v in enumerate(val):
        prob = 0
        for i,j in enumerate(b1[:-1]):
            v1 = b1[i]
            v2 = b1[i+1]
            if (v >= v1) and (v < v2):
                prob = a1[i]
        x[k] = prob
    return x
ff = pandas_udf(assignPtime, returnType=DoubleType())
df2 = df1.select(ff(col("duration"), lit(15), lit(15)).alias("bank_and_post"))
df2.show()
+--------------------+
|       bank_and_post|
+--------------------+
|0.021806558032484918|
|0.014366417828826784|
|0.021806558032484918|
|                 0.0|
|                 0.0|
|0.021806558032484918|
|0.021806558032484918|
|0.014366417828826784|
|                 0.0|
|                 0.0|
+--------------------+
If I try
df2 = df2.withColumn("stop_id", monotonically_increasing_id())
I get the error
ValueError: assignment destination is read-only
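One possible explanation, stated as an assumption rather than a confirmed diagnosis: the function writes its result back into its input with x[k] = prob, and the Series passed to a pandas UDF may be backed by a read-only buffer. Because Spark evaluates lazily, the error would then only surface once the query actually runs. Below is a minimal sketch of a variant that leaves x untouched and returns a new Series. The name assign_ptime_copy is hypothetical, the bin lookup uses np.searchsorted in place of the nested loops, and it was checked with plain pandas only, not under Spark.

```python
import numpy as np
import pandas as pd


def assign_ptime_copy(x: pd.Series, mu: pd.Series, std: pd.Series) -> pd.Series:
    """Hypothetical variant of assignPtime that never writes into x."""
    # mu and std arrive as constant Series (from lit(...)), so take the first value.
    mu_val = float(mu.iloc[0])
    std_val = float(std.iloc[0])

    samples = np.random.normal(mu_val, std_val, 100000)
    density, edges = np.histogram(samples, density=True)

    # Division allocates a new array, so the input buffer is only read.
    minutes = x.to_numpy(dtype=float) / 60

    # Bin i covers edges[i] <= v < edges[i + 1], the same test as the original loop.
    idx = np.searchsorted(edges, minutes, side="right") - 1
    inside = (idx >= 0) & (idx < len(density))

    # Values outside every bin keep a probability of 0, as in the original.
    out = np.zeros(len(minutes), dtype=float)
    out[inside] = density[idx[inside]]
    return pd.Series(out, index=x.index)
```

If this is indeed the cause, the function could be wrapped with pandas_udf(..., returnType=DoubleType()) exactly as before. Calling it through df1.withColumn("bank_and_post", ...) instead of df1.select(...) would also put the new column directly on df1, so no join on a generated stop_id would be needed.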