I have a Python list (p_list) of 0s and 1s, with as many elements as a Spark dataframe that has a single column (its values look like 'imaj7felb438l6hk', ...).

I am trying to add this list as a column to the Spark dataframe (df_cookie), but there is no key to join on. So far I have tried:

1) Converting df_cookie to an RDD. This doesn't work, as the dataframe is really big and I run out of memory (roughly what the sketch below shows).

2) Converting df_cookie to a pandas dataframe. This doesn't work for the same reason as 1).

3) Turning the list into a new dataframe and using monotonically_increasing_id() to get a common key and join the two. This doesn't work either, as I end up with different ids in each dataframe.
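For reference, 1) and 2) looked roughly like the following (a sketch; the exact calls may have differed, but in both cases the data ends up on the driver, which is where the memory runs out):

# 1) RDD route: collect the column and zip it with the list (blows up memory)
pairs = list(zip(cookie.rdd.map(lambda row: row[0]).collect(), p_list))

# 2) pandas route: toPandas() collects the whole dataframe to the driver
pdf = cookie.toPandas()
pdf['ind'] = p_list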

Any suggestions?

Here is the code for approach 3):

from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.types import IntegerType

# Build an index column 0..n-1 for the list side
test_list = [i for i in range(cookie.count())]
res = spark.createDataFrame(test_list, IntegerType()).toDF('ind')
df_res = res.withColumn('row', monotonically_increasing_id())
df_res.show(5)
+---+---+
|ind|row|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
+---+---+

df_cookie = cookie.withColumn('row', monotonically_increasing_id())
df_cookie.show(5)
+--------------------+-----------+
|              cookie|        row|
+--------------------+-----------+
|    imaj7felb438l6hk|68719476736|
|hk3l641k5r1m2umv2...|68719476737|
|    ims1arqgxczr6rfm|68719476738|
|2t4rlplypc1ks1hnf...|68719476739|
|17gpx1x3j5eq03dpw...|68719476740|
+--------------------+-----------+
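(For what it's worth, the large 'row' values are consistent with the monotonically_increasing_id() docs: the partition index is stored in the upper bits, so the ids are unique but not consecutive. Assuming these rows landed in partition 8:)

>>> 8 << 33   # partition id shifted into the upper 31 bits
68719476736   # exactly the first 'row' id shown above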

Desired output:

+--------------------+-----------+
|              cookie|        ind|
+--------------------+-----------+
|    imaj7felb438l6hk|          0|
|hk3l641k5r1m2umv2...|          1|
|    ims1arqgxczr6rfm|          2|
|2t4rlplypc1ks1hnf...|          3|
|17gpx1x3j5eq03dpw...|          4|
+--------------------+-----------+
  • Please provide a sample of your data, as well as the code you have tried so far & the results, otherwise it is impossible to help; see [How to create a Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) – desertnaut Nov 02 '17 at 12:18
  • 1
    I just edited my questions, including pieces of the code. – ARS Nov 02 '17 at 12:33
  • Good, but still the *desired* outcome is missing - provide an example of what exactly your desired result is – desertnaut Nov 02 '17 at 12:47
  • Hope it's enough now. – ARS Nov 02 '17 at 12:51
  • Not sure it is possible - see https://stackoverflow.com/questions/32760888/pyspark-dataframes-way-to-enumerate-without-converting-to-pandas?noredirect=1&lq=1 ; and from the [`monotonically_increasing_id` docs](http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.monotonically_increasing_id), "The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive." – desertnaut Nov 02 '17 at 13:03
  • That doesn't really help. I also tried `from pyspark.sql.window import Window; w = Window.orderBy(); indexed = cookie.withColumn("index", row_number().over(w))` (written out below), but I get the following error: `Window function row_number() requires window to be ordered, please add ORDER BY clause. For example SELECT row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table;` – ARS Nov 02 '17 at 13:13
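For readability, the attempt from that last comment written out (with the row_number import added, which the comment omits):

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

w = Window.orderBy()  # empty ordering: this is what triggers the error quoted above
indexed = cookie.withColumn("index", row_number().over(w))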

0 Answers