
I have a table in Spark with ID and numOfReq attributes. The ID values range from 1 to 100, they are not in order, and each ID can be repeated many times in the table. I want to extract the rows with IDs 1, 47, 54, and 89. I can do it with a for loop, like this pseudocode:

idList = [1, 47, 54, 89]
temp = [None] * len(idList)
for i, id in enumerate(idList):
    # one filtered DataFrame per requested ID
    temp[i] = table.filter(table['ID'] == id)

But it takes a long time to run. Is there a filter or library that does this faster? What should I use in my code? I need something in PySpark.

MHB
  • Do you want 4 different tables for IDs 1, 47, 54 and 89 respectively? Secondly, you use `id` in the `for` loop and then use `temp[i]`, but where does `i` come from? You mention that it took a lot of time, so did you try it in PySpark? – cph_sto Mar 06 '19 at 13:45
  • i is an iteration counter and that's pseudocode. Yes, I need exactly those four tables, and in PySpark it took a long time to finish. – MHB Mar 06 '19 at 14:42
  • i is not the problem; the problem is selecting those 4 tables. – MHB Mar 06 '19 at 14:44
  • Your pseudo-code looks fine though. – cph_sto Mar 06 '19 at 14:45
  • Check this; maybe it solves your problem. The logic used is quite similar, but with a dictionary instead: https://stackoverflow.com/questions/54743574/creating-multiple-pyspark-dataframes-from-a-single-dataframe/ – cph_sto Mar 06 '19 at 14:47
  • Are you looking for `table.where(table["ID"].isin(idList))`? – pault Mar 06 '19 at 15:18
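
A minimal sketch of pault's `isin` suggestion, assuming `table` is the question's DataFrame and a Spark session already exists; the four per-ID DataFrames are split out afterwards from the already reduced result:

idList = [1, 47, 54, 89]

# one filter over the whole table instead of one scan per ID;
# Column.isin builds a single IN predicate Spark can push down
filtered = table.where(table['ID'].isin(idList))

# if four separate tables are still needed, re-filter the reduced DataFrame
temp = [filtered.where(filtered['ID'] == i) for i in idList]

Because DataFrame transformations are lazy, none of these filters scan data until an action (e.g. `count()` or `show()`) is called, and the per-ID splits only re-filter the rows that already passed the `isin` predicate.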

0 Answers