pyspark dataframe foreach to fill a list

Question

I'm working in Spark 1.6.1 and Python 2.7 and I have this thing to solve:

Get a dataframe A with X rows
For each row in A, depending on a field, create one or more rows of a new dataframe B
Save that new dataframe B

The solution that I've come up right now, is to collect dataframe A, go over it, append to a list the row(s) of B and then create the dataframe B from that list.

With this solution i obviously lose all the perks of working with dataframes and I would like to use foreach, but I can't find a way to make this work. I've tried this so far:

Pass an empty list to the foreach function (this just ignores the foreach function and doesn't do anything)
Create a global variable to be use in the foreach function (complains that it can't find the list)

Does anyone has any ideas?

Thank you

----------------------EDIT:

Examples of the things I've tried:

def f(row, list):
    if row.one:
        list += [Row(type='one', field='ok')]
    else:
        list += [Row(type='one', field='ok')]
        list += [Row(type='two', field='nok')]

list = []
dfA.foreach(lambda x : f(x, list))

As I mention, this does nothing, it doesn't execute the function

And I've also tried (which list defined at the beginning of the class):

global list
def f(row):
    if row.one:
        list += [Row(type='one', field='ok')]
    else:
        list += [Row(type='one', field='ok')]
        list += [Row(type='two', field='nok')]

dfA.foreach(list)

---------EDIT 2:

What I'm doing right now is:

    list = []
    for row in dfA.collect():
        string = re.search(a_regex, row['raw'])
        if string:
            dates = re.findall(date_regex, string.group())
            for date in dates:
                date_string = datetime.strptime(date, '%Y-%m-%d').date()
                list += [Row(event_type='1', event_date=date_string)]

        b_string = re.search(b_regex, row['raw'])
        if b_string:
            dates = re.findall(date_regex, b_string.group())
            for date in dates:
                scheduled_to = datetime.strptime(date, '%Y-%m-%d').date()
                list += [Row(event_type='2', event_date= date_string)]

and then:

dfB = self._sql_context.createDataFrame(list)

dfA is given by other process, I can't change it and i know it's a very stupid way of using dataframes but I can't do anything about that

--------------------EDIT3: dfA.raw sample:

{"new":[],"removed":[{"start":"2018-03-10","end":"2018-03-16","scheduled_by_system":null}]}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
{"new":[{"start":"2018-03-10","end":"2018-03-16","scheduled_by_system":null}],"removed":[]}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
{"new":[{"start":"2017-01-28","end":"2017-02-03"},{"start":"2017-02-04","end":"2017-02-10"},{"start":"2017-02-11","end":"2017-02-17"},{"start":"2017-02-18","end":"2017-02-24"},{"start":"2017-03-04","end":"2017-03-10"},{"start":"2017-03-11","end":"2017-03-17"},{"start":"2017-03-18","end":"2017-03-24"},{"start":"2017-09-02","end":"2017-09-08"},{"start":"2017-09-16","end":"2017-09-22"},{"start":"2017-09-23","end":"2017-09-29"},{"start":"2017-09-30","end":"2017-10-06"},{"start":"2017-10-07","end":"2017-10-13"},{"start":"2017-12-02","end":"2017-12-08"},{"start":"2017-12-09","end":"2017-12-15"},{"start":"2017-12-16","end":"2017-12-22"},{"start":"2017-12-23","end":"2017-12-29"},{"start":"2018-01-06","end":"2018-01-12"}],"removed":[{"start":"2017-02-04","end":"2017-02-10"},{"start":"2017-02-11","end":"2017-02-17"},{"start":"2017-02-18","end":"2017-02-24"},{"start":"2017-03-04","end":"2017-03-10"},{"start":"2017-03-11","end":"2017-03-17"},{"start":"2017-03-18","end":"2017-03-24"},{"start":"2017-01-28","end":"2017-02-03"},{"start":"2017-09-16","end":"2017-09-22"},{"start":"2017-09-02","end":"2017-09-08"},{"start":"2017-09-30","end":"2017-10-06"},{"start":"2017-10-07","end":"2017-10-13"},{"start":"2017-09-23","end":"2017-09-29"},{"start":"2017-12-16","end":"2017-12-22"},{"start":"2017-12-23","end":"2017-12-29"},{"start":"2018-01-06","end":"2018-01-12"},{"start":"2017-12-09","end":"2017-12-15"},{"start":"2017-12-02","end":"2017-12-08"},{"start":"2018-02-10","end":"2018-02-16"}]}|

and the regex:

a_regex = r'\"new\":{(.*?)}{2}|\"new\":\[(.*?)\]'
b_regex = r'\"removed\":{(.*?)}{2}|removed\":\[(.*?)\]'
date_regex = r'\"start\":\"(\d{4}-\d{2}-\d{2})\"'

dfA.select('raw').show(2,False)

+-------------------------------------------------------------------------------------------------------+
|raw                                                                                                    |
+-------------------------------------------------------------------------------------------------------+
|{"new":[{"start":"2018-03-24","end":"2018-03-30","scheduled_by_system":null}],"removed":[]}|
|{"new":[{"start":"2018-03-10","end":"2018-03-16","scheduled_by_system":null}],"removed":[]}|
+-------------------------------------------------------------------------------------------------------+
only showing top 2 rows

df.select('raw').printSchema()

root
 |-- raw: string (nullable = true)

please share what you've tried, sample input and expected output too. — Ramesh Maharjan, Feb 26 '18 at 11:10
what is the condition to get the X rows from A. Can you post with sample input and expected output? — Ramesh Maharjan, Feb 26 '18 at 11:52
also explain `For each row in A, depending on a field, create one or more rows of a new dataframe B` — Ramesh Maharjan, Feb 26 '18 at 12:03
hi ramesh, first of all, let me paste what i have right now so you see what i'm trying to do: — frm, Feb 26 '18 at 12:09
now put some sample input data or dataframe, explain the logic as you already have done in the question and finally put the expected dataframe. If you do that you are certain to get an answer not only from me but from others too. — Ramesh Maharjan, Feb 26 '18 at 12:23
You can just include the `raw` column from dfA. and what are `a_regex`, `date_regex` and `b_regex`? — Ramesh Maharjan, Feb 26 '18 at 12:28
dfA.raw is a stringType right? just do `dfA.select('raq').show(2, False)` and `dfA.select('raw').printSchema()` and copy paste them as well. we are very close now :) — Ramesh Maharjan, Feb 26 '18 at 12:52
i've added that info, it's kind of the same as the examples i put before — frm, Feb 26 '18 at 13:21

Ramesh Maharjan · Accepted Answer · 2018-02-26T18:50:52.090

You would need to write a udf function to return the event_type and event_date strings after you have selected the required raw column.

import re
def searchUdf(regex, dateRegex, x):
    list_return = []
    string = re.search(regex, x)
    if string:
        dates = re.findall(dateRegex, string.group())
        for date in dates:
            date_string = datetime.strptime(date, '%Y-%m-%d').date()
            list_return.append(date_string)
    return list_return

from pyspark.sql import functions as F
udfFunctionCall = F.udf(searchUdf, T.ArrayType(T.DateType()))

The udf function would parse the raw column string with the regex and dateRegex passed as arguments and return eventType and data_string as arrayType column

You should be calling the udf function defined and filter out the empty rows and then separate the columns as event_type and event_date columns

df = df.select("raw")
adf = df.select(F.lit(1).alias("event_type"), udfFunctionCall(F.lit(a_regex), F.lit(date_regex), df.raw).alias("event_date"))\
    .filter(F.size(F.col("event_date")) > 0)

bdf = df.select(F.lit(2).alias("event_type"), udfFunctionCall(F.lit(a_regex), F.lit(date_regex), df.raw).alias("event_date")) \
    .filter(F.size(F.col("event_date")) > 0)

The regex used are provided in the question as

a_regex = r'\"new\":{(.*?)}{2}|\"new\":\[(.*?)\]'
b_regex = r'\"removed\":{(.*?)}{2}|removed\":\[(.*?)\]'
date_regex = r'\"start\":\"(\d{4}-\d{2}-\d{2})\"'

Now that you have two dataframes for both event_type, final step is to merge them together

adf.unionAll(bdf)

And thats it. Your confusion is all solved.

With the following raw column

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|raw                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"new":[],"removed":[{"start":"2018-03-10","end":"2018-03-16","scheduled_by_system":null}]}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
|{"new":[{"start":"2018-03-10","end":"2018-03-16","scheduled_by_system":null}],"removed":[]}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
|{"new":[{"start":"2017-01-28","end":"2017-02-03"},{"start":"2017-02-04","end":"2017-02-10"},{"start":"2017-02-11","end":"2017-02-17"},{"start":"2017-02-18","end":"2017-02-24"},{"start":"2017-03-04","end":"2017-03-10"},{"start":"2017-03-11","end":"2017-03-17"},{"start":"2017-03-18","end":"2017-03-24"},{"start":"2017-09-02","end":"2017-09-08"},{"start":"2017-09-16","end":"2017-09-22"},{"start":"2017-09-23","end":"2017-09-29"},{"start":"2017-09-30","end":"2017-10-06"},{"start":"2017-10-07","end":"2017-10-13"},{"start":"2017-12-02","end":"2017-12-08"},{"start":"2017-12-09","end":"2017-12-15"},{"start":"2017-12-16","end":"2017-12-22"},{"start":"2017-12-23","end":"2017-12-29"},{"start":"2018-01-06","end":"2018-01-12"}],"removed":[{"start":"2017-02-04","end":"2017-02-10"},{"start":"2017-02-11","end":"2017-02-17"},{"start":"2017-02-18","end":"2017-02-24"},{"start":"2017-03-04","end":"2017-03-10"},{"start":"2017-03-11","end":"2017-03-17"},{"start":"2017-03-18","end":"2017-03-24"},{"start":"2017-01-28","end":"2017-02-03"},{"start":"2017-09-16","end":"2017-09-22"},{"start":"2017-09-02","end":"2017-09-08"},{"start":"2017-09-30","end":"2017-10-06"},{"start":"2017-10-07","end":"2017-10-13"},{"start":"2017-09-23","end":"2017-09-29"},{"start":"2017-12-16","end":"2017-12-22"},{"start":"2017-12-23","end":"2017-12-29"},{"start":"2018-01-06","end":"2018-01-12"},{"start":"2017-12-09","end":"2017-12-15"},{"start":"2017-12-02","end":"2017-12-08"},{"start":"2018-02-10","end":"2018-02-16"}]}|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

You should be getting

+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|event_type|event_date                                                                                                                                                                                                  |
+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1         |[2018-03-10]                                                                                                                                                                                                |
|1         |[2017-01-28, 2017-02-04, 2017-02-11, 2017-02-18, 2017-03-04, 2017-03-11, 2017-03-18, 2017-09-02, 2017-09-16, 2017-09-23, 2017-09-30, 2017-10-07, 2017-12-02, 2017-12-09, 2017-12-16, 2017-12-23, 2018-01-06]|
|2         |[2018-03-10]                                                                                                                                                                                                |
|2         |[2017-01-28, 2017-02-04, 2017-02-11, 2017-02-18, 2017-03-04, 2017-03-11, 2017-03-18, 2017-09-02, 2017-09-16, 2017-09-23, 2017-09-30, 2017-10-07, 2017-12-02, 2017-12-09, 2017-12-16, 2017-12-23, 2018-01-06]|
+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

hey Ramesh, that looks very good... i'm going to try it with all the cases (new but not removed, removed but no new and a lot of news and a lot of removes) to see how it works, and i'll let you know (and accept the answer if everything goes well :) ) a lot of thanks for the huge help — frm, Feb 26 '18 at 14:57
Hi Ramesh, I'm sorry but it doesn't work u_u. For this input: {"new":[],"removed":[{"start":"2018-02-24","end":"2018-03-02"},{"start":"2018-03-03","end":"2018-03-09"},{"start":"2018-03-10","end":"2018-03-16"},{"start":"2018-03-17","end":"2018-03-23"},{"start":"2018-03-24","end":"2018-03-30"}]} it only gets the first one, as far as i know, **filter** will get the rows with at least one entry, but reviewing the code you sent, only gets the first one... am i missing something? why is not working? thanks! — frm, Feb 26 '18 at 16:30
can you explain by you saying `it only gets the first one` ? — Ramesh Maharjan, Feb 26 '18 at 16:47
because i only get a df with one row that has event_type = 2 and event_date=2018-02-24 (which is the first row) — frm, Feb 26 '18 at 16:52
I have updated my answer with input dataframe before the udf function, filter, select and union and output dataframe after applying all the functions. Please comment if the output is not what is required. — Ramesh Maharjan, Feb 26 '18 at 17:06
with that, you get from three input, four output dataframes, one for the first, one for the second, and two for the third, but the third has a lot of starts and ends, and we should be getting a row for each of them — frm, Feb 26 '18 at 17:11
I have updated the answer :) its more concise and I guess I got it correct this time. :) — Ramesh Maharjan, Feb 26 '18 at 18:51
hi Ramesh, with that solution the output is: event_type - array of dates, what i need is a row for each date, so if for one event_type we have more than one date, have it in different rows, not in the same as an array :) — frm, Feb 27 '18 at 08:39
just use explode function on event_date and you should be fine — Ramesh Maharjan, Feb 27 '18 at 08:57
yes i was just doing that just now hahha, great, thanks a lot for the help Ramesh! — frm, Feb 27 '18 at 09:06
great feeling to have helped you :) and thanks for the acceptance and upvote :) — Ramesh Maharjan, Feb 27 '18 at 09:07
not just help but also taught me a couple of tricks for the future :) — frm, Feb 27 '18 at 09:15

pyspark dataframe foreach to fill a list

1 Answers1