
I'm still at a very early stage of learning Python, so apologies in advance if this question sounds stupid.

I have a set of data (in table format) that I want to add a few calculated columns to. Basically, I have an event lon/lat and a destination lon/lat, each with a timestamp, and I'm calculating the average velocity between each pair.

Sample data looks like this:

print(data_all.head(3))

   id    lon_evnt   lat_evnt          event_time  \
0   1 -179.942833  41.012467 2017-12-13 21:17:54   
1   2 -177.552817  41.416400 2017-12-14 03:16:00   
2   3 -175.096567  41.403650 2017-12-14 09:14:06   

  dest_data_generate_time   lat_dest    lon_dest  \
0 2017-12-13 22:33:37.980  37.798599 -121.292193   
1 2017-12-14 04:33:44.393  37.798599 -121.292193   
2 2017-12-14 10:33:51.629  37.798599 -121.292193  

                             address_fields_dest  \
0  {'address': 'Nestle Way', 'city': 'Lathrop...      
1  {'address': 'Nestle Way', 'city': 'Lathrop...      
2  {'address': 'Nestle Way', 'city': 'Lathrop...      

I then zipped the lon/lat pairs together:

data_all['ping_location'] = list(zip(data_all.lon_evnt, data_all.lat_evnt))
data_all['destination'] = list(zip(data_all.lon_dest, data_all.lat_dest))

Then, for each row, I want to calculate the distance between the location ping and the destination, grab some address info from a string (basically taking a substring), and then calculate the velocity:

for idx, row in data_all.iterrows():
    dist = gcd.dist(row['destination'], row['ping_location'])
    data_all.loc[idx, 'gc_distance'] = dist

    temp_idx = str(row['address_fields_dest']).find(":")
    pos_start = temp_idx + 3
    pos_end = str(row['address_fields_dest']).find(",") - 2

    data_all.loc[idx, 'destination address'] = str(row['address_fields_dest'])[pos_start:pos_end]

    ##### calculate velocity which is: v = d/t
    ## time is the difference btwn destination time and the ping creation time
    timediff = abs(row['dest_data_generate_time'] - row['event_time'])
    data_all.loc[idx, 'velocity km/hr'] = 0

    ## check if the time diff btwn destination and event ping is more than a minute long
    if timediff > datetime.timedelta(minutes=1):
        data_all.loc[idx, 'velocity km/hr'] = dist / timediff.total_seconds() * 3600.0

OK, now: this program took almost 7 hours to execute on 333k rows of data! :( I'm on Windows 10 with a 2-core CPU and 16 GB of RAM, which is not much, but 7 hours is definitely not OK :(

How can I make the program run more efficiently? One thought: since each row's calculations are independent of the others, I could take advantage of parallel processing.

I've read many posts, but most of the parallel-processing methods presented seem to assume you're applying a single simple function; here I'm adding multiple new columns.

Any help is really appreciated! Or tell me that it's impossible to make pandas do parallel processing (I believe I've read that somewhere, but I'm not completely sure it's 100% true).

Sample posts I've read:

Large Pandas Dataframe parallel processing

python pandas dataframe to dictionary

How do I parallelize a simple Python loop?

How to do parallel programming in Python

and a lot more that are not on stackoverflow....

https://medium.com/@ageitgey/quick-tip-speed-up-your-python-data-processing-scripts-with-process-pools-cf275350163a

https://homes.cs.washington.edu/~jmschr/lectures/Parallel_Processing_in_Python.html

alwaysaskingquestions
  • Possible duplicate of [How do I parallelize a simple Python loop?](https://stackoverflow.com/questions/9786102/how-do-i-parallelize-a-simple-python-loop) – Daniel Feb 22 '18 at 18:36
  • What's killing you is using `data_all.loc[...] = ` a bunch of times in a for-loop. Don't do that. These operations here could easily be achieved with vectorized operations. – juanpa.arrivillaga Feb 22 '18 at 18:37
  • @dangom hi, I already looked into that post... but like I said, that's more for a simple one-function call, not creating multiple new columns. But if you could help me understand how I can apply that, that'd be great as well! Thank you! – alwaysaskingquestions Feb 22 '18 at 18:41
  • @juanpa.arrivillaga thanks for pointing it out! However, I have an if statement to check if the time diff is greater than 1 min, then assign a specific flag. How can I use a vectorized operation for that? Also, I'll need to grab the location data, which is a substring with different start/end positions per row. How can I vectorize that operation? Would really appreciate it if you could provide some pointers! Thank you again! – alwaysaskingquestions Feb 22 '18 at 18:42
  • `df.loc[abs(df['dest_data_generate_time'] - df['event_time']) > datetime.timedelta(minutes=1), 'velocity km/hr'] = ....` – juanpa.arrivillaga Feb 22 '18 at 18:44
  • As for vectorized string operations: `df['some_column'].str.find`, for example. You can also vectorize slicing operations: `df['some_column'].str[pos_start:pos_end]` – juanpa.arrivillaga Feb 22 '18 at 18:53
  • @juanpa.arrivillaga but the pos_start and pos_end could be different for each row. How do I define those for each row without adding new columns to the df? – alwaysaskingquestions Feb 22 '18 at 18:56
  • Although, I would just totally rethink whatever is going on with `address_fields_dest`... that looks like a `str` representation of a `dict`. How did that get in there in the first place? – juanpa.arrivillaga Feb 22 '18 at 18:56
  • @alwaysaskingquestions the **same way**. `temp_idx = df['address_fields_dest'].str.find(':')` ... – juanpa.arrivillaga Feb 22 '18 at 18:56
  • Anyway, even if you simply accumulated your results into a `list` and did one big `df.loc[...] = ....` at the end, it would be **much faster**. – juanpa.arrivillaga Feb 22 '18 at 18:59
  • Thanks a lot for the advice! I'll think about this a bit more. I think you are right, I should take advantage of the vectorized operations offered by pandas. – alwaysaskingquestions Feb 22 '18 at 18:59
  • @alwaysaskingquestions what you should *definitely* not do is `df.loc[...] = ...` a bunch of times in a loop to set a bunch of single rows. This will **always** be slow, and your first step should be fixing that rather than parallelizing that broken implementation. – juanpa.arrivillaga Feb 22 '18 at 19:00
  • Got it! Thanks so much for your valuable advice. I'm going to change up my script; I was not aware that could cause so much of an issue. – alwaysaskingquestions Feb 22 '18 at 19:02
  • Also, again, *why are you converting a `dict` into a `str` to split on `':'` to grab a value in the `dict`*? Don't do that. You probably shouldn't have `dict` objects in your dataframe to begin with, but if you do, you might as well use them. – juanpa.arrivillaga Feb 22 '18 at 19:03
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/165671/discussion-between-alwaysaskingquestions-and-juanpa-arrivillaga). – alwaysaskingquestions Feb 22 '18 at 19:04
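
For reference, here is a minimal sketch of the vectorized approach these comments describe. It assumes `gcd.dist` is the asker's great-circle helper and that `address_fields_dest` holds actual dicts rather than their string representations (if they are strings, `ast.literal_eval` can recover the dicts first):

import datetime

# build the distances in one pass as a plain list, then assign the column once
data_all['gc_distance'] = [
    gcd.dist(dest, ping)
    for dest, ping in zip(data_all['destination'], data_all['ping_location'])
]

# read the value out of the dict instead of slicing its string form
data_all['destination address'] = data_all['address_fields_dest'].map(
    lambda d: d['address'])

# velocity: one boolean mask, one vectorized assignment
timediff = (data_all['dest_data_generate_time'] - data_all['event_time']).abs()
over_a_minute = timediff > datetime.timedelta(minutes=1)
data_all['velocity km/hr'] = 0.0
data_all.loc[over_a_minute, 'velocity km/hr'] = (
    data_all.loc[over_a_minute, 'gc_distance']
    / timediff[over_a_minute].dt.total_seconds() * 3600.0
)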

1 Answer


Here is a quick solution. I didn't try to optimize your code at all, just fed it into a multiprocessing pool. This runs your function on each row individually, returns the row with the new properties, and builds a new dataframe from the output.

import datetime
import multiprocessing as mp

import pandas as pd

def func(arg):
    idx, row = arg

    dist = gcd.dist(row['destination'], row['ping_location'])
    row['gc_distance'] = dist

    temp_idx = str(row['address_fields_dest']).find(":")
    pos_start = temp_idx + 3
    pos_end = str(row['address_fields_dest']).find(",") - 2

    row['destination address'] = str(row['address_fields_dest'])[pos_start:pos_end]

    ##### calculate velocity which is: v = d/t
    ## time is the difference btwn destination time and the ping creation time
    timediff = abs(row['dest_data_generate_time'] - row['event_time'])
    row['velocity km/hr'] = 0

    ## check if the time diff btwn destination and event ping is more than a minute long
    if timediff > datetime.timedelta(minutes=1):
        row['velocity km/hr'] = dist / timediff.total_seconds() * 3600.0

    return row

if __name__ == '__main__':
    pool = mp.Pool(processes=mp.cpu_count())
    new_rows = pool.map(func, [(idx, row) for idx, row in data_all.iterrows()])
    # each worker returns the row as a Series; stack them back into a dataframe
    data_all_new = pd.DataFrame(new_rows)
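
Two caveats on this approach: on Windows, `multiprocessing` starts workers by re-importing the script, so the pool creation and the `map` call need to sit under the `if __name__ == '__main__':` guard shown above; and since `pool.map` pickles every row on its way to and from the workers, passing a `chunksize` argument (e.g. `pool.map(func, args, chunksize=1000)`) reduces the per-row overhead considerably on 333k rows.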
Sevy