How to change the iterrows method to apply

Question

I have this code, which runs over a DataFrame of around 60k rows. It takes around 4 hours to complete the whole process. That is not feasible, so I want to use apply instead of iterrows because of time constraints.

Here is the code:

all_merged_k = pd.DataFrame(columns=all_merged_f.columns)
for index, row in all_merged_f.iterrows():
    if (row['route_count'] == 0):
        all_merged_k = all_merged_k.append(row)
    else:
        for i in range(row['route_count']):
            row1 = row.copy()
            row['Route Number'] = i
            row['Route_Broken'] = row1['routes'][i]
            all_merged_k = all_merged_k.append(row)

Basically, if the route count is 0, the code appends the row unchanged. Otherwise, it appends as many rows as the route count, with all values the same except for the routes column: since that column contains a nested list, the list is broken up across the new rows. The pieces go into new columns called Route_Broken and Route Number.

Sample of data:

               routes  route_count
          [[CHN-IND]]            1
[[CHN-IND],[IND-KOR]]            2

Expected output:

               routes  route_count  Broken_Route  Route Number
          [[CHN-IND]]            1     [CHN-IND]             1
[[CHN-IND],[IND-KOR]]            2     [CHN-IND]             1
[[CHN-IND],[IND-KOR]]            2     [IND-KOR]             2

Is this possible using apply? 4 hours is far too long, and the code can't be put into production like this. Please help.

Update: the code below doesn't work in my case

df.join(df['routes'].explode().rename('Broken_Route')) \
      .assign(**{'Route Number': lambda x: x.groupby(level=0).cumcount().add(1)})

or

(df.assign(Broken_Route=df['routes'],
           count=df['routes'].str.len().apply(range))
   .explode(['Broken_Route', 'count'])
)

It doesn't work when index values match: in the last row, Route Number should be 1.

Is the length of the list in `'routes'` always equal to the value in `route_count`? (It would seem so, but just to verify.) — 9769953, Jan 06 '22 at 09:04
@mozway can you see it? It won't be visible until approved, but this is how the data looks: routes: [[CHN-IND]], [[CHN-IND],[IND-KOR]]; route count: 1, 2 — freak7, Jan 06 '22 at 09:05
@freak7 Why are you editing and answering on simpleboi's question? — 9769953, Jan 06 '22 at 09:06
Is the content lists or text? See [good reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) ;) — mozway, Jan 06 '22 at 09:13
You are appending to a dataframe on every loop. This is actually a slow operation because the dataframe is practically recreated on every iteration. Instead, it is more efficient to store everything in a list, then create the dataframe itself at the very end. Refer to https://stackoverflow.com/questions/27929472/improve-row-append-performance-on-pandas-dataframes for more info. — tnwei, Jan 06 '22 at 09:07
@simpleboi easy fix: temporarily move your index to a column with `reset_index`, perform the transform, then set it as the index again with `set_index`. — mozway, Jan 08 '22 at 19:57
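
tnwei's comment above (collect the rows in a list, build the DataFrame once at the end) could be sketched roughly as follows. This is only an illustration: the sample data is made up, an empty list is assumed for rows with `route_count == 0`, the new column is named `Broken_Route` as in the expected output, and numbering starts at 1 to match that output rather than the `range`-based loop in the question.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's columns.
df = pd.DataFrame({
    "routes": [[["CHN-IND"]], [["CHN-IND"], ["IND-KOR"]], []],
    "route_count": [1, 2, 0],
})

records = []  # plain Python list; appending to it is cheap
for row in df.to_dict("records"):
    if row["route_count"] == 0:
        records.append(row)  # keep the row unchanged
        continue
    for i, route in enumerate(row["routes"], start=1):
        # Copy the row's values and add the two new columns.
        records.append({**row, "Broken_Route": route, "Route Number": i})

# Build the DataFrame once, at the very end.
result = pd.DataFrame(records)
print(result)
```

Rows with a zero route count end up with missing values in the two new columns. This still loops in Python, so the `explode`-based answers below are likely faster on 60k rows, but it avoids recreating the DataFrame on every iteration.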

Corralien · Accepted Answer · 2022-01-06T09:27:51.830

3

Are you expecting something like this?

>>> df.join(df['routes'].explode().rename('Broken_Route')) \
      .assign(**{'Route Number': lambda x: x.groupby(level=0).cumcount().add(1)})

                   routes  route_count Broken_Route  Route Number
0             [[CHN-IND]]            1    [CHN-IND]             1
1  [[CHN-IND], [IND-KOR]]            2    [CHN-IND]             1
1  [[CHN-IND], [IND-KOR]]            2    [IND-KOR]             2
2                                    0                          1

Setup:

data = {'routes': [[['CHN-IND']], [['CHN-IND'], ['IND-KOR']], ''], 
        'route_count': [1, 2, 0]}
df = pd.DataFrame(data)

Update 1: added a record with route_count=0 and routes=''.

edited Jan 06 '22 at 09:27

answered Jan 06 '22 at 09:10


[lazy mode] do you have the constructor to share?[/lazy mode] :p – mozway Jan 06 '22 at 09:13

Suppose: `{'routes': [[['CHN-IND']], [['CHN-IND'], ['IND-KOR']]], 'route_count': [1, 2]}` – Corralien Jan 06 '22 at 09:14
@simpleboi. Can you check my update please? – Corralien Jan 06 '22 at 09:16
@Corralien checking. – simpleboi Jan 06 '22 at 09:17
@freak7. When `route_count == 0`, What is the value of `routes`? An empty list, nan? – Corralien Jan 06 '22 at 09:18
I'd have used more or less the same (with `assign` in place of `join`) +1. I provided an alternative for fun – mozway Jan 06 '22 at 09:22
@Corralien in that case it will be empty – simpleboi Jan 06 '22 at 09:22
@mozway. It's a great answer too. +1. – Corralien Jan 06 '22 at 09:25
@Corralien but likely less efficient, I find it fun though ;) – mozway Jan 06 '22 at 09:25
@simpleboi. Is it faster? :) – Corralien Jan 06 '22 at 09:29
When I looked at this thoroughly, I saw that rows sharing the same index have different values, so this gives the wrong route count. So level=0 is not a good idea here; also, I have more than 10 columns. – simpleboi Jan 08 '22 at 19:17
@Corralien please suggest a fix – simpleboi Jan 08 '22 at 19:17
@Corralien can you please check update 1? – simpleboi Jan 08 '22 at 19:30
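
For the duplicated-index case raised in the last comments, mozway's `reset_index` / `set_index` suggestion under the question might look roughly like this. It is a sketch under assumptions: the data and the helper column name `orig_index` are made up for illustration, and `routes` is assumed to hold real lists.

```python
import pandas as pd

# Hypothetical data where the index label 0 appears twice.
df = pd.DataFrame(
    {
        "routes": [[["CHN-IND"]], [["CHN-IND"], ["IND-KOR"]], [["KOR-JPN"]]],
        "route_count": [1, 2, 1],
    },
    index=[0, 0, 1],
)

# Move the original index into a column so every row gets a unique label.
tmp = df.rename_axis("orig_index").reset_index()

# Same join/explode/cumcount as in the answer, now on the unique index.
exploded = tmp["routes"].explode().rename("Broken_Route")
out = tmp.join(exploded)
out = out.assign(**{"Route Number": out.groupby(level=0).cumcount().add(1)})

# Restore the original (possibly duplicated) index.
out = out.set_index("orig_index").rename_axis(None)
print(out)
```

With a unique temporary index, `groupby(level=0)` counts within each original row instead of within each repeated label, and the join no longer pairs routes from different rows that happen to share a label.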

score 3 · Answer 2 · answered Jan 06 '22 at 09:19

You can assign the routes and counts and explode:

(df.assign(Broken_Route=df['routes'],
           count=df['routes'].str.len().apply(range))
   .explode(['Broken_Route', 'count'])
)

NB. multi-column explode requires pandas ≥1.3.0, if older use this method

output:

                   routes  route_count Broken_Route count
0             [[CHN-IND]]            1    [CHN-IND]     0
1  [[CHN-IND], [IND-KOR]]            2    [CHN-IND]     0
1  [[CHN-IND], [IND-KOR]]            2    [IND-KOR]     1

mozway

Thanks a lot, it's really helpful too. – simpleboi Jan 06 '22 at 09:29
I was looking into this code as well. Rows sharing the same index have different values, so it gives the wrong route count. Is there a fix for this? – simpleboi Jan 08 '22 at 19:15
