2

I have this code, in which I have rows around 60k. It taking around 4 hrs to complete the whole process. This code is not feasible and want to use apply instead iterrow because of time constraints.

Here is the code,

all_merged_k = pd.DataFrame(columns=all_merged_f.columns)
for index, row in all_merged_f.iterrows():
    if (row['route_count'] == 0):
        all_merged_k = all_merged_k.append(row)
    else:
        for i in range(row['route_count']):
            row1 = row.copy()
            row['Route Number'] = i
            row['Route_Broken'] = row1['routes'][i]
            all_merged_k = all_merged_k.append(row)

Basically, what the code is doing is that if the route count is 0 then append the same row, if not then whatever the number of counts is it will append that number of rows with all same value except the routes column (as it contains nested list) so breaking them in multiple rows. And adding them in new columns called Route_Broken and Route Number.

Sample of data:

               routes  route_count
          [[CHN-IND]]            1
[[CHN-IND],[IND-KOR]]            2

O/P data:

               routes  route_count  Broken_Route Route Number
          [[CHN-IND]]            1   [CHN-IND]       1
[[CHN-IND],[IND-KOR]]            2   [CHN-IND]       1
[[CHN-IND],[IND-KOR]]            2   [IND-KOR]       2

Can it be possible using apply because 4 hrs is very high and cant be put into production. I need extreme help. Pls help me.

So below code doesn't work

df.join(df['routes'].explode().rename('Broken_Route')) \
      .assign(**{'Route Number': lambda x: x.groupby(level=0).cumcount().add(1)})

or

(df.assign(Broken_Route=df['routes'],
           count=df['routes'].str.len().apply(range))
   .explode(['Broken_Route', 'count'])
)

It doesn't working if the index matches, we can see the last row, Route Number should be 1

simpleboi
  • 101
  • 6
  • Is the length of the list in `'routes` always equal to the value in `route_count`? (It would seem so, but just to verify.) – 9769953 Jan 06 '22 at 09:04
  • @mozway can you see it?, it will get visible until approved, but below is how data looks like routes [[CHN-IND]] [[CHN-IND],[IND-KOR]] route count: 1 2 – freak7 Jan 06 '22 at 09:05
  • @9769953 yes it always matches – freak7 Jan 06 '22 at 09:05
  • @freak7 Why are you editing and answering on simpleboi's question? – 9769953 Jan 06 '22 at 09:06
  • @mozway added the o/p – simpleboi Jan 06 '22 at 09:11
  • is the content lists or text? [good reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) ;) – mozway Jan 06 '22 at 09:13
  • @mozway it is list of lists, nested list – simpleboi Jan 06 '22 at 09:14
  • 1
    let us know of efficient it is in real-life ;) – mozway Jan 06 '22 at 09:29
  • You are appending to a dataframe on every loop. This is actually a slow operation because the dataframe is practically recreated on every iteration. Instead, it is more efficient to store everything in a list, then create the dataframe itself at the very end. Refer to https://stackoverflow.com/questions/27929472/improve-row-append-performance-on-pandas-dataframes for more info. – tnwei Jan 06 '22 at 09:07
  • @mozway can you please check the update 1 – simpleboi Jan 08 '22 at 19:30
  • 1
    @simpleboi easy fix, temporarily move you index as column with `reset_index`, perform the transform, set again as index with `set_index`. – mozway Jan 08 '22 at 19:57

2 Answers2

3

Are you expect something like that:

>>> df.join(df['routes'].explode().rename('Broken_Route')) \
      .assign(**{'Route Number': lambda x: x.groupby(level=0).cumcount().add(1)})

                   routes  route_count Broken_Route  Route Number
0             [[CHN-IND]]            1    [CHN-IND]             1
1  [[CHN-IND], [IND-KOR]]            2    [CHN-IND]             1
1  [[CHN-IND], [IND-KOR]]            2    [IND-KOR]             2
2                                    0                          1

Setup:

data = {'routes': [[['CHN-IND']], [['CHN-IND'], ['IND-KOR']], ''], 
        'route_count': [1, 2, 0]}
df = pd.DataFrame(data)

Update 1: added a record with route_count=0 and routes=''.

Corralien
  • 109,409
  • 8
  • 28
  • 52
3

You can assign the routes and counts and explode:

(df.assign(Broken_Route=df['routes'],
           count=df['routes'].str.len().apply(range))
   .explode(['Broken_Route', 'count'])
)

NB. multi-column explode requires pandas ≥1.3.0, if older use this method

output:

                   routes  route_count Broken_Route count
0             [[CHN-IND]]            1    [CHN-IND]     0
1  [[CHN-IND], [IND-KOR]]            2    [CHN-IND]     0
1  [[CHN-IND], [IND-KOR]]            2    [IND-KOR]     1
mozway
  • 194,879
  • 13
  • 39
  • 75