I have a DataFrame `routes` with the following structure:

      id                                             nodes                            traveltimes
0   id-1                                  [node-A, node-B]                                  [6.0]
1   id-2                  [node-A, node-C, node-D, node-E]                      [4.0, 80.0, 38.0]
2   id-3                                  [node-B, node-D]                                 [90.0]
3   id-4                                          [node-A]                                     []
4   id-5  [node-A, node-B, node-C, node-D, node-E, node-D]         [35.0, 30.0, 110.0, 20.0, 5.0]
..                                                 ...                                    ...

The lists in the nodes column are the nodes of a graph, and the values in the traveltimes column are the travel times between consecutive nodes. Each row corresponds to a route in the graph.

I want to split my routes wherever a travel time reaches or exceeds a threshold value. For example, with a threshold of 70, I want to get the following result:

      id     route_id                            nodes                            traveltimes
0     id-1          0                 [node-A, node-B]                                  [6.0]
1     id-2          0                 [node-A, node-C]                                  [4.0]        
2     id-2          1                 [node-D, node-E]                                 [38.0]
3     id-3          0                         [node-B]                                     []
4     id-3          1                         [node-D]                                     []
5     id-4          0                         [node-A]                                     []
6     id-5          0         [node-A, node-B, node-C]                           [35.0, 30.0]
7     id-5          1         [node-D, node-E, node-D]                            [20.0, 5.0]
..                                                 ...                                    ...

I wrote the following code, which does what I want but inefficiently.

I have a function that splits a single route:

import pandas as pd

def split_routes(row):
    newrow = row.copy()

    threshold = 70

    nodes = newrow['nodes']
    traveltimes = newrow['traveltimes']

    rows = []
    route_id = 0
    route_nodes = [nodes[0]]    # a route always starts at its first node
    route_traveltimes = []

    for i in range(1, len(nodes)):
        # traveltimes[i-1] is the time between nodes[i-1] and nodes[i]
        if traveltimes[i-1] < threshold:
            route_traveltimes.append(traveltimes[i-1])
            route_nodes.append(nodes[i])
        else:
            # Route route_id completed, starting a new one
            newrow['route_id'] = route_id
            newrow['nodes'] = route_nodes
            newrow['traveltimes'] = route_traveltimes
            rows.append(newrow)

            newrow = row.copy()
            route_nodes = [nodes[i]]
            route_traveltimes = []
            route_id += 1

    # Last route completed
    newrow['route_id'] = route_id
    newrow['nodes'] = route_nodes
    newrow['traveltimes'] = route_traveltimes
    rows.append(newrow)

    # One small DataFrame per input row
    return pd.DataFrame(rows)

And this is how I use it:

splitted_routes_array = []

for index, row in routes.iterrows():    # Inefficient loop
    splitted_routes_array.append(split_routes(row))

splitted_routes = pd.concat(splitted_routes_array).reset_index(drop=True)

I suspect this can be done much more efficiently without iterating over the rows myself, but I couldn't figure out how to use apply to return multiple rows and columns at the same time.

Can someone give me some hints on that?

Nakeuh
  • This could help you https://stackoverflow.com/a/35208597/1491350 – Ashutosh Dubey Feb 19 '20 at 11:12
  • I guess it is close to the solution of my problem. But when I use this I get a weird result. `splitted_routes = routes.apply(split_routes,axis=1)` gives me a Series as output, where each element seems to contain a DataFrame. – Nakeuh Feb 19 '20 at 12:56
  • You can try using stack() and reset_index() as suggested in answer. – Pankhuri Agarwal Feb 20 '20 at 11:13

1 Answer

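To explode multiple columns in pandas, the only prerequisite is that, in each row, the lists in all columns to be exploded have the same number of elements. Here nodes holds one more element than traveltimes, so first turn each list of nodes into a list of consecutive node pairs: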

def get_nodes(x):
    # A route with fewer than two nodes has no segment
    if len(x) < 2:
        return []
    # One [start, end] pair per travel time
    return [[x[i], x[i+1]] for i in range(len(x)-1)]

df['nodes'] = df['nodes'].apply(get_nodes)

After this, the data can be flattened with:

df = df.set_index('id').apply(lambda x: x.apply(pd.Series).stack()).reset_index().rename(columns={'level_1':'route_id'})
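
As a possible alternative to the `apply(pd.Series).stack()` idiom, here is a minimal sketch of the same flattening with `DataFrame.explode`. It assumes pandas 1.3 or later, where `explode` accepts a list of columns. The sample data and the names `edges` and `segments` are only illustrative and are not part of the original answer.

```python
import pandas as pd

# Small sample in the shape of the question's `routes` frame
routes = pd.DataFrame({
    "id": ["id-1", "id-2", "id-4"],
    "nodes": [
        ["node-A", "node-B"],
        ["node-A", "node-C", "node-D", "node-E"],
        ["node-A"],
    ],
    "traveltimes": [[6.0], [4.0, 80.0, 38.0], []],
})

# Pair consecutive nodes so that 'edges' and 'traveltimes' have equal lengths
edges = routes["nodes"].apply(lambda n: list(zip(n[:-1], n[1:])))

# Assumption: pandas >= 1.3, where explode() accepts a list of columns
segments = (
    routes[["id", "traveltimes"]]
    .assign(edges=edges)
    .explode(["edges", "traveltimes"], ignore_index=True)
)
print(segments)
```

Each row of `segments` is then one edge with its travel time. A route with a single node, such as id-4, is kept as one row with missing values instead of disappearing.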

To find all the rows whose travel time is greater than 70.0, we can simply filter:

df[df['traveltimes']>70]
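
The filter above only locates the long segments. To get the split table from the question, one option is to cut each route in plain Python and build the DataFrame once at the end, which avoids creating and concatenating one small DataFrame per row. The following is a sketch under that assumption and is not part of the original answer. The helper names `split_route` and `split_all_routes` are hypothetical, and the code assumes each traveltimes list has exactly one element fewer than its nodes list.

```python
import pandas as pd

def split_route(nodes, traveltimes, threshold):
    """Yield (nodes, traveltimes) per sub-route, cutting at times >= threshold."""
    if not nodes:
        return
    cur_nodes, cur_times = [nodes[0]], []
    # traveltimes[k] is the time needed to reach nodes[k + 1]
    for node, time in zip(nodes[1:], traveltimes):
        if time < threshold:
            cur_nodes.append(node)
            cur_times.append(time)
        else:
            # Close the current sub-route and start a new one at this node
            yield cur_nodes, cur_times
            cur_nodes, cur_times = [node], []
    yield cur_nodes, cur_times

def split_all_routes(routes, threshold=70):
    """Build the split table in a single DataFrame construction."""
    records = [
        {"id": rid, "route_id": k, "nodes": n, "traveltimes": t}
        for rid, nodes, times in zip(routes["id"], routes["nodes"], routes["traveltimes"])
        for k, (n, t) in enumerate(split_route(nodes, times, threshold))
    ]
    return pd.DataFrame(records, columns=["id", "route_id", "nodes", "traveltimes"])

# Usage on part of the question's sample data
routes = pd.DataFrame({
    "id": ["id-2", "id-3", "id-4"],
    "nodes": [
        ["node-A", "node-C", "node-D", "node-E"],
        ["node-B", "node-D"],
        ["node-A"],
    ],
    "traveltimes": [[4.0, 80.0, 38.0], [90.0], []],
})
splitted_routes = split_all_routes(routes, threshold=70)
print(splitted_routes)
```

Whether this is faster than a fully vectorized variant depends on the data, so it is worth timing on the real frame.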
Pankhuri Agarwal