4

I want to use pandas to read a CSV file that contains nodes and their attributes. Not all nodes have every attribute, and the missing attributes are simply left blank in the CSV file. When pandas reads the file, the blank values appear as NaN. I want to add the nodes to a networkx graph in bulk from the DataFrame, but avoid adding attributes whose value is NaN.

For example, here is a sample csv file called mwe.csv:

Name,Cost,Depth,Class,Mean,SD,CST,SL,Time
Manuf_0001,39.00,1,Manuf,,,12,,10.00
Manuf_0002,36.00,1,Manuf,,,8,,10.00
Part_0001,12.00,2,Part,,,,,28.00
Part_0002,5.00,2,Part,,,,,15.00
Part_0003,9.00,2,Part,,,,,10.00
Retail_0001,0.00,0,Retail,253,36.62,0,0.95,0.00
Retail_0002,0.00,0,Retail,45,1,0,0.95,0.00
Retail_0003,0.00,0,Retail,75,2,0,0.95,0.00

Here's how I'm currently handling this:

import pandas as pd
import numpy as np
import networkx as nx

node_df = pd.read_csv('mwe.csv')

graph = nx.DiGraph()
graph.add_nodes_from(node_df['Name'])
nx.set_node_attributes(graph, dict(zip(node_df['Name'], node_df['Cost'])), 'nodeCost')
nx.set_node_attributes(graph, dict(zip(node_df['Name'], node_df['Mean'])), 'avgDemand')
nx.set_node_attributes(graph, dict(zip(node_df['Name'], node_df['SD'])), 'sdDemand')
nx.set_node_attributes(graph, dict(zip(node_df['Name'], node_df['CST'])), 'servTime')
nx.set_node_attributes(graph, dict(zip(node_df['Name'], node_df['SL'])), 'servLevel')

# Loop through all nodes and all attributes and remove NaNs.
for i in graph.nodes:
    for k, v in list(graph.nodes[i].items()):
        if np.isnan(v):
            del graph.nodes[i][k]

It works, but it's clunky. Is there a better way, e.g., a way to skip the NaNs when adding the nodes, rather than deleting them afterwards?

LarrySnyder610

2 Answers

2

You can let pandas do the filtering for you. The function below builds a Series from the value column, indexed by the key column, drops the entries that are NaN, and converts the result to a dictionary:

def create_node_attribs(key_col, val_col):
    # Reads the module-level node_df. Since it was the only DataFrame here,
    # only the column names are passed; pass the DataFrame as an argument
    # instead if you prefer. Requires `from pandas import Series`.
    # Using .values avoids index alignment between the two columns.
    return Series(node_df[val_col].values,
                  index=node_df[key_col]).dropna().to_dict()
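To make the effect of dropna() visible, here is a minimal sketch on a few rows of the sample data. It assumes a variant of the function that takes the DataFrame as an explicit argument (the extra `df` parameter is not part of the answer's code), and it reads the rows from an inline string so it runs without mwe.csv.

```python
import io

import pandas as pd

# A few rows and columns from mwe.csv, inlined so the sketch is self-contained.
csv_text = """Name,Cost,Mean,SL
Manuf_0001,39.00,,
Part_0001,12.00,,
Retail_0001,0.00,253,0.95
"""
node_df = pd.read_csv(io.StringIO(csv_text))


def create_node_attribs(df, key_col, val_col):
    # Hypothetical variant: same logic as above, but the DataFrame is passed in.
    # Series indexed by key -> drop NaN entries -> {key: value} dictionary.
    return pd.Series(df[val_col].values, index=df[key_col]).dropna().to_dict()


mean_attribs = create_node_attribs(node_df, "Name", "Mean")
cost_attribs = create_node_attribs(node_df, "Name", "Cost")

print(mean_attribs)  # only Retail_0001 has a Mean value
print(cost_attribs)  # every node has a Cost value
```

Nodes whose value is NaN never make it into the dictionary, so nx.set_node_attributes has nothing to set for them.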

Here is the complete code

import pandas as pd
import networkx as nx
from pandas import Series

node_df = pd.read_csv('mwe.csv')

graph = nx.DiGraph()

def create_node_attribs(key_col, val_col):
    # Reads the module-level node_df. Since it is the only DataFrame here,
    # only the column names are passed; pass the DataFrame as an argument
    # instead if you prefer.
    # Series indexed by key -> drop NaN entries -> {key: value} dictionary.
    return Series(node_df[val_col].values,
                  index=node_df[key_col]).dropna().to_dict()

graph.add_nodes_from(node_df['Name'])
nx.set_node_attributes(graph, create_node_attribs('Name', 'Cost'), 'nodeCost')
nx.set_node_attributes(graph, create_node_attribs('Name', 'Mean'), 'avgDemand')
nx.set_node_attributes(graph, create_node_attribs('Name', 'SD'), 'sdDemand')
nx.set_node_attributes(graph, create_node_attribs('Name', 'CST'), 'servTime')
nx.set_node_attributes(graph, create_node_attribs('Name', 'SL'), 'servLevel')

Link to Google Colab Notebook with the code.

Also see this answer for a timing comparison of the method used here.
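If you would rather attach the attributes while adding the nodes, one possible alternative is to pass (node, attribute_dict) pairs to add_nodes_from and drop the NaNs row by row. This is only a sketch and not the method above: the `attr_names` mapping and the inline CSV text are assumptions made for illustration, with the attribute names taken from the question.

```python
import io

import networkx as nx
import pandas as pd

# A few rows from mwe.csv, inlined so the sketch is self-contained.
csv_text = """Name,Cost,Depth,Class,Mean,SD,CST,SL,Time
Manuf_0001,39.00,1,Manuf,,,12,,10.00
Part_0001,12.00,2,Part,,,,,28.00
Retail_0001,0.00,0,Retail,253,36.62,0,0.95,0.00
"""
node_df = pd.read_csv(io.StringIO(csv_text))

# Hypothetical mapping: CSV column -> node attribute name used in the question.
attr_names = {
    "Cost": "nodeCost",
    "Mean": "avgDemand",
    "SD": "sdDemand",
    "CST": "servTime",
    "SL": "servLevel",
}

# Keep only the mapped columns, indexed by node name, with attribute names.
renamed = node_df.set_index("Name")[list(attr_names)].rename(columns=attr_names)

graph = nx.DiGraph()
# Each row becomes (node, {attribute: value}); dropna() removes the NaN entries
# for that row before the dictionary is built.
graph.add_nodes_from(
    (name, row.dropna().to_dict()) for name, row in renamed.iterrows()
)

print(graph.nodes(data=True))
```

iterrows() is slower than the column-wise approach on large frames, so this trades some speed for a single pass with no cleanup step.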

Gambit1614
0

Pass keep_default_na=False when reading your CSV with pandas, so that empty fields are read as empty strings instead of NaN:

pd.read_csv('data.csv', keep_default_na=False)

See also: Get pandas.read_csv to read empty values as empty string instead of nan
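A caveat worth checking before relying on this: with keep_default_na=False, a column that contains empty fields is read as strings rather than floats, so the empty values still have to be filtered out and the rest converted. A minimal sketch, assuming an inline copy of a few rows from the question's data:

```python
import io

import pandas as pd

# Two rows from the question's data, inlined so the sketch is self-contained.
csv_text = """Name,Cost,Mean,SL
Manuf_0001,39.00,,
Retail_0001,0.00,253,0.95
"""
df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# The Mean column now holds strings, including "" for the missing value.
print(df["Mean"].tolist())

# Skip the empty strings and convert the remaining values back to numbers.
mean_attribs = {
    name: float(value)
    for name, value in zip(df["Name"], df["Mean"])
    if value != ""
}
print(mean_attribs)
```

So this option replaces the NaN check with an empty-string check; it does not by itself keep the missing attributes out of the graph.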

crocefisso