I have a large dataset (1 million+ rows, several GB) that changes on a regular basis. It models flow characteristics by relating each entry to its upstream neighbor in a flow network. The basic logic I want in the tool is: take a row's upstream ID field, find the row for that upstream device, and copy the number stored in that row's Num2 column into the current row's Num2 column. This lets me determine which "level" of the flow network each device is at. (A sample of the data appears in the output table further down.)
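To make the rule concrete, here is a toy two-row illustration with plain dicts (not my real code; the device names match the sample output below):

# Toy illustration: S1's upstream device is F1, so S1's Num2
# is copied from F1's Num2.
devices = {
    "F1": {"Up": None, "Num2": 1},    # feeder: already has its level
    "S1": {"Up": "F1", "Num2": None}, # should inherit level 1 from F1
}
devices["S1"]["Num2"] = devices[devices["S1"]["Up"]]["Num2"]
print(devices["S1"]["Num2"])  # 1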
Here is the code I am using:
import pandas as pd
import numpy as np

# Read the CSV file into a pandas DataFrame
df = pd.read_csv("L:\\Dev_h\\xxxxx.csv")

# Seed Num2 from Num1 where Num1 is populated (these rows anchor each level)
df['Num2'] = np.where(df['Num1'] > 0, df['Num1'], df['Num2'])
print(df)

# Rows that still need a Num2 value
null_num2_rows = df[df["Num2"].isnull()]

# For each row with a null Num2, find the row whose DevNo matches this
# row's Up field and already has a non-null Num2, and copy that value across
for row_index, row_data in null_num2_rows.iterrows():
    up_field = row_data["Up"]
    matched_row = df.loc[(df["DevNo"] == up_field) & ~df["Num2"].isnull()]
    if not matched_row.empty:
        df.at[row_index, "Num2"] = matched_row["Num2"].iloc[0]

print(df)

# Save the DataFrame to an Excel file
df.to_excel("L:\\ROAD ROUTING\\xxxxx.xlsx")
This seems to work fine as an approach for a very small file; here is an example of the output.
DevNo,Up,Down,Num1,Num2
F1,,S1,1,1
F2,S1,S2,2,2
F3,S4,S6,3,3
F4,S8,,4,4
S1,F1,F2,,1
S2,F2,S4,,2
S3,F2,S5,,2
S4,S2,F3,,2
S5,S3,T1,,2
S6,F3,S6,,3
S7,S6,S8,,3
S8,S7,F4,,3
However, it scales terribly on a large dataset, maxing out my memory. As far as I can tell, the df.loc lookup builds a boolean mask over the whole frame once per null row, so the loop does a full scan per row on top of the memory churn. I am very new to Python, so I don't really know how to update my logic to accommodate such a large dataset. Reading the file in chunks with pandas doesn't work, because the matching upstream row may not land in the same chunk as the row that references it.
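One direction I have been looking at, in case it helps frame an answer: replace the per-row scan with a DevNo-to-Num2 lookup and pandas' vectorized map, propagating one network level per pass. This is a minimal sketch, assuming DevNo values are unique; I have not tested it on the full data:

import pandas as pd

df = pd.read_csv("L:\\Dev_h\\xxxxx.csv")

# Seed Num2 from Num1 as before
df["Num2"] = df["Num1"].where(df["Num1"] > 0, df["Num2"])

# Propagate Num2 one network level per pass using a vectorized lookup
# instead of scanning the whole frame for every null row
while df["Num2"].isnull().any():
    lookup = df.set_index("DevNo")["Num2"]  # assumes DevNo is unique
    filled = df["Num2"].fillna(df["Up"].map(lookup))
    if filled.isnull().sum() == df["Num2"].isnull().sum():
        break  # no progress: remaining rows have no resolvable upstream
    df["Num2"] = filled

Even this still needs the whole frame in memory at once, so I'm not sure it's the right direction.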
How should I update my approach to better handle a large dataset?