
I have a large dataset (1 million plus rows, several GB) that changes on a regular basis. It models flow characteristics by relating each entry to its upstream neighbor in a flow network. The basic logic I want in the tool is: take an ID field, search for the related upstream device, and copy the number stored in a different column (Num2) of that upstream entry into the current row. This lets me determine which "level" of the flow network I am at. Here is a sample of the data:

[sample data image]

Here is the code I am using:

import pandas as pd
import numpy as np

# Read the CSV file into a Pandas DataFrame
df = pd.read_csv("L:\\Dev_h\\xxxxx.csv")

#put values into the Number field initially:
df['Num2'] = np.where(df['Num1'] > 0, df['Num1'], df['Num2'])
print(df)

null_num2_rows = df[df["Num2"].isnull()]

# For each row with a null Num2 field, find the row with the same ID and a non-null Num2 field
for row_index, row_data in null_num2_rows.iterrows():

    # Get the Upstream device field from the current row
    up_field = row_data["Up"]

    # Find the row with the same ID and a non-null Num2 field
    matched_row = df.loc[(df["DevNo"] == up_field) & ~df["Num2"].isnull()]

    # Set the Num2 field of the current row with the Num2 field of the matched row
    df.at[row_index, "Num2"] = matched_row["Num2"].iloc[0]
print(df)
# Save the DataFrame to an excel
df.to_excel("L:\\ROAD ROUTING\\xxxxx.xlsx")

This seems to work fine as an approach for a very small file; here is an example of the output.

DevNo  Up  Down  Num1  Num2
F1         S1    1     1
F2     S1  S2    2     2
F3     S4  S6    3     3
F4     S8        4     4
S1     F1  F2          1
S2     F2  S4          2
S3     F2  S5          2
S4     S2  F3          2
S5     S3  T1          2
S6     F3  S6          3
S7     S6  S8          3
S8     S7  F4          3

However, it scales terribly on a large dataset and maxes out my memory. I am very new to Python, so I don't really know how to update my logic to accommodate such a large dataset. Reading the file in chunks with pandas doesn't work, because the matching upstream row may not be located in the same chunk as the row being searched.

How should I update my approach to better handle a large dataset?

Jackson Dunn
  • Please [edit] your question and add a small sample input and the expected output (as text) – Andrej Kesely Aug 21 '23 at 20:41
  • The RIGHT answer, of course, is to store all of this in a database on a real database server, like MySQL or Postgres, and stop trying to manage gigabytes of data with Excel. – Tim Roberts Aug 21 '23 at 20:43
  • Thank you @TimRoberts! As I'm not the database engineer, I'm forced to work with what I have. Let's just say...ahem...there wasn't a lot of forethought put into this system. However, I'd love any input you have about how better to do this. The to_excel at the end is only for readability; I will probably export to csv in a multi gdb scenario – Jackson Dunn Aug 21 '23 at 20:53
  • I don't see a `GUID` field in the data. Can you clarify what that should refer to? – Nick ODell Aug 21 '23 at 20:55
  • My bad @NickODell. I scrubbed the real field names for confidentiality and forgot to change everything. Edits made – Jackson Dunn Aug 21 '23 at 20:58
  • Question about the characteristics of the data: is the data big because there are lots of entries that are part of different levels, or is it big because the chains in each level are very long? What is the longest chain of entries in your dataset? How many distinct values of Num1 are there in your dataset? – Nick ODell Aug 21 '23 at 21:02
  • To Tim Roberts' point: if you're writing a Python program to "do the thing" then you should be able to reach for any tool in your toolbag. Python has a "batteries included" sqlite3 library, where sqlite can handle up to 281 terabytes of data in a single database, specializes in fast retrieval and updating, and stores field data more efficiently. So, in other words, pandas may not be the right solution for your job. That being said, pandas lets you [read in chunks](https://stackoverflow.com/questions/25962114/how-do-i-read-a-large-csv-file-with-pandas) (a rough sqlite3 sketch follows the comment thread). – VoNWooDSoN Aug 21 '23 at 21:12
  • If you have more than a million rows, why are you saving it in an xlsx file? Excel is unable to deal with more than 1048576 rows. – Anon Coward Aug 22 '23 at 01:35
  • @NickODell it's big because of both, actually. It models flow in a multinodal network, so there are both a large number of devices fed from source nodes (second- to 8th-level devices) and a huge number of sources in the network as well. – Jackson Dunn Aug 22 '23 at 11:40
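
For reference, here is a rough sketch of the sqlite3 route suggested in the comments. It is an illustration only, not code from the thread: the database file name, CSV path, table name, and chunk size are assumptions, and the column names follow the sample data above.

import sqlite3

import pandas as pd

conn = sqlite3.connect("flow.db")  # on-disk database, so RAM is not the limit

# Load the CSV into SQLite in manageable chunks
for chunk in pd.read_csv("data.csv", chunksize=200_000):
    chunk.to_sql("devices", conn, if_exists="append", index=False)

# Index the join key so upstream lookups don't scan the whole table
conn.execute("CREATE INDEX IF NOT EXISTS idx_devno ON devices(DevNo)")

# Seed Num2 from Num1, then pull the remaining values from the upstream device
conn.execute("UPDATE devices SET Num2 = Num1 WHERE Num1 > 0")
conn.execute("""
    UPDATE devices
    SET Num2 = (SELECT u.Num2 FROM devices AS u WHERE u.DevNo = devices.Up)
    WHERE Num2 IS NULL
""")
conn.commit()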

1 Answer


Try this:

import numpy as np
import pandas as pd

df = pd.read_csv("data.csv")

# Seed Num2 from Num1 wherever Num1 is populated
df['Num2'] = np.where(df['Num1'] > 0, df['Num1'], df['Num2'])

arr = []
# One dictionary mapping DevNo -> Num2 for O(1) lookups
lookup = df.set_index("DevNo")["Num2"].to_dict()

for devno, up, num2 in zip(df["DevNo"], df["Up"], df["Num2"]):
    if pd.isna(up):
        # No upstream device: keep whatever Num2 already holds
        arr.append(num2)
    elif pd.isna(num2):
        # Missing Num2: take it from the upstream device and remember it
        lookup[devno] = lookup[up]
        arr.append(lookup[devno])
    else:
        arr.append(num2)

df["Num2"] = arr

Why it's fast:

  • Only one pass through the dataframe
  • zip is used instead of df.iterrows: zip yields a plain tuple per row, which is much cheaper than the Series that iterrows creates for each row
  • A dictionary lookup (lookup[devno], lookup[up]) is several orders of magnitude faster than using df.loc (see the timing sketch below)
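
As a rough illustration of that last point, here is a small timing sketch. The data and device names are entirely synthetic (not taken from the question); it compares a single df.loc filter against a dictionary lookup:

import timeit

import pandas as pd

df = pd.DataFrame({"DevNo": [f"S{i}" for i in range(100_000)],
                   "Num2": range(100_000)})
lookup = df.set_index("DevNo")["Num2"].to_dict()

# Each call scans the whole DevNo column to build a boolean mask
loc_time = timeit.timeit(lambda: df.loc[df["DevNo"] == "S99999", "Num2"].iloc[0], number=100)
# Each call is a single hash lookup
dict_time = timeit.timeit(lambda: lookup["S99999"], number=100)

print(f"df.loc lookup: {loc_time:.4f}s for 100 lookups")
print(f"dict lookup:   {dict_time:.6f}s for 100 lookups")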

I don't think your problem is solved simply by migrating to a database. Some operations can be improved, such as reading the CSV file in chunks to avoid blowing up the memory usage (sketched below). But since your loop has side effects, any database solution will incur additional write costs as well, whereas with Python you can keep those writes in memory (the lookup dictionary).
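
If memory is still the bottleneck, the chunked reading could be combined with the in-memory lookup in two passes. This is only a rough sketch of that idea, not a drop-in solution: the file paths and chunk size are placeholders, the lookup dictionary still lives in memory, and it has the same ordering limitation as the loop above (an upstream row whose own Num2 has not been resolved yet still yields NaN).

import numpy as np
import pandas as pd

CSV_PATH = "data.csv"   # placeholder path
CHUNK_SIZE = 200_000    # tune to the available memory

# Pass 1: build the DevNo -> Num2 lookup without holding the full file in memory
lookup = {}
for chunk in pd.read_csv(CSV_PATH, usecols=["DevNo", "Num1", "Num2"], chunksize=CHUNK_SIZE):
    chunk["Num2"] = np.where(chunk["Num1"] > 0, chunk["Num1"], chunk["Num2"])
    lookup.update(chunk.set_index("DevNo")["Num2"].to_dict())

# Pass 2: fill missing Num2 from the upstream device and stream the result to disk
first = True
for chunk in pd.read_csv(CSV_PATH, chunksize=CHUNK_SIZE):
    chunk["Num2"] = np.where(chunk["Num1"] > 0, chunk["Num1"], chunk["Num2"])
    arr = []
    for devno, up, num2 in zip(chunk["DevNo"], chunk["Up"], chunk["Num2"]):
        if pd.isna(num2) and not pd.isna(up):
            # Resolve from the upstream device and remember the result
            lookup[devno] = lookup.get(up, np.nan)
            arr.append(lookup[devno])
        else:
            arr.append(num2)
    chunk["Num2"] = arr
    chunk.to_csv("output.csv", mode="w" if first else "a", header=first, index=False)
    first = False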

Code Different