Managing large Pandas Dataframes without much latency

Question

I got some csv-s that i need to edit based on some checks. The problem is that, some of the csvs are very large (arround 40 000) lines, and sometimes my program is running for hours to complete the needed checks. Bellow you'll find a part from my code:

sample input:

   Status    Date
1  Active    12/03/2020
2  Locked    12/03/2020
3  Active    NaN


for i in range(len(df)):

    if type(df.at[i,'Date']) == float:
        aa = df.loc[[i]]
        newdf = newdf.append(aa)
        df = df.drop([i])

df = df.reset_index(drop=True)
print("Passed date check")

for i in range(len(df)):

    if "ACTIVE" not in df.at[i,'Status']:
        aa = df.loc[[i]]
        newdf = newdf.append(aa)
        df = df.drop([i])

print(newdf)

output:

  Status    Date
  Locked    12/03/2020
  Active    NaN

I got a few more loops, like those ones. How i can rewrite the code so it will proccess those csvs faster ?

Please post a sample of your input data, and what your data should look like at the end. — Dave, May 14 '20 at 13:17
Does this help you ? https://stackoverflow.com/questions/44140489/get-non-numerical-rows-in-a-column-pandas-python — Ehsan, May 14 '20 at 13:19
@Ehsan, i think it doesn't, because the checks is not for numerical cells. — Alex Kalaidjiev, May 14 '20 at 13:32
It looks like you are splitting a dataframe in two parts according to rather simple tests. That can be done either directly at csv file level with the csv module to easily process file larger than available memory, or at pandas level for smaller files but with vectorized methods instead of iterating the dataframe. Can you confirm that your goal is just splitting? — Serge Ballesta, May 14 '20 at 13:41
@SergeBallesta , MY goal is to keep the rows that doesn't have any date, or the rows that are not active in the status column. — Alex Kalaidjiev, May 14 '20 at 13:48

score 0 · Answer 1 · answered May 14 '20 at 14:02

If you want to keep the rows have no date or are not marked active in the status column, you can just ask pandas for that:

newdf = df[df['Date'].isna()|(df['Status']!='Active')]

With you sample data, it gives as expected:

   Status        Date
2  Locked  12/03/2020
3  Active         NaN

score 0 · Answer 2 · answered May 14 '20 at 14:06

0

If you have any unformatted/misformatted Date it will detect it as well:

df.loc[(pd.to_datetime(df['Date'], errors='coerce').isnull()) | (df['Status'] != 'Active')]

results in :

    Status  Date
1   Locked  2020-12-03
2   Active  NaT

answered May 14 '20 at 14:06

Ehsan

711
2
7
21

Managing large Pandas Dataframes without much latency

2 Answers2