
I am running the code below on a file with close to 300k lines. I know my code is not very efficient, as it takes forever to finish. Can anyone advise me on how to speed it up?

import sys
import numpy as np
import pandas as pd


file = sys.argv[1]

df = pd.read_csv(file, delimiter=' ',header=None)
df.columns = ["ts", "proto", "orig_bytes", "orig_pkts", "resp_bytes", "resp_pkts", "duration", "conn_state"]

orig_bytes = np.array(df['orig_bytes'])
resp_bytes = np.array(df['resp_bytes'])


size = np.array([])
ts = np.array([])
for i in range(len(df)):
    if orig_bytes[i] > resp_bytes[i]:
        size = np.append(size, orig_bytes[i])
        ts = np.append(ts, df['ts'][i])
    else:
        size = np.append(size, resp_bytes[i])
        ts = np.append(ts, df['ts'][i])

The aim is to only record instances where one of the two (orig_bytes or resp_bytes) is the larger one.

Thank you all for your help.

Ndilo
  • I see a lot of `append`, which is not good. Have a look at `np.where`. For example: `size = np.where(orig_bytes > resp_bytes, orig_bytes, resp_bytes)`, or just `size = np.maximum(orig_bytes, resp_bytes)`. – Quang Hoang Dec 17 '19 at 20:33
  • Seems like a great use-case for `np.where` or a native pandas function, but for a good answer it would help to see a sample of your input and expected output. See [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – G. Anderson Dec 17 '19 at 20:34
  • Does this answer your question? [Creating a new column depending on the equality of two other columns](https://stackoverflow.com/questions/44067524/creating-a-new-column-depending-on-the-equality-of-two-other-columns) – G. Anderson Dec 17 '19 at 20:34
  • @QuangHoang's comment actually took care of the issue for me. Thank you everyone :) – Ndilo Dec 17 '19 at 21:14
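The `np.where`/`np.maximum` suggestion from the comments can be sketched as follows; the small inline DataFrame is hypothetical stand-in data for the real input file:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data in place of the 300k-line file
df = pd.DataFrame({
    "ts": [1.0, 2.0, 3.0],
    "orig_bytes": [100, 50, 70],
    "resp_bytes": [40, 90, 70],
})

# Element-wise maximum of the two columns, computed in one vectorized pass
# instead of a Python loop with np.append
size = np.maximum(df["orig_bytes"].to_numpy(), df["resp_bytes"].to_numpy())
ts = df["ts"].to_numpy()

print(size)  # [100  90  70]
```

Because this avoids re-allocating the arrays on every iteration (which is what repeated `np.append` does), it runs in linear time over the rows.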

1 Answer


I can't guarantee that this will run faster than what you have, but it is a more direct route to where you want to go. I'm assuming, based on your example, that you don't want to keep instances where the two byte values are equal, and that you want a separate DataFrame in the end rather than a new column in the existing `df`.

After you've created your DataFrame and renamed the columns, you can use query to drop all the instances where orig_bytes and resp_bytes are the same, create a new column with the max value of the two, and then narrow the DataFrame down to just the two columns you want.

df = pd.read_csv(file, delimiter=' ',header=None)
df.columns = ["ts", "proto", "orig_bytes", "orig_pkts", "resp_bytes", "resp_pkts", "duration", "conn_state"]

df_new = df.query("orig_bytes != resp_bytes").copy()  # .copy() avoids SettingWithCopyWarning on the next line
df_new['biggest_bytes'] = df_new[['orig_bytes', 'resp_bytes']].max(axis=1)
df_new = df_new[['ts', 'biggest_bytes']]

If you do want to include the entries where they are equal to each other, then just skip the query step.
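As a self-contained sketch, here is the same approach run end-to-end on hypothetical sample data (stand-in values, not the asker's real file):

```python
import pandas as pd

# Hypothetical sample rows; the third row has equal byte counts
df = pd.DataFrame({
    "ts": [1.0, 2.0, 3.0],
    "orig_bytes": [100, 50, 70],
    "resp_bytes": [40, 90, 70],
})

# Drop rows where the two byte counts are equal, then take the row-wise max
df_new = df.query("orig_bytes != resp_bytes").copy()
df_new["biggest_bytes"] = df_new[["orig_bytes", "resp_bytes"]].max(axis=1)
df_new = df_new[["ts", "biggest_bytes"]]

# df_new keeps ts 1.0 and 2.0 with biggest_bytes 100 and 90;
# the ts 3.0 row (70 == 70) is filtered out
print(df_new)
```

Like the `np.maximum` approach, this is fully vectorized, so it scales to 300k rows without the quadratic cost of repeated `np.append`.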

AreToo
  • Thanks, I had already filtered out instances where they were equal. Using `np.where` actually worked out quite well. – Ndilo Dec 17 '19 at 21:16