
I have a list, say `hostlist = ['199.72.81.55', 'burger.letters.com']`. All I want is to get the matching rows from my dataframe. For example: when I search for `burger.letters.com`, my dataframe should return the host and timestamps for `burger.letters.com`. I tried doing it this way: `df.ix[host] for host in hostlist`. However, since I have 0.4 billion rows, just performing the for-loop over `df.ix[host]` takes more than 30 minutes.

It takes forever when I run the code below.

Below is what my dataframe looks like:

        host                  timestamp
    0   199.72.81.55          01/Jul/1995:00:00:01
    2   199.72.81.55          01/Jul/1995:00:00:09
    3   burger.letters.com    01/Jul/1995:00:00:11
    4   199.72.81.55          01/Jul/1995:00:00:12
    5   199.72.81.55          01/Jul/1995:00:00:13
    6   199.72.81.55          01/Jul/1995:00:00:14
    8   burger.letters.com    01/Jul/1995:00:00:15
    9   199.72.81.55          01/Jul/1995:00:00:15

I get my desired output like this:

    for host in hostlist:
        df.ix[host]

This operation returns the output below, but it is too heavy since I have 0.4 billion rows, and I want to optimize it.

    df.ix['burger.letters.com']
        host                  timestamp
    3   burger.letters.com    01/Jul/1995:00:00:11
    8   burger.letters.com    01/Jul/1995:00:00:15

    df.ix['199.72.81.55']
        host                  timestamp
    0   199.72.81.55          01/Jul/1995:00:00:01
    2   199.72.81.55          01/Jul/1995:00:00:09
    4   199.72.81.55          01/Jul/1995:00:00:12
    5   199.72.81.55          01/Jul/1995:00:00:13
    6   199.72.81.55          01/Jul/1995:00:00:14
    9   199.72.81.55          01/Jul/1995:00:00:15
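For reference, this kind of selection can also be expressed as a single vectorized boolean mask with `Series.isin`, avoiding a per-host lookup entirely. A minimal sketch, using a toy reconstruction of the sample rows above (the data is made up for illustration):

```python
import pandas as pd

# Toy reconstruction of the sample rows shown above
df = pd.DataFrame(
    {
        'host': ['199.72.81.55', '199.72.81.55', 'burger.letters.com',
                 '199.72.81.55', '199.72.81.55', '199.72.81.55',
                 'burger.letters.com', '199.72.81.55'],
        'timestamp': ['01/Jul/1995:00:00:01', '01/Jul/1995:00:00:09',
                      '01/Jul/1995:00:00:11', '01/Jul/1995:00:00:12',
                      '01/Jul/1995:00:00:13', '01/Jul/1995:00:00:14',
                      '01/Jul/1995:00:00:15', '01/Jul/1995:00:00:15'],
    },
    index=[0, 2, 3, 4, 5, 6, 8, 9],
)

hostlist = ['199.72.81.55', 'burger.letters.com']

# One vectorized membership test instead of one .ix lookup per host
matches = df[df['host'].isin(hostlist)]
```

The mask is computed once over the whole column, so the cost no longer scales with the number of hosts in the list.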

Below is my code (it takes more than 30 minutes):

    def block(host):
        temp_df = failedIP_df.ix[host]
        if len(temp_df) > 3:
            time_values = temp_df.set_index(keys='index')['timestamp']
            if (return_seconds(time_values[2:3].values[0]) - return_seconds(time_values[0:1].values[0])) <= 20:
                blocked_host.append(time_values[3:].index.tolist())

    list(map(block, failedIP_list))
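The same per-host rule can be sketched group-wise, iterating over `groupby` once instead of doing one `.ix` lookup per host. This is a sketch under assumptions: the timestamps are already converted to seconds (standing in for `return_seconds`), and the data below is made up:

```python
import pandas as pd

# Made-up log: integer seconds stand in for parsed timestamps
df = pd.DataFrame({
    'host': ['a', 'a', 'a', 'a', 'a', 'b', 'b'],
    'seconds': [0, 5, 10, 15, 30, 0, 100],
})

def exceeds_rate(g):
    # Same rule as block(): more than 3 requests, and the 3rd
    # request arrives within 20 seconds of the 1st
    return len(g) > 3 and g['seconds'].iloc[2] - g['seconds'].iloc[0] <= 20

# Collect the indices of every request after the 3rd for offending hosts
blocked_host = [
    idx
    for _, g in df.groupby('host')  # one pass over the grouped data
    if exceeds_rate(g)
    for idx in g.index[3:]
]
```

Grouping splits the frame once, so the subsetting cost is paid a single time rather than once per host in the list.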

I would really appreciate it if anyone could help.

jubins
  • 2
    I dont understand what you're trying to get out. – Henry Apr 03 '17 at 17:32
  • `failedIP_list.groupby(['host']).apply()` – Paul H Apr 03 '17 at 17:33
  • in other words, you almost never need to loop through dataframes – Paul H Apr 03 '17 at 17:33
  • So is the intended output a dataframe of only the failedIP_df items (index, timestamp) that are also in failedIP_list? And appear more than three times in failedIP_df? – Jammeth_Q Apr 03 '17 at 17:38
  • @PaulH Actually, `.apply` will generally *not* be faster than an equivalent for-loop. Essentially, `apply` wraps a Python for-loop with some additional overhead, and is usually slower than the equivalent for-loop. See [this](http://stackoverflow.com/a/38938507/5014455) answer. It is unclear what the OP is trying to accomplish, but likely they want something to do with `groupby`. – juanpa.arrivillaga Apr 03 '17 at 17:46
  • @juanpa.arrivillaga the difference is that I think the repetitive subsetting will be more performant w/ `groupby`. I agree that using a raw `apply` along the rows is about as a fast as a plain `loop`. – Paul H Apr 03 '17 at 18:24
  • 1
    @PaulH yeah, certainly `groupby` should speed things up significantly. However, it is hard to say since the for-loop is iterating over `failedIP_list`, which we don't know the contents.... (e.g. is it only a few IPs? Then parhaps it *is* faster than grouping by all IPs...). Hard to say. The OP hasn't given sufficient information or a useful, [reproducible](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) example. – juanpa.arrivillaga Apr 03 '17 at 18:27
  • So what class is this for? (as I saw this http://stackoverflow.com/questions/43218350/pandas-how-can-i-optimize-my-code ) – Back2Basics Apr 04 '17 at 22:01

1 Answer


Your question is very vague. Here's what I think you want:

def my_function(df):
    # this function should operate on a dataframe
    # that is a subset of your original
    return dfcopy

new_df = (
    df.groupby(by=['host'])
      .filter(lambda g: g.shape[0] > 3)
      .groupby(by=['host'])
      .apply(my_function)
)

The groupby/filter step removes groups with 3 or fewer rows. Then we groupby/apply to operate on each remaining group of rows sharing the same host value.
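Filling in the pattern with a concrete stand-in for `my_function` (the toy data and the min-timestamp logic below are made up for illustration; the real per-host logic would go in its place):

```python
import pandas as pd

# Toy log: host 'a' has 4 rows, host 'b' only 2
df = pd.DataFrame({
    'host': ['a', 'a', 'b', 'a', 'b', 'a'],
    'timestamp': [1, 2, 3, 4, 5, 6],
})

def my_function(g):
    # g is the sub-dataframe for one host; this stand-in simply
    # returns that host's earliest timestamp
    return g['timestamp'].min()

result = (
    df.groupby(by=['host'])
      .filter(lambda g: g.shape[0] > 3)  # keep hosts with more than 3 rows
      .groupby(by=['host'])
      .apply(my_function)
)
```

Host `'b'` is dropped by the filter, so `my_function` only ever sees the hosts that pass the size threshold.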

Paul H
  • Hi, I need the elements returned by `failedIP_df.ix[[host]]`, where the host is coming from `for host in failedIP_list`. So I managed to bring time down to 29 min for 0.4 billion rows in my dataframe. But it still takes 29min. Can I further optimize it? Thanks! – jubins Apr 04 '17 at 23:03
  • I have a list say: `list=['199.72.81.55', 'burger.letters.com']`. All I want now, is to get matching values from my dataframe. For example: when I search `burger.letters.com` my dataframe should return host, timestamps for `burger.letters.com`. I tried doing this way: `df.ix[host] for host in list` However, since I have 0.4 billion rows just performing forloop over `df.ix[host]` it takes more than 30min. I have updated my question now. Very sorry for the confusion and trouble. – jubins Apr 05 '17 at 15:44
  • did you try the `groupby` approach or are you still looping? – Paul H Apr 05 '17 at 15:57
  • I tried doing it like this: `failedIP_df = failedIP_df.groupby(by=['host']).filter(lambda g: g.shape[0]>3)`, but to get all values from my dataframe for matching hosts I still need to loop over the results of failedIP_df. I can't think of any other way to get my desired output even after groupby, or maybe I'm doing something wrong. – jubins Apr 05 '17 at 16:02
  • my function is there at the end of question – jubins Apr 05 '17 at 16:05
  • it doesn't return anything, what do you expect to happen? – Paul H Apr 05 '17 at 16:13