
I have a list, say `hostlist = ['199.72.81.55', 'burger.letters.com']`. All I want is to get the matching rows from my dataframe. For example: when I search for `burger.letters.com`, my dataframe should return the host and timestamps for `burger.letters.com`. I tried doing it this way: `df.ix[host] for host in hostlist`. However, since I have 0.4 billion rows, just performing the for-loop over `df.ix[host]` takes more than 30 minutes.

It takes forever when I run the code below.

Below is what my dataframe looks like:

        host                  timestamp
    0   199.72.81.55          01/Jul/1995:00:00:01
    2   199.72.81.55          01/Jul/1995:00:00:09
    3   burger.letters.com    01/Jul/1995:00:00:11
    4   199.72.81.55          01/Jul/1995:00:00:12
    5   199.72.81.55          01/Jul/1995:00:00:13
    6   199.72.81.55          01/Jul/1995:00:00:14
    8   burger.letters.com    01/Jul/1995:00:00:15
    9   199.72.81.55          01/Jul/1995:00:00:15

I get my desired output like this:

    for host in hostlist:
        df.ix[host]

This operation returns the output below, but it is too heavy since I have 0.4 billion rows, and I want to optimize it.

    df.ix['burger.letters.com']
        host                  timestamp
    3   burger.letters.com    01/Jul/1995:00:00:11
    8   burger.letters.com    01/Jul/1995:00:00:15

    df.ix['199.72.81.55']
        host                  timestamp
    0   199.72.81.55          01/Jul/1995:00:00:01
    2   199.72.81.55          01/Jul/1995:00:00:09
    4   199.72.81.55          01/Jul/1995:00:00:12
    5   199.72.81.55          01/Jul/1995:00:00:13
    6   199.72.81.55          01/Jul/1995:00:00:14
    9   199.72.81.55          01/Jul/1995:00:00:15
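For reference, this kind of selection can also be expressed as a single vectorized boolean mask with `Series.isin`, avoiding a per-host lookup entirely. A minimal sketch, using a toy reconstruction of the sample rows above (the data is made up for illustration):

```python
import pandas as pd

# Toy reconstruction of the sample rows shown above
df = pd.DataFrame(
    {
        'host': ['199.72.81.55', '199.72.81.55', 'burger.letters.com',
                 '199.72.81.55', '199.72.81.55', '199.72.81.55',
                 'burger.letters.com', '199.72.81.55'],
        'timestamp': ['01/Jul/1995:00:00:01', '01/Jul/1995:00:00:09',
                      '01/Jul/1995:00:00:11', '01/Jul/1995:00:00:12',
                      '01/Jul/1995:00:00:13', '01/Jul/1995:00:00:14',
                      '01/Jul/1995:00:00:15', '01/Jul/1995:00:00:15'],
    },
    index=[0, 2, 3, 4, 5, 6, 8, 9],
)

hostlist = ['199.72.81.55', 'burger.letters.com']

# One vectorized membership test instead of one .ix lookup per host
matches = df[df['host'].isin(hostlist)]
```

The mask is computed once over the whole column, so the cost no longer scales with the number of hosts in the list.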

Below is my code (it takes more than 30 minutes):

    def block(host):
        temp_df = failedIP_df.ix[host]
        if len(temp_df) > 3:
            time_values = temp_df.set_index(keys='index')['timestamp']
            if (return_seconds(time_values[2:3].values[0]) - return_seconds(time_values[0:1].values[0])) <= 20:
                blocked_host.append(time_values[3:].index.tolist())

    list(map(block, failedIP_list))
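The same per-host rule can be sketched group-wise, iterating over `groupby` once instead of doing one `.ix` lookup per host. This is a sketch under assumptions: the timestamps are already converted to seconds (standing in for `return_seconds`), and the data below is made up:

```python
import pandas as pd

# Made-up log: integer seconds stand in for parsed timestamps
df = pd.DataFrame({
    'host': ['a', 'a', 'a', 'a', 'a', 'b', 'b'],
    'seconds': [0, 5, 10, 15, 30, 0, 100],
})

def exceeds_rate(g):
    # Same rule as block(): more than 3 requests, and the 3rd
    # request arrives within 20 seconds of the 1st
    return len(g) > 3 and g['seconds'].iloc[2] - g['seconds'].iloc[0] <= 20

# Collect the indices of every request after the 3rd for offending hosts
blocked_host = [
    idx
    for _, g in df.groupby('host')  # one pass over the grouped data
    if exceeds_rate(g)
    for idx in g.index[3:]
]
```

Grouping splits the frame once, so the subsetting cost is paid a single time rather than once per host in the list.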

I would really appreciate it if anyone could help.

jubins
  • 2
    I dont understand what you're trying to get out. – Henry Apr 03 '17 at 17:32
  • `failedIP_list.groupby(['host']).apply()` – Paul H Apr 03 '17 at 17:33
  • in other words, you almost never need to loop through dataframes – Paul H Apr 03 '17 at 17:33
  • So is the intended output a dataframe of only the failedIP_df items (index, timestamp) that are also in failedIP_list? And appear more than three times in failedIP_df? – Jammeth_Q Apr 03 '17 at 17:38
  • @PaulH Actually, `.apply` will generally *not* be faster than an equivalent for-loop. Essentially, `apply` wraps a Python for-loop with some additional overhead, and is usually slower than the equivalent for-loop. See [this](http://stackoverflow.com/a/38938507/5014455) answer. It is unclear what the OP is trying to accomplish, but likely they want something to do with `groupby`. – juanpa.arrivillaga Apr 03 '17 at 17:46
  • @juanpa.arrivillaga the difference is that I think the repetitive subsetting will be more performant w/ `groupby`. I agree that using a raw `apply` along the rows is about as a fast as a plain `loop`. – Paul H Apr 03 '17 at 18:24
  • 1
    @PaulH yeah, certainly `groupby` should speed things up significantly. However, it is hard to say since the for-loop is iterating over `failedIP_list`, which we don't know the contents.... (e.g. is it only a few IPs? Then parhaps it *is* faster than grouping by all IPs...). Hard to say. The OP hasn't given sufficient information or a useful, [reproducible](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) example. – juanpa.arrivillaga Apr 03 '17 at 18:27
  • So what class is this for? (as I saw this http://stackoverflow.com/questions/43218350/pandas-how-can-i-optimize-my-code ) – Back2Basics Apr 04 '17 at 22:01

1 Answer


Your question is very vague. Here's what I think you want:

def my_function(df):
    # this function should operate on a dataframe
    # that is a subset of your original
    return dfcopy

new_df = (
    df.groupby(by=['host'])
      .filter(lambda g: g.shape[0] > 3)
      .groupby(by=['host'])
      .apply(my_function)
)

The groupby/filter step removes groups with 3 or fewer rows. Then we groupby/apply to operate on each remaining group of rows sharing the same host value.
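Filling in the pattern with a concrete stand-in for `my_function` (the toy data and the min-timestamp logic below are made up for illustration; the real per-host logic would go in its place):

```python
import pandas as pd

# Toy log: host 'a' has 4 rows, host 'b' only 2
df = pd.DataFrame({
    'host': ['a', 'a', 'b', 'a', 'b', 'a'],
    'timestamp': [1, 2, 3, 4, 5, 6],
})

def my_function(g):
    # g is the sub-dataframe for one host; this stand-in simply
    # returns that host's earliest timestamp
    return g['timestamp'].min()

result = (
    df.groupby(by=['host'])
      .filter(lambda g: g.shape[0] > 3)  # keep hosts with more than 3 rows
      .groupby(by=['host'])
      .apply(my_function)
)
```

Host `'b'` is dropped by the filter, so `my_function` only ever sees the hosts that pass the size threshold.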

Paul H
  • Hi, I need the elements returned by `failedIP_df.ix[[host]]`, where the host is coming from `for host in failedIP_list`. So I managed to bring time down to 29 min for 0.4 billion rows in my dataframe. But it still takes 29min. Can I further optimize it? Thanks! – jubins Apr 04 '17 at 23:03
  • I have a list say: `list=['199.72.81.55', 'burger.letters.com']`. All I want now, is to get matching values from my dataframe. For example: when I search `burger.letters.com` my dataframe should return host, timestamps for `burger.letters.com`. I tried doing this way: `df.ix[host] for host in list` However, since I have 0.4 billion rows just performing forloop over `df.ix[host]` it takes more than 30min. I have updated my question now. Very sorry for the confusion and trouble. – jubins Apr 05 '17 at 15:44
  • did you try the `groupby` approach or are you still looping? – Paul H Apr 05 '17 at 15:57
  • I tried doing it like this: `failedIP_df = failedIP_df.groupby(by=['host']).filter(lambda g: g.shape[0]>3)`, but to get all values from my dataframe for matching hosts I still need to loop over the results of failedIP_df. I can't think of any other way to get my desired output even after groupby, or maybe I'm doing something wrong. – jubins Apr 05 '17 at 16:02
  • my function is there at the end of question – jubins Apr 05 '17 at 16:05
  • it doesn't return anything, what do you expect to happen? – Paul H Apr 05 '17 at 16:13