
Using Python 2.7, I am trying to write the result of a "robot check" into a data frame column while iterating over the data frame (although I suppose this applies in other circumstances). I have tried:

import robotparser
import urlparse
import pandas as pd
df = pd.DataFrame(dict(A=['http://www.python.org'
                          ,'http://www.junksiteIamtellingyou.com'
                         ]))

df
    A
0   http://www.python.org
1   http://www.junksiteIamtellingyou.com

agent_name = 'Test'
for i in df['A']:
    try:
        parser = robotparser.RobotFileParser()
        parser.set_url(urlparse.urljoin(i,"robots.txt"))
        parser.read()
    except Exception as e:
        df['Robot'] =  'No Robot.txt'
    else:
        df['Robot'] =  parser.can_fetch(agent_name, i)
df
    A                                       Robot
0   http://www.python.org                   No Robot.txt <<<-- NOT CORRECT
1   http://www.junksiteIamtellingyou.com    No Robot.txt

What is happening, of course, is that the last value of the iteration overwrites the entire column. The value of Robot for the first row should be True (which can be demonstrated by deleting the junk URL from the data frame).

I have tried some different permutations of .loc, but can't get them to work. They always seem to add rows rather than update the new column for the existing row.

So, is there a way to specify which row of the column is being updated (with the function result)? Perhaps using .loc (location), or perhaps there is another way, such as using a lambda? I would appreciate your help.

Pat Stroh

1 Answer


There is an apply for that:

import robotparser
import urlparse
import pandas as pd
df = pd.DataFrame(dict(A=['http://www.python.org'
                          ,'http://www.junksiteIamtellingyou.com']))

def parse(i, agent_name):
    try:
        parser = robotparser.RobotFileParser()
        parser.set_url(urlparse.urljoin(i, "robots.txt"))
        parser.read()
    except Exception as e:
        return 'No Robot.txt'
    else:
        return parser.can_fetch(agent_name, i)

df['Robot'] = df['A'].apply(parse, args=('Test',))
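
For comparison, the loop from the question can also be kept if each result is written to a single cell instead of the whole column. The following is only a sketch: `check` is a hypothetical stand-in for the robots.txt lookup so that it runs without network access, and `.at` is used for the single-cell write.

```python
import pandas as pd

df = pd.DataFrame(dict(A=['http://www.python.org',
                          'http://www.junksiteIamtellingyou.com']))

def check(url):
    # Hypothetical stand-in for the robots.txt lookup (an assumption,
    # not the real parser): it fails for any URL containing 'junk'.
    if 'junk' in url:
        raise IOError('unreachable host')
    return True

df['Robot'] = None  # create the column once, with object dtype
for idx, url in df['A'].items():
    try:
        result = check(url)
    except Exception:
        result = 'No Robot.txt'
    # idx is the row label (0, 1), not the URL, so this updates
    # exactly one existing cell.
    df.at[idx, 'Robot'] = result

print(df)
```

Assigning through `.loc` with a label that is not in the index appends a row, so indexing by the URL value rather than by the row label is one possible cause of the extra rows described in the question.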
unutbu
  • This works perfectly. Thank you. At first, I thought there was a minor error in your code (it didn't make sense to me that args=('Test',) would not be args=(,'Test')), but now I see that the iterated value comes out of the df['A'].apply part. At least that's how I am making sense of it. Well done; thanks again. – Pat Stroh Aug 01 '15 at 14:50
  • The `args` parameter expects a sequence, such as a tuple or list, as its argument. The contents of the sequence are passed as additional positional arguments to `parse`. `('Test',)` is a tuple containing one value. The comma is necessary because Python evaluates `('Test')` as the string `'Test'`. *The comma makes the tuple.* `args = ['Test']` would work, too. – unutbu Aug 01 '15 at 17:05
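
The point about the trailing comma can be checked without pandas. This is a minimal sketch in which `parse` is a hypothetical stand-in that just returns its arguments, and the final call only approximates what `apply` does for each element.

```python
def parse(url, agent_name):
    # Hypothetical stand-in for the answer's parse(); no network access.
    return (url, agent_name)

not_a_tuple = ('Test')   # parentheses only group: this is the string 'Test'
one_tuple = ('Test',)    # the trailing comma makes a one-element tuple

print(type(not_a_tuple).__name__)
print(type(one_tuple).__name__)

# Roughly what apply does per element: the element is passed first,
# then the contents of args are unpacked as extra positional arguments.
result = parse('http://www.python.org', *one_tuple)
print(result)
```

This is also why `args=(,'Test')` is not needed (and is a syntax error): the element taken from `df['A']` always fills the first parameter, and `args` supplies only the remaining ones.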