In a nutshell, I am trying to apply my function to selected rows only. I have it working by subsetting the dataframe, running the function, and then merging the subset back into the main dataframe. However, that is cumbersome, and there has to be a more efficient solution that escapes me. I found several useful posts (here, here and here) that helped improve my code.

Here is a sample dataframe:

import numpy as np
import pandas as pd

data = {'firm': ['Smith', 'Jones', 'Smith New York', 'Jones International', 'Winter'],
        'id': [np.nan, 732, 216, np.nan, 1714],
        'url1': ['url', np.nan, 'url', 'url', 'url'],
        'url2': ['url', 'url', 'url', np.nan, 'url'],
        'text': ['foo', 'bar', np.nan, np.nan, 'foo bar']}
df = pd.DataFrame(data)

The function below parses a website. The user can set a keyword to search the already downloaded files and use that stored data if present. If the last crawl happened a while ago, a new crawl of the updated website is needed.

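def fetch(id, url, **kwargs):
    # backup arrives through **kwargs; the 'Yes' default is an assumption
    backup = kwargs.get('backup', 'Yes')
    if backup == 'Yes':
        print('Fetching {} from {}'.format(id, url))
        # Actual fetching code
    else:
        print('Loading stored data for {}'.format(id))
        # Actual loading code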

The function works when I test it on individual URLs, but I run into problems when I try to apply it. I have multiple conditions for when to run it; currently I use them to subset the dataframe. Note: if two URLs are present, url1 is preferred. According to the pandas documentation, keyword arguments can be passed through apply. Initially I tried np.where. There are four conditions in total; below are two:

df['content'] = np.where(df['text'].isna() & df['url1'].notnull() &
                            df['url2'].notnull() & df['firm'].str.contains('Smith'),
                         df['url1'].apply(fetch, args=df['id'], backup='Yes'),
                         np.where(df['text'].isna() & df['url1'].notnull() & 
                                    df['url2'].isna() & df['firm'].str.contains('Smith'),
                                  df['url1'].apply(fetch, backup='Yes'),
                                  np.nan))
TypeError: fetch() takes 2 positional arguments but --some other number-- were given
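
The error above is consistent with how Series.apply forwards its arguments: each element is passed as the first positional argument, and everything in args is appended unchanged to every call, so a whole Series of ids is unpacked into many positional arguments. In addition, np.where only selects between branches that have already been computed for every row. A minimal sketch illustrating both points, assuming a hypothetical stand-in fetch_stub (not the real fetch) that only records its calls:

```python
import numpy as np
import pandas as pd

calls = []  # records every invocation of the stub


def fetch_stub(url, id, backup='Yes'):
    # Hypothetical stand-in for fetch(); the Series element arrives first.
    calls.append((url, id, backup))
    return '{}:{}'.format(id, backup)


urls = pd.Series(['a', 'b', 'c'])
mask = pd.Series([True, False, False])

# args is a tuple of extra positional arguments shared by every call;
# keyword arguments are forwarded to the function as well.
shared = urls.apply(fetch_stub, args=(99,), backup='No')

# np.where receives fully computed branches, so the stub runs on all rows.
calls.clear()
eager = np.where(mask, urls.apply(fetch_stub, args=(99,)), np.nan)
eager_calls = len(calls)

# Filtering first restricts the calls to the selected rows.
calls.clear()
lazy = urls[mask].apply(fetch_stub, args=(99,))
lazy_calls = len(calls)
```

Because args is the same for every element, a per-row value such as id cannot be supplied this way; it has to come from the row itself.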

Hence, passing a pandas Series as an argument does not work, and I cannot figure out how to pass it as a scalar. Another failed approach, using only two of the columns:

df[['id', 'url1']][df['text'].isna() & df['url1'].notnull() &
    df['url2'].notnull() & df['firm'].str.contains('Smith')].apply(fetch) # Should fetch one
df[['id', 'url1']][df['text'].isna() & df['url1'].notnull() &
    df['url2'].isna() & df['firm'].str.contains('Smith')].apply(fetch) # Should fetch nothing
TypeError: ("fetch() missing 1 required positional argument: 'url'", 'occurred at index id')
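
The "occurred at index id" part of the message suggests the function was called once per column rather than once per row: DataFrame.apply defaults to axis=0 and passes each column as a single Series argument. With axis=1 it passes each row instead, and the needed fields can be unpacked by name. A minimal sketch on the sample dataframe, with a hypothetical fetch_stub in place of the real fetch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'firm': ['Smith', 'Jones', 'Smith New York', 'Jones International', 'Winter'],
    'id': [np.nan, 732, 216, np.nan, 1714],
    'url1': ['url', np.nan, 'url', 'url', 'url'],
})


def fetch_stub(id, url, **kwargs):
    # Hypothetical stand-in for fetch(); returns a string instead of crawling.
    return 'fetched {} from {}'.format(id, url)


# Default axis=0: the function receives one whole column at a time.
per_column = df[['id', 'url1']].apply(lambda col: col.name)

# axis=1: the function receives one row; unpack the fields it needs.
per_row = df[['id', 'url1']].apply(
    lambda row: fetch_stub(row['id'], row['url1']), axis=1)
```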

And finally I tried lambda:

df['text'].where(df['text'].isna() & df['url1'].notnull() & df['url2'].isna()
   & df['firm'].str.contains('Smith'), df[['id', 'url1']].apply(lambda x,y: get_XML(x,y)))
TypeError: ("<lambda>() missing 1 required positional argument: 'y'", 'occurred at index id')

I assume I am missing something simple, but obviously crucial. Any pointers are appreciated.


Edit - Solution


I took comments from Damien Ayers (see below) to heart and simplified the code. This then also put me on the path to the solution:

def get_ft(text, id, url1, url2, firm, backup='Yes'):
    if pd.notnull(id):
        # url1 is preferred whether or not url2 is present
        if pd.isna(text) and pd.notnull(url1) and (pd.notnull(url2) or pd.isna(url2)):
            if 'Smith' in firm:
                # fetch() accepts backup only as a keyword argument
                return fetch(id, url1, backup=backup)
            ... code continues

And here is the proper use of apply and lambda, thanks to this discussion:

df['text_new'] = df.apply(lambda x: get_ft(x['text'], x['id'], x['url1'],
                                           x['url2'], x['firm'], backup=backup), axis=1)

Much cleaner and more importantly it works.
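
For comparison, the row selection can also be written with named boolean masks and df.loc, which calls the function only on the matching rows and avoids the subset-and-merge step. This is a sketch under assumptions, not the code from the solution above: fetch_stub is a hypothetical stand-in for fetch, and only the 'Smith' condition is shown.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'firm': ['Smith', 'Jones', 'Smith New York', 'Jones International', 'Winter'],
    'id': [np.nan, 732, 216, np.nan, 1714],
    'url1': ['url', np.nan, 'url', 'url', 'url'],
    'url2': ['url', 'url', 'url', np.nan, 'url'],
    'text': ['foo', 'bar', np.nan, np.nan, 'foo bar'],
})


def fetch_stub(id, url, backup='Yes'):
    # Hypothetical stand-in for fetch(); returns a string instead of crawling.
    if backup == 'Yes':
        return 'fetched {} from {}'.format(id, url)
    return 'loaded stored data for {}'.format(id)


# Name each condition once so the combined mask stays readable.
has_id = df['id'].notna()
needs_text = df['text'].isna()
has_url1 = df['url1'].notna()
is_smith = df['firm'].str.contains('Smith')
selected = has_id & needs_text & has_url1 & is_smith

# Start from the existing text and overwrite only the selected rows.
df['text_new'] = df['text']
if selected.any():
    df.loc[selected, 'text_new'] = df.loc[selected].apply(
        lambda row: fetch_stub(row['id'], row['url1'], backup='Yes'), axis=1)
```

With the sample data, only the "Smith New York" row satisfies all four masks, so the stub is called once and every other row keeps its original text value.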

raummensch
  • In your `np.where` attempt there's two spots where `url1` in `df['url]` is missing a closing quote mark. But maybe that's an error that happened when writing the question? – Damien Ayers Apr 22 '19 at 23:18
  • Thanks @DamienAyers. I fixed it here. It did not do the trick, unfortunately. – raummensch Apr 22 '19 at 23:21
  • When the conditionals start getting this complicated, it's often worth writing it out in long form as separate functions or named variables, rather than combining so much into a single expression. – Damien Ayers Apr 22 '19 at 23:21
  • The `TypeError: ("fetch() missing 1 required positional argument: 'url'", 'occurred at index id')` error is due to fetch expecting a keyword argument `url`, but pandas is calling it with the keyword `url1`, since that's what's in the dataframe. – Damien Ayers Apr 22 '19 at 23:32
  • There's also some more typos in the code examples, that make it a bit harder to run. `fd` instead of `df`, and `fetch` hasn't defined the `backup` variable and has a `=` instead of `==`. – Damien Ayers Apr 22 '19 at 23:34
  • I tried to copy an abridged version of the code to simplify it. I will fix these mistakes. – raummensch Apr 22 '19 at 23:36
  • Done. As for your "backup" comment. The way I understand it is that **kwargs allows for various keywords; here I try to pass "backup". – raummensch Apr 22 '19 at 23:40
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/192241/discussion-between-damien-ayers-and-raummensch). – Damien Ayers Apr 23 '19 at 00:16
  • Great idea... there. – raummensch Apr 23 '19 at 00:30
