In a nutshell, I am trying to apply my function to selected rows. I have it working by subsetting the dataframe, running the function and then merging the subset back into the main dataframe. However, that is cumbersome and there has to be a more efficient solution that escapes me. I found several useful posts (here, here and here) that helped improve my code.
Here is a sample dataframe:
import numpy as np
import pandas as pd

data = {'firm': ['Smith', 'Jones', 'Smith New York', 'Jones International', 'Winter'],
        'id': [np.nan, 732, 216, np.nan, 1714],
        'url1': ['url', np.nan, 'url', 'url', 'url'],
        'url2': ['url', 'url', 'url', np.nan, 'url'],
        'text': ['foo', 'bar', np.nan, np.nan, 'foo bar']}
df = pd.DataFrame(data)
The function below parses the website; the user can set the keyword to search already downloaded files and use that stored data if present. If the last crawl happened a while ago, a new crawl of the updated website is needed.
def fetch(id, url, backup='No', **kwargs):
    if backup == 'Yes':
        print('Fetching {} from {}'.format(id, url))
        # Actual fetching code
    else:
        print('Loading stored data for {}'.format(id))
        # Actual loading code
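A minimal runnable stand-in for this function (the fetch/load bodies replaced by placeholder return values; making `backup` an explicit keyword parameter rather than leaving it to `**kwargs` is my assumption) can be tested on a single URL:

```python
def fetch(id, url, backup='No', **kwargs):
    # Placeholder bodies: a real implementation would crawl the site
    # or load the stored file; here we only report what would happen.
    if backup == 'Yes':
        return 'Fetching {} from {}'.format(id, url)
    return 'Loading stored data for {}'.format(id)

single = fetch(216, 'url', backup='Yes')
```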
The function works, as I tested it on individual URLs, but I run into problems when I try to apply it. There are multiple conditions that determine when to run it; currently I use them to subset the dataframe. Note: if both urls are present, url1 is preferred. Following the Pandas documentation, keyword arguments can be passed through apply. Initially I tried np.where. There are 4 conditions in total; below are two:
df['content'] = np.where(df['text'].isna() & df['url1'].notnull() &
                         df['url2'].notnull() & df['firm'].str.contains('Smith'),
                         df['url1'].apply(fetch, args=df['id'], backup='Yes'),
                         np.where(df['text'].isna() & df['url1'].notnull() &
                                  df['url2'].isna() & df['firm'].str.contains('Smith'),
                                  df['url1'].apply(fetch, backup='Yes'),
                                  np.nan))
TypeError: fetch() takes 2 positional arguments but --some other number-- were given
Hence, passing a pandas Series does not work, and I cannot figure out how to pass it as a scalar. Another failed approach, with only two of the columns/series:
df[['id', 'url1']][df['text'].isna() & df['url1'].notnull() &
                   df['url2'].notnull() & df['firm'].str.contains('Smith')].apply(fetch) # Should fetch nothing
df[['id', 'url1']][df['text'].isna() & df['url1'].notnull() &
                   df['url2'].isna() & df['firm'].str.contains('Smith')].apply(fetch) # Should fetch one
TypeError: ("fetch() missing 1 required positional argument: 'url1'", 'occurred at index id')
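This second error comes from DataFrame.apply defaulting to axis=0: each *column* is passed to the function (hence "occurred at index id"), not each row. With axis=1 each row is handed over and its fields can be forwarded as ordinary arguments. A minimal sketch on a toy frame (the names df, mask, hits are mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [216.0, np.nan],
                   'url1': ['url', 'url'],
                   'text': [np.nan, 'bar']})
mask = df['text'].isna() & df['url1'].notnull()

# axis=1 passes each row to the function; without it, apply walks
# the columns, which is what triggered the TypeError above.
hits = df.loc[mask, ['id', 'url1']].apply(
    lambda row: 'would fetch {} from {}'.format(int(row['id']), row['url1']),
    axis=1)
```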
And finally I tried lambda:
df['text'].where(df['text'].isna() & df['url1'].notnull() & df['url2'].isna()
                 & df['firm'].str.contains('Smith'), df[['id', 'url1']].apply(lambda x, y: get_XML(x, y)))
TypeError: ("<lambda>() missing 1 required positional argument: 'y'", 'occurred at index id')
I assume I am missing something simple, but obviously crucial. Any pointers are appreciated.
Edit - Solution
I took the comments from Damien Ayers (see below) to heart and simplified the code. This also put me on the path to the solution:
def get_ft(text, xml, id, url1, url2, firm, backup='Yes'):
    if pd.notnull(id):
        if pd.isna(text) and pd.notnull(url1) and (pd.notnull(url2) or pd.isna(url2)):
            if 'Smith' in firm:
                return fetch(id, url1, backup=backup)
    # ... code continues
And here is the proper use of apply and lambda, thanks to this discussion:
df['text_new'] = df.apply(lambda x: get_ft(x['text'], x['xml'], x['id'], x['url1'],
                                           x['url2'], x['firm'], backup), axis=1)
Much cleaner and, more importantly, it works.
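Pulling the pieces together, a self-contained sketch of this final pattern runs end to end (this is my simplification: get_ft is a hypothetical stand-in that returns a string instead of calling the real fetch, and the xml argument is dropped):

```python
import numpy as np
import pandas as pd

def get_ft(text, id, url1, url2, firm, backup='Yes'):
    # Hypothetical stand-in: the real version calls fetch(); this one
    # only reports which rows would trigger a crawl.
    if pd.notnull(id) and pd.isna(text) and pd.notnull(url1) and 'Smith' in firm:
        return 'fetched {} from {}'.format(int(id), url1)
    return text

df = pd.DataFrame({'firm': ['Smith', 'Jones', 'Smith New York'],
                   'id': [np.nan, 732.0, 216.0],
                   'url1': ['url', np.nan, 'url'],
                   'url2': ['url', 'url', 'url'],
                   'text': ['foo', 'bar', np.nan]})

# One row-wise pass: all conditions live inside the function, no
# subsetting and merging back required.
df['text_new'] = df.apply(
    lambda x: get_ft(x['text'], x['id'], x['url1'], x['url2'], x['firm']),
    axis=1)
```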