
Apologies for the messy title; the problem is as follows:

I have some data frame of the form:

df1 = 
    Entries
0    "A Level"
1    "GCSE"
2    "AS Level"

I also have a data frame of the form:

df2 = 
    Secondary    Undergrad
0    "A Level"    "BSC"
1    "GCSE"       "BA"
2    "AS Level"   "MSc"

I have a function which searches each entry in df1, looking for the words in each column of df2. The words that match are saved in Words_Present:

import re

def word_search(df, group, words):
    # Returns 1 if any of the words is found in the entry, else 0
    Yes, No = 0, 0
    Words_Present = []
    for i in words:
        match_object = re.search(i, df)
        if match_object:
            Words_Present.append(i)
            Yes = 1
        else:
            No = 0
    if Yes == 1:
        Attribute = 1
    if Yes == 0:
        Attribute = 0

    return Attribute

I apply this function over all entries in df1, and all columns in df2, using the following iteration:

for i in df2:

    terms = df2[i].values.tolist()

    df1[i] = df1['course'][0:1].apply(lambda x: word_search(x,i,terms))

This yields an output df which looks something like:

df1 = 
     Entries      Secondary    Undergrad
0    "A Level"    1            0
1    "GCSE"       1            0
2    "AS Level"   1            0

I want to amend the word_search function to output the Words_Present list as well as the Attribute, and put these into a new column, so that my eventual df1 dataframe looks like:

Desired dataframe:

     Entries      Secondary    Words Found    Undergrad    Words Found
0    "A Level"    1            "A Level"      0
1    "GCSE"       1            "GCSE"         0
2    "AS Level"   1            "AS Level"     0

If I do:

def word_search(df, group, words):
    Yes, No = 0, 0
    Words_Present = []
    for i in words:
        match_object = re.search(i, df)
        if match_object:
            Words_Present.append(i)
            Yes = 1
        else:
            No = 0
    if Yes == 1:
        Attribute = 1
    if Yes == 0:
        Attribute = 0

    return Attribute, Words_Present

My function therefore now has multiple outputs. So applying the following:

for i in df2:

    terms = df2[i].values.tolist()

    df1[i] = df1['course'][0:1].apply(lambda x: word_search(x,i,terms))

My output then looks like this:

     Entries      Secondary          Undergrad
0    "A Level"    [1, "A Level"]     0
1    "GCSE"       [1, "GCSE"]        0
2    "AS Level"   [1, "AS Level"]    0

The output of .apply() here is a pandas Series, so it just shoves both return values into the single cell of df1[i], where i = Secondary.

Is it possible to split the output of .apply into two separate columns, as shown in the desired dataframe?

I have consulted many questions, but none seem to deal directly with yielding multiple columns when the function contained within the apply statement has multiple outputs:

Applying function with multiple arguments to create a new pandas column

Create multiple columns in Pandas Dataframe from one function

Apply pandas function to column to create multiple new columns?

For example, I have also tried:

for i in df2:

    terms = df2[i].values.tolist()

    [df1[i],df1[i]+"Present"] = pd.concat([df1['course'][0:1].apply(lambda x: word_search(x,i,terms))])

but this simply yields errors such as:

ValueError: Length of values does not match length of index

Is there a way to use apply, but still extract the extra information directly into multiple columns?

Many thanks, apologies for the length.

Chuck
  • I've answered your question, but I'd like to make a side remark about expressions like `df1['course'][0:1]`: This is called chained indexing, which should be avoided in pandas because it's slow, and often leads to problems when you want to change an element in the dataframe. Instead, `df1.loc[0:1,'course']` is the correct way to refer to the same thing. – Ken Wei Dec 02 '16 at 16:01
  • (Also, to nitpick, your `word_search()` function will return an error if no matching word is found, because `Attribute` is not assigned unless `Yes == 1`!) – Ken Wei Dec 02 '16 at 16:04
  • Your first point: Thank you, I did not know, and was writing this code based on a file I inherited. Your second point: I made a mistake in copying my code: I had meant to include it in both branches of the if statement. I missed it in the rush to get this up. Thank you for pointing both out. [Both are now updated] – Chuck Dec 02 '16 at 16:06

1 Answer


The direct answer to your question is yes: use the apply method of the DataFrame object, so you'd be doing df1.apply().
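
For instance, here is one rough sketch that stays close to your original loop (assuming pandas is imported as pd, the text column is named Entries as in your sample df1, and word_search returns (Attribute, Words_Present) as above; the '_words_found' column names are just illustrative): if you wrap the function's result in a pd.Series, pandas expands it into separate columns, which you can then join back onto df1:

for i in df2:
    terms = df2[i].values.tolist()
    # Wrapping the (Attribute, Words_Present) tuple in a pd.Series turns each
    # element of the tuple into its own column of the result
    result = df1['Entries'].apply(lambda x: pd.Series(word_search(x, i, terms),
                                                      index=[i, i + '_words_found']))
    df1 = df1.join(result)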

However, for this problem, and anything in pandas in general, try to vectorise rather than iterate through rows -- it's faster and cleaner.

It looks like you are trying to classify Entries into Secondary or Undergrad, and to save the keyword used to make the match. If you assume that each element of Entries has no more than one keyword match (i.e. you won't run into 'GCSE A Level'), you can do the following:

df = df1.copy()
# expand=False keeps each single-group extract as a Series rather than a DataFrame
df['secondary_words_found'] = df.Entries.str.extract('(A Level|GCSE|AS Level)', expand=False)
df['undergrad_words_found'] = df.Entries.str.extract('(BSC|BA|MSc)', expand=False)
df['secondary'] = df.secondary_words_found.notnull() * 1
df['undergrad'] = df.undergrad_words_found.notnull() * 1
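
With the sample df1 above (where every entry is a secondary qualification), this gives roughly:

     Entries      secondary_words_found    undergrad_words_found    secondary    undergrad
0    "A Level"    "A Level"                NaN                      1            0
1    "GCSE"       "GCSE"                   NaN                      1            0
2    "AS Level"   "AS Level"               NaN                      1            0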

EDIT: In response to your issue with having many more categories and keywords, you can continue in the spirit of this solution by using an appropriate for loop and doing '(' + '|'.join(df2['Undergrad'].values) + ')' inside the extract method.
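
A rough sketch of that loop (this assumes every keyword in df2 is either a plain string or a valid regular expression, and reuses the lowercase column naming from the snippet above):

for category in df2.columns:
    # Build one alternation pattern per category, e.g. '(BSC|BA|MSc)'
    pattern = '(' + '|'.join(df2[category].dropna().values) + ')'
    df[category.lower() + '_words_found'] = df.Entries.str.extract(pattern, expand=False)
    df[category.lower()] = df[category.lower() + '_words_found'].notnull() * 1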

However, if you have exact matches, you can do everything by a combination of pivots and joins:

keywords = df2.stack().to_frame('Entries').reset_index().drop('level_0', axis = 1).rename(columns={'level_1':'category'})
df = df1.merge(keywords, how = 'left')
for colname in keywords.category.unique():
    df[colname] = (df.category == colname) * 1 # Your indicator variable
    df.loc[df.category == colname, colname + '_words_found'] = df.loc[df.category == colname, 'Entries']

The first line 'pivots' your table of keywords into a 2-column dataframe of keywords and categories. The keyword column must have the same name as the matching column in df1 (here, Entries); in SQL, this would be called the foreign key that you are going to join these tables on.
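
For reference, with the sample df2 above, keywords would look roughly like this (all keywords stacked under a single Entries column, each labelled with its original df2 column as category):

     category     Entries
0    Secondary    "A Level"
1    Undergrad    "BSC"
2    Secondary    "GCSE"
3    Undergrad    "BA"
4    Secondary    "AS Level"
5    Undergrad    "MSc"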

Also, you generally want to avoid having duplicate indexes or column names, which in your case was the repeated Words Found column in the desired dataframe!

For the sake of completeness, if you insisted on using the apply method, you would iterate over each row of the DataFrame; your code would look something like this:

secondary_words = df2.Secondary.values
undergrad_words = df2.Undergrad.values

def classify_row(s):  # the function name here is just a placeholder
    if s.Entries in secondary_words:
        return pd.Series({'Entries':s.Entries, 'Secondary':1, 'secondary_words_found':s.Entries, 'Undergrad':0, 'undergrad_words_found':''})
    elif s.Entries in undergrad_words:
        return pd.Series({'Entries':s.Entries, 'Secondary':0, 'secondary_words_found':'', 'Undergrad':1, 'undergrad_words_found':s.Entries})
    else:
        return pd.Series({'Entries':s.Entries, 'Secondary':0, 'secondary_words_found':'', 'Undergrad':0, 'undergrad_words_found':''})
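
You would then apply it row by row, along the lines of (using the classify_row name from the sketch above):

df = df1.apply(classify_row, axis=1)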

This second version will only work as you want it to if each element in Entries is exactly the same as its corresponding element in df2. You certainly don't want to do this, as it's messier, and will be noticeably slower if you have a lot of data to work with.

Ken Wei
  • Ken, you are correct in your deduction of "It looks like you are trying to classify Entries into Secondary or Undergrad". That is entirely correct. The number of categories is, however, much larger (Secondary, Undergrad, Postgrad... etc.), and even more so the number of associated words for each category. That was my reason for using the for loop. There actually may be a case where there are multiple keywords registered - I want to save all of these words in a list of strings for each entry in df1. How might this change what you have written? **Thank you so much for your help :)** – Chuck Dec 02 '16 at 15:59
  • In fact, there are so many entries that I will have to do it automatically, as I do not have time to write out `...str.extract('(A Level|GCSE|AS Level)')` for each of the texts in each category. Though I should be able to make this more automated and include some of the re module to make this work? – Chuck Dec 02 '16 at 16:11
  • Do you expect to get exact matches? i.e. the entries are exactly equal to 'A level' or 'GCSE', etc., and so cases like, say, 'GCSE-O' would not have a match (assuming 'GCSE-O' is not in your table of names)? In that case, there's an even cleaner way of doing things. – Ken Wei Dec 02 '16 at 16:15
  • I will develop a range of regular expressions for each category - a lot of these will just be the full word, e.g. `List = ["A\sLevel", "A", "Level", "GCSE", "General\sCertificate\sof\sEducation"...]`. However, some will take the form of a regular expression, e.g. to catch all the different varieties of the word A level. My desired dataframe (as shown above) is actually over 30000 rows. I want to save everything as I do it (due to changes I may make later). **I am absolutely open to suggestions!** – Chuck Dec 02 '16 at 16:28
  • And yes, hopefully exact matches, but not always. Some data cleaning has been necessary. – Chuck Dec 02 '16 at 16:31
  • I've edited my answer. You should start by cleaning the data: do `df1.Entries.unique()` to keep track of all the possibilities, then use a dictionary to define your replacements, e.g. `repl_dict = {'A-Level':'A Level', 'ALevel': 'A Level'}` will replace any occurrence of 'A-Level' or 'ALevel' with 'A Level'. Then do `df1.replace({'Entries':repl_dict}, inplace = True)`. – Ken Wei Dec 02 '16 at 16:43
  • A pragmatic alternative is to move everything into the same form: e.g. start with `df['Entries'] = df.Entries.str.lower().str.replace('-', '').str.replace(' ', '')`, which, for example, turns 'A-level', 'A level' and 'Alevel' all into 'alevel'. Then just make one replacement of 'alevel' with 'A Level' and you're done. (I'd recommend this--try removing all non-alphabetic characters, and see how many unique entries there are. That is the number of replacements you need to type out, which will most likely be fewer than explicitly identifying every variation.) – Ken Wei Dec 02 '16 at 17:10
  • Thank you Ken, I do currently perform some initial data cleaning. While I did intend to do that conversion (or at least create some general regex expressions to catch all variations for each case of A level and similar items) I really like your idea of doing it beforehand on the original data set - getting it all into the same format. This will act somewhat like the function of stemming. Thank you for your suggestion. I will try and implement some of your code suggestions and get back to you. – Chuck Dec 03 '16 at 21:42
  • Hi Ken. Unfortunately it still won't work. I have posted my problem in a new question. – Chuck Dec 05 '16 at 17:17
  • http://stackoverflow.com/questions/40980096/why-do-i-get-an-assertionerror-when-using-str-findall-in-pandas-python – Chuck Dec 05 '16 at 17:38