
I am after the first word of the text before 'v' and the first word of the text after 'v'.

import pandas as pd

df = pd.DataFrame({'text': ["cans choc v macroni ice", 
                            "chocolate sundaes v chocolate ice cream", 
                            "Chocolate v sauce"]})

I have a dataframe that looks like:

cans choc v macroni ice
chocolate sundaes v chocolate ice cream
Chocolate v sauce

I want it to look like:

cans v macroni
chocolate v chocolate
Chocolate v sauce

How can this be achieved in pandas? The common element is 'v'.

  • Can you please clarify which words you are looking to extract? Your first example conflicts with your second. – Will Dec 07 '17 at 02:53
  • Yes, @Will is correct. You say that you want the ***FIRST*** word after each `v`, which means that the first entry should say `cans v macroni`, instead of `cans v ice`, as you've written. – Mike Williamson Dec 07 '17 at 02:58
  • @MikeWilliamson Actually I need cans v ice. Terrible miscommunication on my part. –  Dec 07 '17 at 03:00
  • @gerrybro Your second example should be "chocolate v cream" if your first is "cans v ice", no? – Will Dec 07 '17 at 03:07

4 Answers


Is there a reason you cannot use the split function and then map the function to the column?

As per the first example, this will work:

def word_scrape(whole_string):
    outside_v = whole_string.split(" v ")
    first_word = outside_v[0].split(" ")[0]
    last_word = outside_v[1].split(" ")[1]
    return first_word + " v " + last_word

# .ix has been removed from pandas; map the function over the column instead
df['text'] = df['text'].map(word_scrape)

For fault tolerance with single-word entries, use:

def word_scrape(whole_string):
    outside_v = whole_string.split(" v ")
    first_word = outside_v[0].split(" ")[0]
    after_v = outside_v[1].split(" ")
    try:
        last_word = after_v[1]
    except IndexError:
        # Only one word follows " v ", so fall back to that word
        last_word = after_v[0]
    return first_word + " v " + last_word

# .ix has been removed from pandas; map the function over the column instead
df['text'] = df['text'].map(word_scrape)

As per the second example, this will work:

def word_scrape(whole_string):
    outside_v = whole_string.split(" v ")
    first_word = outside_v[0].split(" ")[0]
    last_word = outside_v[1].split(" ")[0]
    return first_word + " v " + last_word

# .ix has been removed from pandas; map the function over the column instead
df['text'] = df['text'].map(word_scrape)
Will

You can use regular expressions, as @James suggests. But here's another way, using pandas apply, which more generically handles the question at hand.

(BTW, there are several very similar questions and answers, such as this one.)

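>>> def my_fun(my_text, my_sep):
...     # Split on the separator, then keep the first word of each piece
...     vals = my_text.split(my_sep)
...     vals = [val.split()[0] for val in vals]
...     return vals
...
>>> # Use ' v ' (with spaces) so a 'v' inside a word is not treated as a separator
>>> df.text.apply(lambda my_text: my_fun(my_text, ' v '))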

Of course, please use better names than this! :-)
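The function above returns a list of words for each row. If a single string such as "cans v macroni" is wanted instead, the pieces can be joined back together. Here is a minimal sketch of that idea in plain Python; the name `first_words_around` and its default separator `" v "` are made up for illustration.

```python
def first_words_around(text, sep=" v "):
    """Keep the first word of each piece of `text` around `sep`."""
    pieces = text.split(sep)
    # Skip empty or whitespace-only pieces so split()[0] cannot fail
    firsts = [piece.split()[0] for piece in pieces if piece.strip()]
    return sep.join(firsts)


samples = ["cans choc v macroni ice",
           "chocolate sundaes v chocolate ice cream",
           "Chocolate v sauce"]
results = [first_words_around(s) for s in samples]
print(results)
```

On a dataframe this could be applied with `df['text'].map(first_words_around)`.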

Mike Williamson

You can pass a regular expression to the string operations on the text columns.

df.text.str.extract(r'(\w+ v \w+)', expand=True)

# returns:
                     0
0       choc v macroni
1  sundaes v chocolate
2    Chocolate v sauce
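
If the goal is the first word of the whole string rather than the word directly before 'v', one option is to anchor the pattern at the start of the string and use two capture groups. This is only a sketch under that assumption; the pattern and the names `pattern`, `parts` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'text': ["cans choc v macroni ice",
                            "chocolate sundaes v chocolate ice cream",
                            "Chocolate v sauce"]})

# Group 1: first word of the string.
# Group 2: first word after the standalone "v" (lazy .*? stops at the first one).
pattern = r'^\s*(\w+).*?\sv\s+(\w+)'
parts = df['text'].str.extract(pattern)  # DataFrame with columns 0 and 1

# Stitch the two captured words back together
result = parts[0] + ' v ' + parts[1]
print(result)
```

Rows that do not match the pattern come back as missing values in both columns, so they stay missing in `result`.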
James
  • Why is it reading sundaes v chocolate? I'm wanting chocolate v chocolate? –  Dec 07 '17 at 02:51
  • Ok. How is the computer supposed to distinguish which word you want? In other words, why is `chocolate` important, but `sundaes` is not? In your question, you ask how to get the **first word before and after 'v'**. – James Dec 07 '17 at 02:53
  • I do not need that data. –  Dec 07 '17 at 02:56

Let's try this:

df.text.str.split('v', expand=True)\
  .apply(lambda x: x.str.extract(r'(\w+)', expand=False))\
  .apply(lambda x: ' v '.join(x), axis=1)

Output:

0           cans v macroni
1    chocolate v chocolate
2        Chocolate v sauce
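
Note that when rows split into different numbers of pieces, the shorter rows are padded with missing values, and `' v '.join` raises a TypeError on those. A possible variant, assuming it is acceptable to drop the missing pieces, is sketched below; the third sample row and the names `parts`, `first_words` and `result` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'text': ["cans choc v macroni ice",
                            "Chocolate v sauce",
                            "plain entry"]})  # hypothetical row with no separator

# Split on the spaced separator at most once, so each row gives at most two pieces
parts = df['text'].str.split(' v ', n=1, expand=True)

# First word of each piece; rows without ' v ' have a missing second piece
first_words = parts.apply(lambda col: col.str.extract(r'(\w+)', expand=False))

# Join only the pieces that are present
result = first_words.apply(lambda row: ' v '.join(row.dropna()), axis=1)
print(result)
```

Splitting on `' v '` rather than `'v'` also avoids cutting words that merely contain the letter v.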
Scott Boston
  • TypeError: ('sequence item 2: expected str instance, float found', 'occurred at index 2') - .apply(lambda x: ' v '.join(x), 1) –  Dec 07 '17 at 03:25
  • What does your data look like? You might need to do an astype(str) in one of those lines to force casting to string. Can you generate a data set that produces this error? – Scott Boston Dec 07 '17 at 03:26
  • For: df1['A'].str.split('v', expand=True)\ .apply(lambda x: x.str.extract('(\w+)', expand=False))\ .apply(lambda x: ' v '.join(x), 1) –  Dec 07 '17 at 03:28