How to get very first word before and after 'text'

Question

I am after the very first word before 'v' and the first word after 'v'.

import pandas as pd

df = pd.DataFrame({'text': ["cans choc v macroni ice",
                            "chocolate sundaes v chocolate ice cream",
                            "Chocolate v sauce"]})

I have a dataframe that looks like:

cans choc v macroni ice
chocolate sundaes v chocolate ice cream
Chocolate v sauce

I want it to look like:

cans v macroni
chocolate v chocolate
Chocolate v sauce

How can this be achieved in pandas? The common element is 'v'.

Can you please clarify which words you are looking to extract? Your first example conflicts with your second. — Will, Dec 07 '17 at 02:53
Yes, @Will is correct. You say that you want the ***FIRST*** word after each `v`, which means that the first entry should say `cans v macroni`, instead of `cans v ice`, as you've written. — Mike Williamson, Dec 07 '17 at 02:58
@MikeWilliamson Actually I need cans v ice. Terrible miscommunication on my part. — , Dec 07 '17 at 03:00
@gerrybro Your second example should be "chocolate v cream" if your first is "cans v ice", no? — Will, Dec 07 '17 at 03:07

Will · Answer 1 · 2017-12-07T03:09:13.977

Is there a reason you cannot use the split function and then map the function to the column?

As per the first example, this will work:

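def word_scrape(whole_string):
    # Split on the standalone " v " separator
    outside_v = whole_string.split(" v ")
    # First word of the text before "v"
    first_word = outside_v[0].split(" ")[0]
    # Second word after "v" (index 1); raises IndexError if only one word follows
    last_word = outside_v[1].split(" ")[1]
    return first_word + " v " + last_word

# df.ix was removed in pandas 1.0, so map the function over the column instead
df['text'] = df['text'].apply(word_scrape)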

For fault tolerance with single-word entries, use:

def word_scrape(whole_string):
    outside_v = whole_string.split(" v ")
    first_word = outside_v[0].split(" ")[0]
    words_after = outside_v[1].split(" ")
    try:
        # Second word after "v", as in the version above
        last_word = words_after[1]
    except IndexError:
        # Only one word follows "v", so fall back to it
        last_word = words_after[0]
    return first_word + " v " + last_word

# df.ix was removed in pandas 1.0, so map the function over the column instead
df['text'] = df['text'].apply(word_scrape)

As per the second example, this will work:

def word_scrape(whole_string):
    outside_v = whole_string.split(" v ")
    # First word on each side of "v"
    first_word = outside_v[0].split(" ")[0]
    last_word = outside_v[1].split(" ")[0]
    return first_word + " v " + last_word

# df.ix was removed in pandas 1.0, so map the function over the column instead
df['text'] = df['text'].apply(word_scrape)
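Since the thread goes back and forth between "first word after v" and "last word after v", here is a minimal sketch that folds both variants into one function. The `after_index` and `sep` parameters are my own additions for illustration, not part of the answer above, and returning the string unchanged when there is no separator is an assumption about what you would want.

```python
import pandas as pd

def word_scrape(whole_string, after_index=0, sep=" v "):
    """Return '<first word before sep> v <chosen word after sep>'.

    after_index (hypothetical parameter) picks the word after the
    separator: 0 for the first word, -1 for the last word.
    Strings without the separator are returned unchanged (an assumption).
    """
    before, found, after = whole_string.partition(sep)
    before_words = before.split()
    after_words = after.split()
    if not found or not before_words or not after_words:
        return whole_string
    return before_words[0] + " v " + after_words[after_index]

df = pd.DataFrame({'text': ["cans choc v macroni ice",
                            "chocolate sundaes v chocolate ice cream",
                            "Chocolate v sauce"]})

# First word after "v": cans v macroni, chocolate v chocolate, Chocolate v sauce
first_after = df['text'].apply(word_scrape)

# Last word after "v": cans v ice, chocolate v cream, Chocolate v sauce
last_after = df['text'].apply(lambda s: word_scrape(s, after_index=-1))
```

Using 0 or -1 keeps single-word entries such as "Chocolate v sauce" safe without a try/except; other index values could still raise IndexError.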

Does this not give you "cans v ice"? (which FYI are the first and last words, not both the first words) — Will, Dec 07 '17 at 03:06

score 1 · Answer 2 · answered Dec 07 '17 at 02:54

You can use regular expressions, as @James suggests. But here's another way, using pandas apply, which more generically handles the question at hand.

(BTW, there are several very similar questions and answers, such as this one.)

>>> def my_fun(my_text, my_sep):
...     vals = my_text.split(my_sep)
...     vals = [val.split()[0] for val in vals]
...     return vals
...
>>> df.text.apply(lambda my_text: my_fun(my_text, 'v'))

Of course, please use better names than this! :-)
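Note that the function above returns a list per row rather than a string, and splitting on a bare 'v' would also split inside any word that contains the letter v (for example "vanilla"). A minimal sketch of the same idea that splits on " v " with surrounding spaces and joins the pieces back together follows; `first_words` is a name I made up for illustration.

```python
import pandas as pd

def first_words(text, sep=" v "):
    # Split on the separator, then keep the first word of each non-empty piece
    pieces = text.split(sep)
    return [piece.split()[0] for piece in pieces if piece.strip()]

df = pd.DataFrame({'text': ["cans choc v macroni ice",
                            "chocolate sundaes v chocolate ice cream",
                            "Chocolate v sauce"]})

# Join the first words back into a single "a v b" string per row
joined = df['text'].apply(lambda text: " v ".join(first_words(text)))
```

On the sample frame this should give cans v macroni, chocolate v chocolate and Chocolate v sauce.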

score 0 · Answer 3 · answered Dec 07 '17 at 02:49

You can pass a regular expression to the pandas string methods on the text column.

df.text.str.extract(r'(\w+ v \w+)', expand=True)

# returns:
                     0
0       choc v macroni
1  sundaes v chocolate
2    Chocolate v sauce
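If what is wanted is the first word of the whole string rather than the word immediately before 'v', one possible variant is to anchor the pattern at the start and use two capture groups. This is only a sketch under that assumption; the pattern and the `parts` and `result` names are mine.

```python
import pandas as pd

df = pd.DataFrame({'text': ["cans choc v macroni ice",
                            "chocolate sundaes v chocolate ice cream",
                            "Chocolate v sauce"]})

# Group 1: first word of the string.
# Group 2: first word after a standalone "v" (whitespace on both sides).
pattern = r'^\s*(\w+).*?\sv\s+(\w+)'
parts = df['text'].str.extract(pattern, expand=True)

# Rows that do not match the pattern come back as NaN
result = parts[0] + ' v ' + parts[1]
```

On the sample frame this should produce cans v macroni, chocolate v chocolate and Chocolate v sauce.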

answered Dec 07 '17 at 02:49

James

Why is it reading sundaes v chocolate? I'm wanting chocolate v chocolate? – Dec 07 '17 at 02:51
Ok. How is the computer supposed to distinguish which word you want? In other words, why is `chocolate` important, but `sundaes` is not? In your question, you ask how to get the **first word before and after 'v'**. – James Dec 07 '17 at 02:53
I do not need that data. – Dec 07 '17 at 02:56

score 0 · Accepted Answer · answered Dec 07 '17 at 03:08

Let's try this:

df.text.str.split('v', expand=True)\
  .apply(lambda x: x.str.extract(r'(\w+)', expand=False))\
  .apply(lambda x: ' v '.join(x), axis=1)

Output:

0           cans v macroni
1    chocolate v chocolate
2        Chocolate v sauce

answered Dec 07 '17 at 03:08

Scott Boston

TypeError: ('sequence item 2: expected str instance, float found', 'occurred at index 2') - .apply(lambda x: ' v '.join(x), 1) – Dec 07 '17 at 03:25
What does your data look like? You might need to do an astype(str) in one of those lines to force casting to string. Can you generate a data set that produces this error? – Scott Boston Dec 07 '17 at 03:26
For: df1['A'].str.split('v', expand=True)\ .apply(lambda x: x.str.extract('(\w+)', expand=False))\ .apply(lambda x: ' v '.join(x), 1) – Dec 07 '17 at 03:28
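One plausible cause of that TypeError (a guess, since the failing data was never posted) is that rows split into different numbers of pieces, so the shorter rows are padded with missing values, which `' v '.join` rejects because NaN is a float. A minimal sketch that limits the split to the first " v " and only joins rows where both sides exist is below. The `df1` contents are invented for illustration, and it assumes at least one row contains the separator.

```python
import pandas as pd

# Invented data: one normal row, one row with two separators, one with none
df1 = pd.DataFrame({'A': ["cans choc v macroni ice",
                          "vanilla v chocolate v mint",
                          "no separator"]})

# n=1 splits on the first " v " only, so there are never more than two columns
parts = df1['A'].str.split(' v ', n=1, expand=True)

# First word of each side; missing sides stay NaN
first_words = parts.apply(lambda col: col.str.extract(r'(\w+)', expand=False))

# str.cat yields NaN when either side is missing, so no float reaches a join
joined = first_words[0].str.cat(first_words[1], sep=' v ')

# Keep the original text for rows that had no separator
both = first_words.notna().all(axis=1)
result = df1['A'].where(~both, joined)
```

With this made-up data, the rows become cans v macroni, vanilla v chocolate and no separator.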