4

I want to be able to select a string between two specific substrings (as the following), but with a loop that will iterate through every row in the dataframe.

CODE:

import pandas as pd

df = pd.DataFrame(['first: hello1 \nSecond this1 is1 a1 third: test1\n', 'first: hello2 \nSecond this2 is2 a2 third: test2\n', 'first: hello3 \nSecond this3 is3 a3 third: test3\n'])
df = df.rename(columns={0: "text"})

def find_between(df, start, end):
  return (df.split(start))[1].split(end)[0]

df2 = df['text'][0]
print(find_between(df3, 'first:', '\nSecond'))

[OUTPUT NEEDED] Dataframe with following information:

   output
0  hello1
1  hello2
2  hello3

the find_between() function is created based on Find string between two substrings, but here you're only able to do that for one specific variable (df2) that is already saved as a string, as the shown example. I need to be able to do that for every row (strings) in the 'df' dataframe.

I would really much appreciate if anyone can help me with this! Thank you!

Patrick Artner
  • 50,409
  • 9
  • 43
  • 69
lalatei
  • 57
  • 1
  • 8

1 Answers1

3

Why define a function? You can just use str.extract:

start = 'first'
end = '\nSecond'

df.text.str.extract(r'(?<={})(.*?)(?={})'.format(start, end), expand=False)

0    : hello1 
1    : hello2 
2    : hello3 
Name: text, dtype: object

Details

(?<=   # lookbehind
first
)
(      # capture-group
.*?    # non-greedy match
)
(?=    # lookahead
\nSecond
)

Everything in between the lookbehind and the lookahead are captured.


You can call str.split a few times, but this isn't nearly as elegant:

df.text.str.split(start).str[1].str.split(end).str[0]

0    : hello1 
1    : hello2 
2    : hello3 
Name: text, dtype: object
cs95
  • 379,657
  • 97
  • 704
  • 746
  • Don't know about performance, but I would say the `split` approach is drastically *more* elegant than using a regular expression. It probably boils down to pure opinion though. – ely Mar 06 '18 at 18:38