I am trying to extract everything between the four-digit year and the word "vehicles".
I tested the regex below on RegExr, and it highlighted exactly the phrase I wanted to extract (Civic Type R in the sample sentence below):
(?<=[1-2][0-9]{3}\s)(.+)(?=\svehicles)
"2023 Civic Type R vehicles. The driver's seat frame is wrong."
So I used the code below to extract it into a new column:
df['newcol'] = df['colA'].str.extract(r'(?<=[1-2][0-9]{3}\s)(.+)(?=\svehicles)', expand=False)
However, this gives me the full sentence in the new column instead of just Civic Type R. What am I doing wrong, and why do RegExr and JupyterLab produce different outputs?
Update:
I found that the problem occurs because there is another instance of "vehicles" further down the sentence, which I wasn't aware of.
How can I modify my regex so that it only captures up to the first instance of the word?
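Here is a minimal reproduction of what I believe is happening. It uses plain re instead of pandas (as far as I know, str.extract uses Python's re module underneath), and the second sentence of the sample string is made up for illustration so that it contains another "vehicles"; my real data is different.

```python
import re

# Same pattern as in my str.extract call above.
pattern = r'(?<=[1-2][0-9]{3}\s)(.+)(?=\svehicles)'

# Hypothetical sample: the second sentence is invented so that
# the word "vehicles" appears twice.
text = "2023 Civic Type R vehicles. The seat frame in these vehicles is wrong."

match = re.search(pattern, text)
captured = match.group(1)

# The capture runs all the way to the LAST " vehicles", not the first.
print(captured)  # Civic Type R vehicles. The seat frame in these
```

So the capture stops at the last "vehicles" rather than the first, which would explain why RegExr looked fine with a sample that only contained the word once.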
Thanks