I am trying to format a bibliography in a dataframe. Basically, for a column named "Bibliography", I want to extract the titles, sometimes delimited by and sometimes delimited by "

Now when I use

df['Bibliography'].str.extract(r'(?<=&quot;)(.*?)(?=,&quot;)')

It correctly extracts the titles delimited by " (but will produce NaN for titles delimited by )

So I tried applying str.extract over a slice of the data frame using .loc

df.loc[df['Bibliography'].str.contains('&quot;'),'Bibliography']=df.loc[df['Bibliography'].str.contains('&quot;'),'Bibliography'].str.extract(r'(?<=&quot;)(.*?)(?=,&quot;)')

But this results in NaN. I can't figure out why I can't use extract over a slice of the data frame.

kylemaxim
  • Extract has 3 capture groups, resulting in a dataframe with 3 columns. You can't just assign a dataframe to a column. – Quang Hoang Jan 21 '21 at 18:54
  • There are many examples on SO to search for multiple patterns of strings in a column. Can you do a search on Stackoverflow.com – Joe Ferndz Jan 21 '21 at 18:56
  • @QuangHoang Capture groups is not the issue (these are lookaheads, behinds, and the issue doesn't arise when not using .loc) – kylemaxim Jan 21 '21 at 19:00
  • @JoeFerndz Yeah, I know, I'm searching. What I don't understand specifically is why .loc and str.extract don't work – kylemaxim Jan 21 '21 at 19:01
  • It seems another user had this issue too so nvm https://stackoverflow.com/questions/63210734/using-bool-index-on-df-loc-str-extract-returns-unexpected-result – kylemaxim Jan 21 '21 at 19:10
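
For reference, here is a minimal sketch of one possible workaround, based on the comments above. The pattern has a single capture group (the lookbehind and lookahead do not capture), but with the default expand=True, str.extract still returns a DataFrame, with one column labelled 0. Assigning that DataFrame through .loc appears to be what produces the NaN values, since the column label 0 does not match 'Bibliography'. Passing expand=False returns a Series instead, which is assigned by index. The sample rows and the names pattern, mask, and titles are made up for illustration and are not from the original data.

```python
import pandas as pd

# Hypothetical sample data: one title wrapped in &quot; entities, one without.
df = pd.DataFrame({
    "Bibliography": [
        "Smith, J. &quot;A Study of Things,&quot; Journal, 2001.",
        "Doe, A. Another Title. Press, 1999.",
    ]
})

pattern = r"(?<=&quot;)(.*?)(?=,&quot;)"

# Rows that contain the &quot; delimiter (plain substring match, no regex).
mask = df["Bibliography"].str.contains("&quot;", regex=False)

# expand=False makes str.extract return a Series for a single capture group,
# so the assignment below only has to line up on the row index.
titles = df.loc[mask, "Bibliography"].str.extract(pattern, expand=False)
df.loc[mask, "Bibliography"] = titles

print(df)
```

Rows that do not match the mask are left untouched, so a second pattern for the other delimiter could be applied to them in the same way.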

0 Answers