
I have scraped data from Wikipedia and created a dataframe. df[0] contains:

\n \n == Sifat-sifat DNA == \n  DNA merupakan sebuah polimer yang terdiri dari satuan-satuan berulang yang disebut nukleotida.     Tiap-tiap nukleotida terdiri dari tiga komponen utama, yakni gugus fungsionalgugus fosfat, gula deoksiribosa, dan basa nitrogen (nukleobasa) < ref > {{en}}{{cite web \n  url          = http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=mboc4 & part=A2 \n  title        = All Cells Replicate Their Hereditary Information by Templated Polymerization \n  accessdate   = 2010-03-19 \n  work         = Bruce Alberts, et al. \n }} < /ref > . Pada DNA, nukleobasa yang ditemukan adalah Adenina (A), Guanina (G), Sitosina (C) dan Timina (T).

I want to remove:

< ref > {{en}}{{cite web \n  url          = http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=mboc4 & part=A2 \n  title        = All Cells Replicate Their Hereditary Information by Templated Polymerization \n  accessdate   = 2010-03-19 \n  work         = Bruce Alberts, et al. \n }} < /ref > 

I need a way to replace (or just delete) any text between "< ref >" and " < /ref >" so that when I call it, df[0] equals:

\n \n == Sifat-sifat DNA == \n  DNA merupakan sebuah polimer yang terdiri dari satuan-satuan berulang yang disebut nukleotida.     Tiap-tiap nukleotida terdiri dari tiga komponen utama, yakni gugus fungsionalgugus fosfat, gula deoksiribosa, dan basa nitrogen (nukleobasa). Pada DNA, nukleobasa yang ditemukan adalah Adenina (A), Guanina (G), Sitosina (C) dan Timina (T).

I have tried:

df['Body'] = df['Body'].str.replace('< ref >.*?< /ref >','',regex=True)
df['Body'] = df['Body'].str.replace('< ref >.*< \/ref >','',regex=True)

but the output is still unchanged, like this:

\n \n == Sifat-sifat DNA == \n  DNA merupakan sebuah polimer yang terdiri dari satuan-satuan berulang yang disebut nukleotida.     Tiap-tiap nukleotida terdiri dari tiga komponen utama, yakni gugus fungsionalgugus fosfat, gula deoksiribosa, dan basa nitrogen (nukleobasa) < ref > {{en}}{{cite web \n  url          = http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=mboc4 & part=A2 \n  title        = All Cells Replicate Their Hereditary Information by Templated Polymerization \n  accessdate   = 2010-03-19 \n  work         = Bruce Alberts, et al. \n }} < /ref > . Pada DNA, nukleobasa yang ditemukan adalah Adenina (A), Guanina (G), Sitosina (C) dan Timina (T).

What I need is the result I described above. I can't find any wildcards that seem to work. Any help is much appreciated.

ohai
  • @PacketLoss I have tried that too, but the result is the same as I described above – ohai Feb 06 '20 at 23:51
  • I suggest using http://regex101.com to test your regex to see how it works. – Code-Apprentice Feb 06 '20 at 23:56
  • Your regex appears to be correct: https://regex101.com/r/X5ydjA/1. Maybe you are not using the `replace()` function correctly? – Code-Apprentice Feb 06 '20 at 23:58
  • @Code-Apprentice Isn't this code correct: `df['Body'] = df['Body'].str.replace('< ref >.*< \/ref >','',regex=True)`? – ohai Feb 07 '20 at 00:00
  • Why is that data in a DataFrame? How did you scrape it? Why are you seemingly using regex to parse HTML? We're missing some important information, I think. Doesn't wikipedia have APIs, too? – AMC Feb 07 '20 at 01:42
  • Also, please share a [mcve]. – AMC Feb 07 '20 at 01:42

1 Answer


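The problem is that in Python regex the dot does not match newlines by default, so `.*` cannot get past the line breaks inside the reference. What we can do instead is use `[\s\S]`, which matches any character including newlines, to match everything up to the closing ref: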

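# raw string avoids invalid-escape warnings; *? is non-greedy so each reference is removed separately
df['Body'] = df['Body'].str.replace(r'< ref >[\s\S]*?< /ref >', '', regex=True)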

I got the idea for the regex from here: matching any character including newlines in a Python regex subexpression, not globally
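
To make the difference concrete, here is a minimal sketch using only the standard `re` module. The sample string, the pattern names and the `strip_refs` helper are made up for illustration, and the sketch assumes the scraped text contains real newline characters rather than a literal backslash followed by `n`. It compares a plain dot, the `[\s\S]` class, and the `re.DOTALL` flag:

```python
import re

# Non-greedy (*?) so that each "< ref > ... < /ref >" pair is removed on its own.
REF_PLAIN = re.compile(r"< ref >.*?< /ref >")                    # dot stops at newlines
REF_CLASS = re.compile(r"< ref >[\s\S]*?< /ref >")               # class matches newlines too
REF_DOTALL = re.compile(r"< ref >.*?< /ref >", flags=re.DOTALL)  # dot matches newlines

# Hypothetical sample: two references, each containing real newline characters.
sample = (
    "basa nitrogen < ref > {{cite web \n url = http://example.org \n }} < /ref >"
    " . Pada DNA < ref > second \n note < /ref > selesai"
)


def strip_refs(text, pattern=REF_DOTALL):
    """Return text with every reference block matched by pattern removed."""
    return pattern.sub("", text)


# The plain pattern finds nothing, because every reference spans a newline.
print(REF_PLAIN.sub("", sample) == sample)
# Both newline-aware variants remove the two references and keep the text between them.
print(repr(strip_refs(sample, REF_CLASS)))
print(repr(strip_refs(sample, REF_DOTALL)))
```

With pandas, the same patterns can be passed to `str.replace` with `regex=True`, and `str.replace` also accepts a `flags` argument, so `flags=re.DOTALL` with the original `.*?` pattern should work as well. A greedy `[\s\S]*` would instead delete everything from the first `< ref >` to the last `< /ref >` in a cell, including the ordinary text between two references.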