Problem definition

The goal is to strip the HTML tags from each row and save the cleaned text back in the dataframe.

The dataframe is defined as:

import pandas as pd
test = pd.DataFrame(data=["<p> test 1 </p>", "<p> random text </p>"], columns=["text"])

I already found this elegant answer [1] that solves the problem. However, out of curiosity, I want to achieve the same result using a for loop.

Solution with list comprehension:

from bs4 import BeautifulSoup
test['text'] = [BeautifulSoup(text, "lxml").get_text() for text in test['text']]

Attempts with a for loop, built up step by step:

First attempt:

This code has the variable text iterate over every element of the column test['text'] and prints each one. So far so good.

for text in test['text']:
    print(text)

Second attempt:

This code does the same thing, but prints each text with its HTML tags stripped.

for text in test['text']:
    soup = BeautifulSoup(text, "lxml")
    print(soup.get_text())

Third attempt:

Why is the result of this code a dataframe whose values are all "random text"?

test = pd.DataFrame(data=["<p> test 1 </p>", "<p> random text </p>"], columns=["text"])

for text in test['text']:
    soup = BeautifulSoup(text, "lxml")
    test["text"] = soup.get_text()

In the first iteration, the loop variable text takes the first element of the column, which is "<p> test 1 </p>". The code turns it into a soup and assigns the stripped text to the column "text" of the dataframe test. The same thing should happen in the second iteration. Yet all that happens is that the value from the last iteration is broadcast over the whole column.

I think my last line of code actually broadcasts the same value to all rows of the dataframe. But how do I modify only the row that the variable text comes from in a given iteration?

The whole post might look weird but I was thinking and testing while writing the post. I might find the solution myself and update the post. But I might stay stuck and need another perspective. Thank you for your time.

[1]: Pandas: Trouble Stripping HTML Tags From DataFrame Column

nid
  • [What do you have against list comprehensions?](https://stackoverflow.com/questions/54028199/for-loops-with-pandas-when-should-i-care) – cs95 Feb 04 '19 at 20:59
  • Nothing. I'm a beginner who is trying to better understand for loops. If I ever encounter a similar situation where a list comprehension would be too messy, I could more easily write a for loop. Thanks for the link, great read. – nid Feb 04 '19 at 21:02
  • The reason all the values are "random text" in your third attempt is that you are assigning one string to the whole column: `test["text"] = soup.get_text()` is essentially `test["text"] = 'test 1'`, and on the second pass through the loop it is `test["text"] = 'random text'`. You are assigning one value to the whole column over and over again, so the last value ends up in the whole column. You would need to append your data to an empty list. – It_is_Chris Feb 04 '19 at 21:09
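
The list-based approach suggested in the last comment could look roughly like the sketch below. It is only an illustration: the name `cleaned` is made up here, and the built-in "html.parser" is assumed in place of "lxml" so that no extra parser has to be installed.

```python
import pandas as pd
from bs4 import BeautifulSoup

test = pd.DataFrame(data=["<p> test 1 </p>", "<p> random text </p>"], columns=["text"])

# Collect one cleaned string per row instead of assigning inside the loop.
cleaned = []  # hypothetical helper list, not part of the original question
for text in test["text"]:
    # "html.parser" is an assumption; the question itself uses "lxml".
    soup = BeautifulSoup(text, "html.parser")
    cleaned.append(soup.get_text())

# Assign the whole list once: it has one entry per row, so nothing is broadcast.
test["text"] = cleaned
print(test)
```

Assigning a list of the same length as the dataframe fills the column row by row, whereas assigning a single string fills every row with that string.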

1 Answer

You can use regular expressions in order to remove the tags.

import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)
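
To tie this back to the for loop from the question, here is a minimal sketch of applying `remove_tags` one row at a time. It assumes the `test` dataframe from the question and writes each result to a single cell with `DataFrame.at`, which is one possible way to avoid overwriting the whole column.

```python
import re
import pandas as pd

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    # Replace every <...> tag with an empty string.
    return TAG_RE.sub('', text)

test = pd.DataFrame(data=["<p> test 1 </p>", "<p> random text </p>"], columns=["text"])

# Loop over the index labels and update exactly one cell per iteration.
for idx in test.index:
    test.at[idx, "text"] = remove_tags(test.at[idx, "text"])

print(test)
```

Note that the regular expression only removes the tags themselves, so the spaces around the text remain; calling `.strip()` on the result would remove them if that is wanted.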
Igor Dragushhak