
I wrote some code to extract only the text inside the <p> tags of some HTML. Here is my code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(my_string, 'html')
no_tags = ' '.join(el.string for el in soup.find_all('p', text=True))

It works how I want it to for most of the examples it is run on, but I have noticed that in examples such as

<p>hello, how are you <code>other code</code> my name is joe</p>

it returns nothing. I suppose this is because there are other tags within the <p> tags. So just to be clear, what I would want it to return is

hello, how are you my name is joe

Can someone help me with how to handle such examples?

user1893354

1 Answer


Your guess is correct. According to the BeautifulSoup documentation, .string returns None when a tag has more than one child, which is the case in your example.

Now, you have a few options. First is to use .contents and recursively iterate over it, checking the value of .string on each of its visited children.
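A minimal sketch of that recursive walk, assuming the sample `<p>` from the question and the built-in `html.parser`; the helper name `collect_text` is hypothetical and not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def collect_text(node):
    """Hypothetical helper: gather every text fragment under node, in document order."""
    pieces = []
    for child in node.contents:
        if isinstance(child, NavigableString):
            # A plain text node: keep it as an ordinary str.
            pieces.append(str(child))
        elif isinstance(child, Tag):
            # A nested tag such as <code>: descend into it.
            pieces.extend(collect_text(child))
    return pieces


html = "<p>hello, how are you <code>other code</code> my name is joe</p>"
soup = BeautifulSoup(html, "html.parser")
fragments = collect_text(soup.p)
```

Note that this walk also picks up the text inside the nested `<code>` tag, so it only helps if you want all of the text under the `<p>`.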

This approach can be a hassle in the long run. Fortunately, BeautifulSoup 4 offers a generator called .strings, which yields every string inside a tag, including those in nested tags, without any manual recursion.
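As a rough illustration, here is how `.stripped_strings` (the whitespace-trimming sibling of `.strings`) might be used on the sample from the question. The variable names are only for illustration, and the result includes the text of nested tags:

```python
from bs4 import BeautifulSoup

html = "<p>hello, how are you <code>other code</code> my name is joe</p>"
soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields each text fragment under the tag with
# surrounding whitespace removed, descending into nested tags too.
all_text = " ".join(soup.p.stripped_strings)
```

Here `all_text` comes out as `hello, how are you other code my name is joe`, so the contents of `<code>` are kept.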

Finally, if you know the markup is going to be simple and you want an easy solution, you can also use regular expressions and replace every match of /<[^>]*>/ with an empty string. Be aware of the consequences, however: regular expressions break easily on anything beyond simple markup.
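A small sketch of the regex route using only the standard library, assuming markup as simple as the sample in the question. It removes the tags themselves but keeps whatever text was between them:

```python
import re

html = "<p>hello, how are you <code>other code</code> my name is joe</p>"

# Delete anything that looks like a tag: "<", then any run of
# characters other than ">", then ">".
stripped = re.sub(r"<[^>]*>", "", html)
```

This gives `hello, how are you other code my name is joe`. An attribute value containing `>`, or a comment or script block, would not be handled correctly by this pattern.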

rr-
  • So are you saying that I should just replace .string with .strings to get the desired result? – user1893354 Sep 17 '13 at 16:33
  • I don't think that .strings does what I want. It looks like .strings just removes all the tags. What I want is to keep only the strings within the p tags but not in any other tags within the p tags like in the example I provided. – user1893354 Sep 17 '13 at 16:48
  • In that case, you don't even need recursion. Try this (untested): `no_tags=' '.join(child.string for child in el.children for el in soup.find_all('p', text=True))`. Basically, it should return `["hello, how are you", None, "my name is joe"]` before joining. – rr- Sep 17 '13 at 17:06
  • It says `el` is not defined. Can you have two `for`s like that?
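On the last question: a generator expression can contain two `for` clauses, but they are read left to right, so the outer loop (the one that defines `el`) has to come first. A hedged sketch of one way to keep only the text sitting directly inside each `<p>`, assuming the sample from the question: it drops `text=True` from `find_all`, since that filter skips any `<p>` containing nested tags, and uses an `isinstance` check so that nested tags such as `<code>` are ignored.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

my_string = "<p>hello, how are you <code>other code</code> my name is joe</p>"
soup = BeautifulSoup(my_string, "html.parser")

no_tags = " ".join(
    child.strip()
    for el in soup.find_all("p")      # outer loop first: defines el
    for child in el.children          # inner loop: direct children of each <p>
    if isinstance(child, NavigableString) and child.strip()  # text nodes only
)
```

For the sample input, `no_tags` is `hello, how are you my name is joe`.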