
I want to extract a block of text from inside a div tag. I've seen several posts that select divs by their attributes, but the tag I want has no attributes: it's just <div>.

Below is an excerpt of the HTML. There are dozens of div tags above and below it, but this is the only one that is just <div>.

<div>
      <!-- Some text. -->
      <i>
       [Text I want block 1]
      </i>
      text I want 1
      <br/>
      text I want 2
      <br/>
      text I want 3
      <br/>
      <br/>
 </div>

However, any find method with "div" returns too much. I tried the following:

1) String and tag searches pick up every tag containing div

soup.find("div")

soup.div

2) Isolating the parent, then searching for div within it, still returns too much.

divParent = soup.find("div", class_="col-xs-12 col-lg-8 text-center")
divParent.find("div")

Any ideas? Div seems to be too common a tag/string to isolate.

BIMperson
  • Can't you get the elements directly from the div? Maybe they have useful attributes. You could try a CSS selector: `select('div i')`. You can also count the divs manually and use an index, i.e. get the third div with `find_all('div')[2]`. – furas Jan 04 '18 at 03:42
  • Better to add the real URL to the question so we can see the problem and test solutions. – furas Jan 04 '18 at 03:49
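
The two suggestions in the comment above (a CSS selector and positional indexing) might look roughly like this. It is only a sketch: the markup, including the `header` class and `nav` id, is invented for illustration, and `html.parser` is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two attributed divs followed by the bare <div>.
html = """
<div class="header">Header</div>
<div id="nav">Nav</div>
<div><i>[Text I want block 1]</i> text I want 1</div>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector: every <i> that is a descendant of some <div>.
italics = soup.select("div i")
first_italic_text = italics[0].get_text(strip=True)

# Positional indexing: the third <div> in document order (index 2).
third_div = soup.find_all("div")[2]
third_div_text = third_div.get_text(" ", strip=True)

print(first_italic_text)
print(third_div_text)
```

Indexing is brittle: it breaks as soon as the page gains or loses a div above the target, so it is probably best kept as a last resort.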

1 Answer

This can be one way of doing the job:

from bs4 import BeautifulSoup

content='''
<div>
      <!-- Some text. -->
      <i>
       [Text I want block 1]
      </i>
      text I want 1
      <br/>
      text I want 2
      <br/>
      text I want 3
      <br/>
      <br/>
 </div>
'''

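# Parse the markup; the "lxml" parser requires the lxml package to be installed.
soup = BeautifulSoup(content, "lxml")
# For every <i> nested in a <div>, take the text of its parent (here, the <div> itself).
data = ''.join([item.parent.text.strip() for item in soup.select('div i')])
print(data)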
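
An alternative that targets the attribute-less div directly, rather than going through its <i> child, is to pass a function to `find_all` and keep only div tags whose `attrs` dict is empty. This is a sketch, not part of the answer above: the wrapper class is taken from the question, while the sibling divs and the helper name `is_bare_div` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page: only the middle <div> has no attributes.
html = """
<div class="col-xs-12 col-lg-8 text-center">
  <div class="header">Header</div>
  <div>
    <!-- Some text. -->
    <i>[Text I want block 1]</i>
    text I want 1
    <br/>
    text I want 2
    <br/>
    text I want 3
    <br/>
    <br/>
  </div>
  <div id="footer">Footer</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def is_bare_div(tag):
    """Match <div> tags that carry no attributes at all."""
    return tag.name == "div" and not tag.attrs


bare_divs = soup.find_all(is_bare_div)

# stripped_strings yields each text fragment with surrounding whitespace removed.
lines = list(bare_divs[0].stripped_strings)
print(lines)
```

Because the filter looks at the tag itself, it should keep working even if the bare div contains no <i> element.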
SIM
  • That worked. Thanks. Just trying to understand what's going on here. You're joining (what does the '' do?) the list created from a for loop. – BIMperson Jan 04 '18 at 14:50
  • Accidentally saved the above comment before finishing. I meant to say: You're joining results from a list. What does the '' do? You're selecting the 'div i' tag, then selecting its parent. Is that right? – BIMperson Jan 04 '18 at 14:59
  • This is the basic syntax of join: `"".join()`. Mind the two double quotes before the dot. You can check out this link for clarity about the join function: [Link](https://stackoverflow.com/questions/1876191/what-exactly-does-the-join-method-do). Ain't that what you asked? – SIM Jan 04 '18 at 15:08
  • That was the question. Thanks. I get it now. – BIMperson Feb 14 '18 at 19:40
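
To make the `join` point from the comments concrete: the string before `.join` is the separator placed between the list items, so `''` means nothing is inserted between them. A small sketch with made-up values:

```python
parts = ["text I want 1", "text I want 2", "text I want 3"]

# Empty separator: the items are glued together with nothing in between.
no_sep = "".join(parts)

# Newline separator: one item per line.
newline_sep = "\n".join(parts)

print(no_sep)
print(newline_sep)
```

In the answer's code there is only one matching <i>, so the list has a single item and the separator never actually appears in the result.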