
Hello everyone. I'm trying to pull certain text from a website. Not all of the text is needed, and I'm confused about how to do this when the text is spread across multiple divs, especially when there are multiple rows inside. Here is the HTML I'm looking at. I need to pull the "Number" title and its text (837270), and the "Location" title and its text (Ohio).

                   <br>
                <br>
              </p>
            </div>
          </div>
          <div class="row">
            <div class="col-md-4">
                <p>
                  <span class="text-muted">Number</span>
                  <br>
                  "837270"
                </p>
            </div>
            <div class="col-md-4">
              <p>
                <span class="text-muted">Location</span>
                <br>
                "Ohio"
              </p>
            </div>
              <div class="col-md-4">
                <p>
                  <span class="text-muted">Office</span>
                <be>
                   "Joanna" 
                </p>
              </div>
          </div>
          <div class="row">
            <div class="col-md-4">
              <p>
                <span class="text-muted">Date</span>
              <be>
                "07/01/2022"
              </p>
            </div>
            <div class="col-md-4">
                <p>
                  <span class="text-muted">Type</span>
                <br>
                  "Business"
                </p>
            </div>
            <div class="col-md-4">
                <p>
                  <span class="text-muted">Status</span>
                  <br>
                  "Open"
                </p>
            </div>
          </div>
        </div>
      </div>

    </div>

I've tried this, but it prints None.

soup = BeautifulSoup(driver.page_source,'html.parser')  
df = soup.find('div', id = "Location")
print(df.string)

I want to pull this data and save it. Any help would be appreciated. Thank you.

HedgeHog
seanofdead
  • Can you show what your expected output should be? – sytech Mar 12 '22 at 07:07
  • Does this answer your question? [How to find elements by class](https://stackoverflow.com/questions/5041008/how-to-find-elements-by-class) – Abhinav Mathur Mar 12 '22 at 07:09
  • @sytech I'm looking to put the data into an Excel file, so something like A1: Location, A2: Ohio, B1: Number, B2: 837270. But I can't get BS4 to find the data specifically, since there aren't any tables on this page. – seanofdead Mar 12 '22 at 07:19
  • `soup.find('div', id = "Location")` won't work, because there is no `div` with an `id` attribute. It would be great if you could improve your question with the expected output and take a minute to read: How to create [mcve]. Thanks – HedgeHog Mar 12 '22 at 07:30

1 Answer

Sometimes HTML won't have IDs or other patterns that are easy to follow. You can get pretty clever with this, though: you don't have to rely on HTML pages using table structures.

In this case, for example, it appears each section is titled by a <span class="text-muted"> tag, and its value is the last sibling of that span tag.

To scrape each of these titles and their values, we can do something like this:

import bs4
from bs4 import BeautifulSoup
soup = BeautifulSoup(..., 'lxml')

for title_tag in soup.find_all('span', class_='text-muted'):

    # get the last sibling
    *_, value_tag = title_tag.next_siblings

    title = title_tag.text.strip()

    if isinstance(value_tag, bs4.element.Tag):
        value = value_tag.text.strip()
    else:  # it's a navigable string element
        value = value_tag.strip()

    print(title, value)

Output:

Number "837270"
Location "Ohio"
Office "Joanna"
Date "07/01/2022"
Type "Business"
Status "Open"
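
Since the goal mentioned in the comments is a spreadsheet with a header row and a value row, the loop above could feed a small CSV writer. This is only a sketch under a few assumptions: `SAMPLE_HTML` is a trimmed copy of the question's markup, `extract_fields` and `write_csv` are hypothetical helper names, the literal quote characters are stripped on the guess that they are not wanted in the sheet, and CSV (which Excel can open) stands in for a real Excel file.

```python
import csv
import io

from bs4 import BeautifulSoup
from bs4.element import Tag

# Trimmed copy of the markup from the question, used as sample input.
SAMPLE_HTML = """
<div class="row">
  <div class="col-md-4">
    <p>
      <span class="text-muted">Number</span>
      <br>
      "837270"
    </p>
  </div>
  <div class="col-md-4">
    <p>
      <span class="text-muted">Location</span>
      <br>
      "Ohio"
    </p>
  </div>
</div>
"""


def extract_fields(html):
    """Map each text-muted title to the text of its last sibling."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for title_tag in soup.find_all("span", class_="text-muted"):
        siblings = list(title_tag.next_siblings)
        if not siblings:
            continue  # a title with nothing after it has no value
        last = siblings[-1]
        raw = last.get_text() if isinstance(last, Tag) else str(last)
        # Assumption: surrounding whitespace and literal quotes are unwanted.
        fields[title_tag.get_text(strip=True)] = raw.strip().strip('"')
    return fields


def write_csv(fields, wanted, handle):
    """Write one header row and one value row for the wanted titles."""
    writer = csv.writer(handle)
    writer.writerow(wanted)
    writer.writerow([fields.get(name, "") for name in wanted])


fields = extract_fields(SAMPLE_HTML)
buffer = io.StringIO()
write_csv(fields, ["Location", "Number"], buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with `open("output.csv", "w", newline="")`; the file name here is just a placeholder.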

There are of course other patterns you could identify here to reliably get the values. This is just one example.
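
As one illustration of another pattern, each title/value pair also sits inside its own `<p>` element, so the text fragments of that paragraph can be read directly. The sketch below assumes the same markup shape as in the question (`SAMPLE_HTML` is a trimmed copy) and uses `html.parser` so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question, used as sample input.
SAMPLE_HTML = """
<div class="row">
  <div class="col-md-4">
    <p>
      <span class="text-muted">Date</span>
      <br>
      "07/01/2022"
    </p>
  </div>
  <div class="col-md-4">
    <p>
      <span class="text-muted">Status</span>
      <br>
      "Open"
    </p>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

pairs = {}
for paragraph in soup.find_all("p"):
    # Only consider paragraphs that carry a text-muted title span.
    if paragraph.find("span", class_="text-muted") is None:
        continue
    # stripped_strings yields each non-empty text fragment, whitespace-trimmed.
    parts = list(paragraph.stripped_strings)
    if len(parts) >= 2:
        pairs[parts[0]] = parts[-1]

print(pairs)
```

This keeps the literal quote characters, matching the output shown above.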

If you wanted to get just the Location, you could locate it by its text.

location_tag = soup.find('span', class_='text-muted', string='Location')

Then getting its value works the same way as above.

*_, location_value_element = location_tag.next_siblings
print(location_value_element.strip()) # "Ohio"
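
Note that `soup.find` returns `None` when nothing matches, and matching on a plain string is exact, so a span whose text has extra whitespace around it would be missed. Whether that is the case on the real page is an assumption, since only a snippet is shown. A more forgiving lookup might look like the sketch below, where `find_value` is a hypothetical helper and `SAMPLE_HTML` is made-up markup with padded title text.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up sample: the title text is padded with spaces on purpose.
SAMPLE_HTML = """
<p>
  <span class="text-muted"> Location </span>
  <br>
  "Ohio"
</p>
"""


def find_value(soup, title):
    """Return the text after the text-muted span titled `title`, or None."""
    for span in soup.find_all("span", class_="text-muted"):
        # Compare on whitespace-trimmed text instead of an exact match.
        if span.get_text(strip=True) != title:
            continue
        siblings = list(span.next_siblings)
        if not siblings:
            return None
        last = siblings[-1]
        text = last.get_text() if isinstance(last, Tag) else str(last)
        return text.strip()
    return None  # no span with that title was found


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The exact-match lookup misses the padded title and returns None.
exact = soup.find("span", class_="text-muted", string="Location")
print(exact)

print(find_value(soup, "Location"))
print(find_value(soup, "Missing"))
```

Checking the result for `None` before touching `.next_siblings` avoids an `AttributeError` when the title is not on the page.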
sytech
  • Wow, this was awesome, thank you so much! I have 2 questions for you, if you don't mind. 1st: do you have to find all of the 'span' tags, or is there a way to find only certain 'span' tags? 2nd: could you explain briefly what the *_ line is doing? – seanofdead Mar 12 '22 at 07:47
  • @seanofdead updated to answer your first question. For your second question, it's a form of iterable unpacking. `.next_siblings` is a generator. I unpack (and discard to the variable `_`) all but the last value and assign the last value to that name. As another example, `*a, b = range(5)` would result in `a` being `[0,1,2,3]` and `b` being `4`. This is defined by [PEP-3132](https://peps.python.org/pep-3132/). Using the name `_` is just a convention in Python that says "this variable won't be used". – sytech Mar 12 '22 at 07:54
  • Thank you so much for your help and time! I did run into an error with the edited part of the code, specifically for the location only: `*_, location_value_element = location_tag.next_siblings` raises `AttributeError: 'NoneType' object has no attribute 'next_siblings'` – seanofdead Mar 12 '22 at 08:15