Extracting data from different types of HTML using BeautifulSoup in Python

Question

I have the following two types of HTML and need to extract the "Student ID" from each. My code works for the first type, but I am not sure how to modify it so that it also extracts the "Student ID" correctly from the second type. Type 1:

student_html='''
<div style= "position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student ID
  <span style="font-family: Helvetica; font-size:8px">
   123456
   <br/>
  </span>
</div>

<div style= "position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student Name
  <span style="font-family: Helvetica; font-size:8px">
   John Doe
   <br/>
  </span>
</div>
'''

I am using the following code to extract the "Student ID" from the above HTML:

from bs4 import BeautifulSoup

soup = BeautifulSoup(student_html, "lxml")
span_tags = soup.find_all("span")
for span in span_tags:
    if span.text.strip() == "Student ID":
        student_id = span.findNext("span").text
    if span.text.strip() == "Student Name":
        student_name = span.findNext("span").text

This is the second type of HTML. Type 2:

type2HTML = '''<div style= "position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student ID
   <br/>
   123456
   <br/>
  </span>
</div>
<div style= "position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student Name
   <br/>
   John Doe
   <br/>
  </span>
</div>
'''

How can I modify the above code to extract the student ID from this? Similarly, I need to extract other information: Student Name, Address, Grade, etc.

score 1 · Answer 1 · answered May 25 '21 at 13:02

You could try this once you have isolated the relevant <div> tags from the source HTML.

For example:

from bs4 import BeautifulSoup

type_one = """
<div style= "position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student ID
  <span style="font-family: Helvetica; font-size:8px">
   123456
   <br/>
  </span>
</div>"""

type_two = """<div style= "position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student ID
   <br/>
   123456
   <br/>
  </span>
</div>
"""

all_types = [type_one, type_two]

for _type in all_types:
    _id = (
        BeautifulSoup(_type, "lxml")
        .find("span")
        .getText(strip=True, separator="|")
        .split("|")[-1]
    )
    print(_id)

Output:

123456
123456
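
To see why the same chain handles both layouts, it may help to look at the string before it is split. The sketch below is an illustration only: the helper name `span_text` is hypothetical, and it assumes Python's built-in "html.parser" instead of "lxml" so that nothing beyond bs4 is needed. In both layouts the first <span> ends up containing the label and the value, so joining its text fragments with "|" puts the value last.

```python
from bs4 import BeautifulSoup

# Sample inputs copied from the answer above.
type_one = """
<div style= "position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student ID
  <span style="font-family: Helvetica; font-size:8px">
   123456
   <br/>
  </span>
</div>"""

type_two = """<div style= "position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student ID
   <br/>
   123456
   <br/>
  </span>
</div>
"""


def span_text(html):
    """Join the text fragments of the first <span> with '|' (hypothetical helper)."""
    # Assumption: "html.parser" is used here; the answer itself uses "lxml".
    span = BeautifulSoup(html, "html.parser").find("span")
    # strip=True trims each fragment and drops the whitespace-only ones.
    return span.get_text(separator="|", strip=True)


for html in (type_one, type_two):
    joined = span_text(html)
    # The value is the last "|"-separated piece in both layouts.
    print(joined, "->", joined.split("|")[-1])
```

Note that "|" is only safe as a separator as long as it never appears in the field values themselves.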

baduker


Interesting!! Can we use regex too? – Nanthakumar J J May 25 '21 at 13:07
@NanthakumarJJ Use regex to parse HTML? No, it's not a good idea. [Here's why](https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not). – baduker May 25 '21 at 13:08
Yeah, you could, but for this task it's overkill. – baduker May 25 '21 at 13:09
Thanks, but there are multiple fields whose info I need to extract. Your code works perfectly for the student ID, but I am not able to get the correct div for Student ID and the other fields. – TLanni May 25 '21 at 14:30
@TLanni then either update your question with relevant details or share the URL. – baduker May 25 '21 at 14:35
@baduker I have updated my question and also the sample code I have written, which works for type 1. In it I check whether the span contains a particular field, like this: soup=BeautifulSoup(student_html,"lxml"); span_tags=soup.find_all("span"); for span in span_tags: if span.text.strip()=="Student ID": student_id=span.findNext("span").text – TLanni May 25 '21 at 14:39
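
For the multi-field case raised in these comments, one possible extension is to treat every <div> as a single label/value pair and collect them into a dictionary. This is a minimal sketch and not part of the original answer: the function name `extract_fields` is hypothetical, it assumes each <div> holds exactly one label followed by its value (as in the question's samples), and it uses Python's built-in "html.parser" rather than "lxml".

```python
from bs4 import BeautifulSoup

# Sample inputs copied from the question (both layouts).
type_one = '''
<div style= "position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student ID
  <span style="font-family: Helvetica; font-size:8px">
   123456
   <br/>
  </span>
</div>

<div style= "position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student Name
  <span style="font-family: Helvetica; font-size:8px">
   John Doe
   <br/>
  </span>
</div>
'''

type_two = '''<div style= "position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student ID
   <br/>
   123456
   <br/>
  </span>
</div>
<div style= "position:absolute; border:textbook 1px solid">
  <span style="font-family: Helvetica; font-size:8px">
   Student Name
   <br/>
   John Doe
   <br/>
  </span>
</div>
'''


def extract_fields(html):
    """Return {label: value}, assuming one label/value pair per <div>."""
    # Assumption: "html.parser" is used here; Answer 1 uses "lxml".
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for div in soup.find_all("div"):
        # stripped_strings yields each non-empty text fragment, trimmed.
        parts = list(div.stripped_strings)
        if len(parts) < 2:
            continue  # no value after the label, so skip this div
        label, *values = parts
        fields[label] = " ".join(values)
    return fields


for html in (type_one, type_two):
    print(extract_fields(html))
# Both layouts print:
# {'Student ID': '123456', 'Student Name': 'John Doe'}
```

Because the label is simply the first text fragment in each <div>, fields such as Address or Grade would be picked up without extra `if` branches, provided they follow the same one-pair-per-div layout.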

score -1 · Answer 2 · answered May 25 '21 at 13:30

If you're free to use other modules, consider the following solution:

    from weblib.etree import parse_html
    from selection import XpathSelector

    student_html = '''
    <div style= "position:absolute; border:textbook 1px solid">
      <span style="font-family: Helvetica; font-size:8px">
       Student ID
      <span style="font-family: Helvetica; font-size:8px">
       123456
       <br/>
      </span>
    </div>'''

    type2HTML = '''<div style= "position:absolute; border:textbook 1px solid">
      <span style="font-family: Helvetica; font-size:8px">
       Student ID
       <br/>
       123456
       <br/>
      </span>
    </div>'''

    all_types = [student_html, type2HTML]

    for _type in all_types:
        node = parse_html(_type)

        nodes = [node for node in XpathSelector(node).select('//span')]

        if len(nodes) == 1:
            content = nodes[0].text()
        else:
            content = nodes[1].text()

        student_id = content.replace('Student ID', '').strip()

        print(student_id)

Output:

123456
123456
