Unable to parse HTML in Python using BeautifulSoup

Question

I'm new to programming and to Python, so please bear with me. :) The page I'm trying to parse is http://results.vtu.ac.in/vitavi.php?rid=1JS10CS007&submit=SUBMIT

On that page, I need to scrape a few things, which I've marked in the image attached to this post. I can't work out how to do it myself because the page's HTML isn't well organised. Please help me with this. Thanks.

I have written a program that fetches the page's HTML. Here it is:

from bs4 import BeautifulSoup
from urllib2 import urlopen
mylink = "http://results.vtu.ac.in/vitavi.php?rid=1JS10CS007&submit=SUBMIT"
pagetext = urlopen(mylink).read()
soup = BeautifulSoup(pagetext)
print soup.prettify()

[Let me google that for you...](http://stackoverflow.com/questions/4462061/beautiful-soup-to-parse-url-to-get-another-urls-data) [this also looks like it could be help](http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup) — Andy_Lima, Jul 15 '15 at 15:22

score 3 · Answer 1 · answered Jul 15 '15 at 15:23

I'm assuming you want the contents of the results table.

The page has no data-oriented classes or ids, and its many nested tables make the desired data harder to locate.

I would find the "Subject" text node and walk up to its nearest parent table. Then, iterate over the rows and cells and grab the desired data:

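# Python 2 code (urllib2, print statement)
from urllib2 import urlopen

from bs4 import BeautifulSoup

url = "http://results.vtu.ac.in/vitavi.php?rid=1JS10CS007&submit=SUBMIT"

# No parser is named here, so bs4 picks the best one installed
soup = BeautifulSoup(urlopen(url))
# Locate the "Subject" header text, then climb to the table that contains it
results_table = soup.find(text="Subject").find_parent("table")

# One list of stripped cell texts per table row
for row in results_table.find_all("tr"):
    print [cell.get_text(strip=True) for cell in row.find_all("td")]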

Prints:

[u'Subject', u'External', u'Internal', u'Total', u'Result']
[u'Software Architectures (10IS81)', u'46', u'21', u'67', u'P']
[u'System Modeling and Simulation (10CS82)', u'45', u'15', u'60', u'P']
[u'Software Testing (10CS842)', u'41', u'15', u'56', u'P']
[u'Project Work (10CS85)', u'95', u'97', u'192', u'P']
[u'Information and Network Security (10CS835)', u'39', u'20', u'59', u'P']
[u'Seminar (10CS86)', u'0', u'44', u'44', u'P']
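
For anyone on Python 3, here is a minimal sketch of the same idea that runs offline. The inline SAMPLE_HTML is a made-up stand-in for the real page (an outer layout table wrapping the marks table), and extract_rows and to_records are hypothetical helper names, not part of bs4. It also assumes bs4 4.4 or later, where string= is the current name for the older text= argument, and it names "html.parser" explicitly so the result does not depend on which parsers are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the results page: an outer layout table
# wrapping the table that actually holds the marks.
SAMPLE_HTML = """
<table>
  <tr><td>
    <table>
      <tr><td>Subject</td><td>External</td><td>Internal</td><td>Total</td><td>Result</td></tr>
      <tr><td>Software Architectures (10IS81)</td><td>46</td><td>21</td><td>67</td><td>P</td></tr>
      <tr><td>Seminar (10CS86)</td><td>0</td><td>44</td><td>44</td><td>P</td></tr>
    </table>
  </td></tr>
</table>
"""


def extract_rows(html):
    """Return one list of cell texts per row of the table holding 'Subject'."""
    soup = BeautifulSoup(html, "html.parser")
    # Find the text node that is exactly "Subject" (the header cell).
    header = soup.find(string="Subject")
    if header is None:
        return []  # page layout changed or no results were returned
    # The nearest enclosing <table> is the inner results table.
    table = header.find_parent("table")
    return [
        [cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in table.find_all("tr")
    ]


def to_records(rows):
    """Turn [header, row, row, ...] into a list of dicts keyed by header."""
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


rows = extract_rows(SAMPLE_HTML)
for record in to_records(rows):
    print(record["Subject"], record["Total"], record["Result"])
```

Returning an empty list when "Subject" is missing avoids the AttributeError you would otherwise get from calling find_parent on None, which is worth having if the page can come back without a results table.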

Hello alecxe, thank you very much. I think the code you've posted should work pretty well for me. — Arpith P Muddi, Jul 16 '15 at 17:10
@ArpithPMuddi It should, please see http://stackoverflow.com/help/someone-answers. Thanks! — alecxe, Jul 16 '15 at 17:11
