0

I'm trying to write a simple application that reads the HTML from a webpage, converts it to a string, and displays certain slices of that string to the user. However, it seems like these slices change themselves! Each time I run my code I get a different output! Here's the code.

# import urllib so we can get HTML source
from urllib.request import urlopen
# import time, so we can choose which date to read from
import time


# save HTML to a variable
content = urlopen("http://www.islamicfinder.org/prayerDetail.php?country=canada&city=Toronto&state=ON&lang")

# read the response body and convert the HTML to a string
content = str(content.read())

# select part of the string containing the prayer time table
table = content[24885:24935]

print(table)  # print to test what is being selected

I'm not sure what's going on here.

  • It would be better to use a library that parses the HTML and can extract a specific element, such as a table, based on its properties. Beautiful Soup is one such parser for Python. It is available at https://pypi.python.org/pypi/beautifulsoup4, and there is an example of using it for table extraction at http://stackoverflow.com/questions/11790535/extracting-data-from-html-table. Using it with http://www.islamicfinder.org/prayerDetail.php will be more difficult, since viewing the source shows that it does not assign a class to its tables and also nests them, but it does assign the same class to all td elements. –  Jul 26 '15 at 21:33

2 Answers

1

You should really be using something like Beautiful Soup. Something along the lines of the following should help. Looking at the source for that URL, the table has no id or class, which makes it a little trickier to find.

from bs4 import BeautifulSoup
import requests

url = "http://www.islamicfinder.org/prayerDetail.php?country=canada&city=Toronto&state=ON&lang"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")  # name the parser explicitly

for table in soup.find_all('table'):
    # here you can find the table you want and deal with the results
    print(table)
0

You shouldn't look for the part you want by slicing the string at fixed indexes. Websites are often dynamic, so the string won't contain exactly the same content at the same positions each time.

What you want to do instead is search for the table. Say the table started with the keyword class="prayer_table"; you could then locate it with str.find().
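
A minimal sketch of that idea, using two made-up page snapshots rather than the real site. The marker class="prayer_table", the HTML strings, and the helper name slice_from_marker are assumptions for illustration and are not taken from the actual page source. It shows a fixed slice drifting when the text in front of the table changes length, while a slice anchored with str.find() stays on the table.

```python
# Two hypothetical snapshots of the "same" page; only the banner text differs.
MARKER = 'class="prayer_table"'  # assumed marker, not from the real page

page_a = '<p>Ad: short</p><table class="prayer_table"><tr><td>Fajr 4:12</td></tr></table>'
page_b = '<p>Ad: a much longer banner</p><table class="prayer_table"><tr><td>Fajr 4:12</td></tr></table>'


def slice_from_marker(html, marker=MARKER, length=40):
    """Return `length` characters starting at the marker, or None if absent."""
    start = html.find(marker)  # str.find() returns -1 when nothing matches
    if start == -1:
        return None
    return html[start:start + length]


# A fixed slice lands on different text once the banner changes length.
fixed_a = page_a[20:60]
fixed_b = page_b[20:60]

# A marker-based slice starts at the same place in both snapshots.
found_a = slice_from_marker(page_a)
found_b = slice_from_marker(page_b)

print(fixed_a == fixed_b)  # False
print(found_a == found_b)  # True
```

This is still fragile if the marker text itself changes, which is why a real HTML parser is usually the safer choice.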

Better yet, extract the tables from the web page with an HTML parser instead of relying on str.find(). The code below comes from a question about extracting tables from a web page:

from lxml import etree
from urllib.request import urlopen  # Python 3 location of urlopen

web = urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
s = web.read()

html = etree.HTML(s)

## Get all 'tr'
tr_nodes = html.xpath('//table[@id="Report1_dgReportDemographic"]/tr')

## 'th' is inside first 'tr'
header = [i[0].text for i in tr_nodes[0].xpath("th")]

## Get the text from all remaining 'tr' rows
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]
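
If installing lxml is not an option, the same row-by-row extraction can be approximated with the standard library. The sketch below swaps in html.parser for lxml; the class name TableTextParser and the sample table (including its prayer times) are made up for illustration. It assumes tables are not nested, which is not true of every real page.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the cell text of every <table> as a list of rows.

    Hypothetical helper; it does not handle nested tables.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen
        self._row = None   # row currently being filled, if any
        self._cell = None  # text chunks of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


# Made-up sample data standing in for a downloaded page.
sample = """
<table>
  <tr><th>Prayer</th><th>Time</th></tr>
  <tr><td>Fajr</td><td>4:12</td></tr>
  <tr><td>Dhuhr</td><td>1:24</td></tr>
</table>
"""

parser = TableTextParser()
parser.feed(sample)
rows = parser.tables[0]
header, body = rows[0], rows[1:]
print(header)
print(body)
```

For a real page, the string passed to feed() would be the decoded response body rather than the sample above.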
Syntactic Fructose