Table Scraping with same class attributes

Question

I am trying to scrape the prayer times from the website www.hujjat.org.

Here is the HTML for the area I am interested in (note that the class attribute is the same for all four prayers):

<table width="100%">
    <tbody>
        <tr>
            <td class="NamaazTimes">
                <div class="NamaazTimeName">Fajr</div>
                <div class="NamaazTime">04:42</div>
            </td>
            <td class="NamaazTimes">
                <div class="NamaazTimeName">Sunrise</div>
                <div class="NamaazTime">06:32</div>
            </td>
            <td class="NamaazTimes">
                <div class="NamaazTimeName">Zohr</div>
                <div class="NamaazTime">13:02</div>
            </td>
            <td class="NamaazTimes">
                <div class="NamaazTimeName">Maghrib</div>
                <div class="NamaazTime">19:33</div>
            </td>
        </tr>
    </tbody>
</table>

So far I have written the following code:

# import libraries
import json
import urllib2
from bs4 import BeautifulSoup
# specify the url
quote_page = 'http://www.hujjat.org/'
# query the website and return the html to the variable 'page'
page = urllib2.urlopen(quote_page)
# parse the html using Beautiful Soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')

table = soup.find("div",class_="NamaazTimeName", text="Fajr").find_previous("table")
for row in table.find_all("tr"):
    a = row.find_all("td")

   # print(row.find_all("td"))

print (a)

And my result is :

[<td class="NamaazTimes">\n<div class="NamaazTimeName">Fajr</div>\n<div class="NamaazTime">04:42</div>\n</td>, <td class="NamaazTimes">\n<div class="NamaazTimeName">Sunrise</div>\n<div class="NamaazTime">06:32</div>\n</td>, <td class="NamaazTimes">\n<div class="NamaazTimeName">Zohr</div>\n<div class="NamaazTime">13:02</div>\n</td>, <td class="NamaazTimes">\n<div class="NamaazTimeName">Maghrib</div>\n<div class="NamaazTime">19:33</div>\n</td>]

What I want from my code is just the time for each prayer, e.g. if the prayer is "Fajr" then the output should be "04:42". I then want to save this "04:42" in a text file.

Can someone help me please?

Thanks.

Pedro · Answer 1 · 2018-09-13T11:17:28.767

1

I'd suggest using `select` instead of `find`, since it takes CSS selectors like the ones a browser uses. That way you can collect all the inner texts into one list and work from there.

Something similar to this should help:

# import libraries
import json
import urllib2
from bs4 import BeautifulSoup
# specify the url
quote_page = 'http://www.hujjat.org/'
# query the website and return the html to the variable 'page'
page = urllib2.urlopen(quote_page)
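# parse the html using Beautiful Soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')

# find the table that contains the "Fajr" cell
table = soup.find("div",class_="NamaazTimeName", text="Fajr").find_previous("table")
# texts alternates between names and times: name, time, name, time, ...
texts = [x.text for x in table.select("td.NamaazTimes div")]
# keep every second entry, i.e. only the times
only_times = [texts[x+1] for x in range(0, len(texts), 2)]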

# we'll open the file in a with block, so we don't need to close it
with open("foo.txt", "w") as fp:
    # you'll need to iterate each string
    for row in only_times:
        fp.write(row + "\n")

EDIT(2): Re-phrased my comments in the code.

EDIT(3): Did some code cleanup and changed to only store the times.
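If only one prayer's time is needed, the name and time divs can be paired into a dictionary and looked up by name. This is a minimal sketch, assuming Python 3 and the HTML structure shown in the question; `SAMPLE_HTML`, `prayer_times` and the filename `fajr.txt` are illustrative names, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; a real script would fetch the page instead.
SAMPLE_HTML = """
<table width="100%"><tbody><tr>
  <td class="NamaazTimes"><div class="NamaazTimeName">Fajr</div><div class="NamaazTime">04:42</div></td>
  <td class="NamaazTimes"><div class="NamaazTimeName">Sunrise</div><div class="NamaazTime">06:32</div></td>
  <td class="NamaazTimes"><div class="NamaazTimeName">Zohr</div><div class="NamaazTime">13:02</div></td>
  <td class="NamaazTimes"><div class="NamaazTimeName">Maghrib</div><div class="NamaazTime">19:33</div></td>
</tr></tbody></table>
"""


def prayer_times(html):
    """Return a dict mapping each prayer name to its time string."""
    soup = BeautifulSoup(html, "html.parser")
    times = {}
    for cell in soup.select("td.NamaazTimes"):
        name = cell.select_one("div.NamaazTimeName").get_text(strip=True)
        time = cell.select_one("div.NamaazTime").get_text(strip=True)
        times[name] = time
    return times


times = prayer_times(SAMPLE_HTML)
print(times["Fajr"])  # prints 04:42

# Write just the one time (hypothetical filename).
with open("fajr.txt", "w") as fp:
    fp.write(times["Fajr"] + "\n")
```

Pairing the two divs inside each `td` avoids relying on the name and time lists staying in step with each other.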

edited Sep 13 '18 at 11:17

answered Sep 12 '18 at 15:53

Pedro


Hi Pedro Thank you for your answer. I tried your code and I get the following: [(u'Fajr', u'04:42'), (u'Sunrise', u'06:32'), (u'Zohr', u'13:02'), (u'Maghrib', u'19:33')] I was wondering if I could change and only get e.g. Fajr 04:42? Is that possible? – Ahmed.B Sep 12 '18 at 22:45
Are you referring to the concatenation of the two fields? Or the 'u' prefix? In case you are referring to the u prefix, it means that your strings are unicode instead of ascii. I am not sure if you can "turn it off". But it shouldn't show up when you save it to the file. You can always encode them as ascii, but you might need to make some decisions... – Pedro Sep 13 '18 at 02:17
Please take a look [here](https://stackoverflow.com/a/1207479/5409287), [here](https://stackoverflow.com/a/761459/5409287) or [here](https://docs.python.org/2/howto/unicode.html) – Pedro Sep 13 '18 at 02:20
Hi Pedro so I added the following lines to save it to a text file: – Ahmed.B Sep 13 '18 at 09:26
fo = open('/home/homeassistant/.homeassistant/python_scripts/test3.txt', 'w') fo.write(times_pairs) fo.close() – Ahmed.B Sep 13 '18 at 09:28
But I get this error now: [(u'Fajr', u'04:44'), (u'Sunrise', u'06:34'), (u'Zohr', u'13:02'), (u'Maghrib', u'19:31')] Traceback (most recent call last): File "test3.py", line 27, in fo.write(times_pairs) TypeError: expected a string or other character buffer object – Ahmed.B Sep 13 '18 at 09:28
Hi Pedro....thank you for the answer. I have tried to iterate each string using this method: – Ahmed.B Sep 13 '18 at 10:41
str2 = "Fajr" startnum = int(times_pairs_str.index(str2)) newname = times_pairs_str[startnum + 5 :startnum + 10] print (newname) with open("/home/homeassistant/.homeassistant/python_scripts/test3.txt", "w") as fp: # you'll need to iterate each string for row in times_pairs_str: fp.write(row + "\n") – Ahmed.B Sep 13 '18 at 10:42
but I get an error : Traceback (most recent call last): File "test3.py", line 32, in startnum = int(times_pairs_str.index(str2)) ValueError: 'Fajr' is not in list – Ahmed.B Sep 13 '18 at 10:42
Wait, maybe my phrasing in the code's comments was a bit misleading. The code as is should work. You'll need to change "foo.txt" with your path to your file. – Pedro Sep 13 '18 at 10:58
I have done that, and in my test3.txt file the data is saved as 4 lines, i.e. – Ahmed.B Sep 13 '18 at 11:02
`Fajr 04:44` `Sunrise 06:34` `Zohr 13:02` `Maghrib 19:31` – Ahmed.B Sep 13 '18 at 11:03
I wanted just the time, so when I save my results in my test3.txt, the file should only contain **ONE** time like this **`04:44`** and nothing else. `find()` doesn't work with lists; I tried it. – Ahmed.B Sep 13 '18 at 11:06
Pedro is it possible to show only one time? only one like **04:44**?? – Ahmed.B Sep 13 '18 at 11:56
I see from your other comments that you would like the times in different files. How should the files be named? Numerically? Or using one of the fields? Maybe you could update your question to also have this information. – Pedro Sep 13 '18 at 12:44
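The `TypeError` reported in the comments above comes from passing a list to `write`, which accepts only a string, and the `ValueError` arises because the list holds `(name, time)` tuples rather than the bare string "Fajr". Below is a minimal sketch of both fixes, assuming Python 3 and a `times_pairs` list shaped like the one printed in the comments; the filenames are placeholders.

```python
# Sample data shaped like the list printed in the comments above.
times_pairs = [("Fajr", "04:44"), ("Sunrise", "06:34"),
               ("Zohr", "13:02"), ("Maghrib", "19:31")]

# write() needs a single string, so build one line per (name, time) pair.
lines = ["{} {}".format(name, time) for name, time in times_pairs]
with open("all_times.txt", "w") as fp:  # placeholder filename
    fp.write("\n".join(lines) + "\n")

# To keep only one time, look it up by name instead of searching the list.
lookup = dict(times_pairs)
fajr_time = lookup["Fajr"]
with open("fajr_only.txt", "w") as fp:  # placeholder filename
    fp.write(fajr_time)

print(fajr_time)  # prints 04:44
```

Converting the pairs to a dict makes the lookup independent of the order in which the site lists the prayers.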

teller.py3 · Accepted Answer · 2018-09-27T13:01:56.083

1

This works:

from bs4 import BeautifulSoup
import requests

url = 'https://www.hujjat.org/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
path = 'C:/Users/John/Documents/Python/'  # output directory; keep the trailing "/"

# collect the prayer names and times as two parallel lists of strings
namaazNames = soup.select('div.NamaazTimeName')
namaazNames = [namaazName.text for namaazName in namaazNames]
namaazTimes = soup.select('div.NamaazTime')
namaazTimes = [namaazTime.text for namaazTime in namaazTimes]
# drop the second entry (Sunrise) from both lists
del namaazNames[1]
del namaazTimes[1]

# write each remaining time to its own file, named after the prayer
for namaazName, namaazTime in zip(namaazNames, namaazTimes):
    with open(path + namaazName + '.txt', 'w') as f:
        f.write(namaazTime)
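Deleting index 1 relies on Sunrise always being the second entry on the page. As a sketch of an alternative, the entries can be filtered by name instead. This assumes Python 3; the hard-coded lists stand in for the scraped values and `SKIP` is an illustrative name.

```python
# Stand-ins for the scraped lists (values taken from the question's HTML).
namaaz_names = ["Fajr", "Sunrise", "Zohr", "Maghrib"]
namaaz_times = ["04:42", "06:32", "13:02", "19:33"]

SKIP = {"Sunrise"}  # entries to leave out, matched by name rather than position

# Keep only the (name, time) pairs whose name is not in SKIP.
wanted = [(name, time)
          for name, time in zip(namaaz_names, namaaz_times)
          if name not in SKIP]

for name, time in wanted:
    print(name, time)
```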

edited Sep 27 '18 at 13:01

answered Sep 13 '18 at 12:25

teller.py3


Thanks @jxpython. It does work however I forgot to mention in my question that I want to save one particular time in my text file. So for example just **04:44**. How can I do that? I am not too familiar with slice in python. Thanks – Ahmed.B Sep 13 '18 at 12:33
So you only need 04:42 in your text file? – teller.py3 Sep 13 '18 at 12:34
Thank you @jxpython. So what are the text files called? I put a text file name and it throws this error: – Ahmed.B Sep 13 '18 at 12:48
`Traceback (most recent call last): File "test4.py", line 18, in with open(a + '/home/homeassistant/.homeassistant/python_scripts/test3.txt', 'w') as file: IOError: [Errno 2] No such file or directory: u'Fajr/home/homeassistant/.homeassistant/python_scripts/test3.txt'` – Ahmed.B Sep 13 '18 at 12:50
The names of the text files are the prayer names. So Fajr, Sunrise, Zohr and Maghrib. – teller.py3 Sep 13 '18 at 12:50
It seems like the directory you are giving doesn't exist. – teller.py3 Sep 13 '18 at 12:56
No just leave the code as it is. The four text files will be saved in the same directory as the Python file. – teller.py3 Sep 13 '18 at 13:00
Well thanks a lot for your help @jxpython I see that now. – Ahmed.B Sep 13 '18 at 13:01
God Bless you I definitely will ! – Ahmed.B Sep 13 '18 at 13:03
sorry to bother you again on this. Just for my future reference, if I wanted to change the location of all the 4 text files where they are saved. How do I do that? thanks – Ahmed.B Sep 13 '18 at 13:41
Change the path variable to the full directory path you want to save the text files. Line 7. – teller.py3 Sep 13 '18 at 13:49
That did not work I am afraid. I specified the path in line 7 but I cannot see any .txt file being saved in that folder. – Ahmed.B Sep 13 '18 at 13:57
You didn't forget the last "/" at the end of the path? If you didn't, show me the path you used. – teller.py3 Sep 13 '18 at 14:03
You are correct I forgot the last "/" Thank you very much. – Ahmed.B Sep 13 '18 at 14:05
OK, I have sorted out my previous query. Since you used a for loop in the code, I wanted to ask: I only want 3 prayer times and not the sunrise time, so I need the output to omit the 2nd row, which is the sunrise time. I used sys.stdout.write(ERASE_LINE) but I cannot explicitly specify the 2nd row. Any help will be appreciated as usual. Thanks – Ahmed.B Sep 27 '18 at 10:33
I have made an attempt but it produces "not working" even when the actual time matches the prayer time. I have put my code above thanks. – Ahmed.B Sep 27 '18 at 11:38
So you want to exclude Sunrise? Only include the other three prayer times and make a txt file? – teller.py3 Sep 27 '18 at 12:55
Well yeah yes but I cannot post my code here if you could please see this question I asked and my code maybe it will make better sense. Thanks. https://stackoverflow.com/questions/52536372/match-two-values-and-result-an-output – Ahmed.B Sep 27 '18 at 12:58
so what does your code do now? just print 3 times and not sunrise? – Ahmed.B Sep 27 '18 at 13:05
Make txt files with each prayer time. Sunrise not. So only Fajr, Zohr and Maghrib – teller.py3 Sep 27 '18 at 13:06
right ok thanks for that. I now want to match each prayer time to the current time and play an mp3 file. I have attempted it but it does not execute. Can you please look into this link https://stackoverflow.com/questions/52536372/match-two-values-and-result-an-output ? thanks – Ahmed.B Sep 27 '18 at 13:09
Let me check in a moment – teller.py3 Sep 27 '18 at 13:15
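The missing trailing "/" discussed in the comments above can be avoided by building file paths with `os.path.join`, which inserts the separator itself. This is a minimal sketch, assuming Python 3; `write_time_files` is an illustrative helper, and a temporary directory stands in for the real output folder purely so the example runs anywhere.

```python
import os
import tempfile


def write_time_files(out_dir, times):
    """Write each time to <out_dir>/<name>.txt and return the paths written."""
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
    paths = []
    for name, time in times.items():
        file_path = os.path.join(out_dir, name + ".txt")  # separator added for us
        with open(file_path, "w") as fp:
            fp.write(time)
        paths.append(file_path)
    return paths


# A temporary directory stands in for the real output folder.
out_dir = os.path.join(tempfile.mkdtemp(), "prayer_times")
written = write_time_files(out_dir, {"Fajr": "04:42", "Zohr": "13:02", "Maghrib": "19:33"})
print(written)
```

Creating the directory up front also guards against the "No such file or directory" error seen earlier in the thread.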

iamklaus · Answer 3 · 2018-09-13T07:41:19.970

0

    from bs4 import BeautifulSoup
    import pandas as pd

    # html_data must hold the page HTML as a string,
    # e.g. the table from the question
    data = BeautifulSoup(html_data, 'html.parser')

    NamaazName = data.find_all('div', {'class':'NamaazTimeName'})
    NamaazTime = data.find_all('div', {'class':'NamaazTime'})

    # map each prayer name to its time
    coll = {}
    for i in range(len(NamaazName)):
        coll[NamaazName[i].text] = NamaazTime[i].text

    master_data = pd.DataFrame()

    master_data['NamaazName'] = list(coll.keys())
    master_data['NamaazTime'] = list(coll.values())

    print(master_data)

Output

       NamaazName  NamaazTime
    0        Fajr       04:42
    1     Sunrise       06:32
    2        Zohr       13:02
    3     Maghrib       19:33

edited Sep 13 '18 at 07:41

answered Sep 12 '18 at 16:16

iamklaus


Hi Sarthank, I tried your code but I don't think it's complete. Can you please elaborate more? – Ahmed.B Sep 12 '18 at 22:43
could you tell where you are facing the problem because according to the raw html given in your problem above code works.. – iamklaus Sep 13 '18 at 07:35
Basically is there anything I need to add to your code? As I copied it but throws errors – Ahmed.B Sep 13 '18 at 08:14
whats the error ?.. also the code is based on the raw html which you pasted above..you will have to make changes accordingly – iamklaus Sep 13 '18 at 08:31
I get the error: File "test3.py", line 15 NamaazTime = data.find_all('div', {'class':'NamaazTime'}) ^ SyntaxError: invalid syntax – Ahmed.B Sep 13 '18 at 09:09
there is no syntax error in that line...make sure data is your soup object..do one thing take your raw html which you pasted above, insert in a variable then pass it through soup (#HTML PART)...then run the above code you will see..o – iamklaus Sep 13 '18 at 10:19
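The suggestion in the last comment (store the raw HTML in a variable, then pass it to the soup) could look like the sketch below. In the answer as posted, the `#` inside `BeautifulSoup(#HTML data)` comments out the closing parenthesis, which is why Python reports the syntax error on a later line. The sketch assumes Python 3; `html_data` is an illustrative variable name holding a shortened copy of the question's table.

```python
from bs4 import BeautifulSoup

# A shortened copy of the table from the question, held in a plain string.
html_data = """
<table><tbody><tr>
  <td class="NamaazTimes"><div class="NamaazTimeName">Fajr</div><div class="NamaazTime">04:42</div></td>
  <td class="NamaazTimes"><div class="NamaazTimeName">Zohr</div><div class="NamaazTime">13:02</div></td>
</tr></tbody></table>
"""

# The first argument must be the markup itself, not a comment placeholder.
data = BeautifulSoup(html_data, "html.parser")

# Extract the text of every name div and every time div, in document order.
names = [div.text for div in data.find_all("div", {"class": "NamaazTimeName"})]
times = [div.text for div in data.find_all("div", {"class": "NamaazTime"})]

for name, time in zip(names, times):
    print(name, time)
```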
