Python read forms in webpage

Question

I am reading the HTML of a webpage whose contents have the following structure:

<div class="cart">
                    <div class="cart-title">
                        <img src="https://ug3.technion.ac.il/rishum/img/regCourses.png" width="50" height="50" alt="My Courses">
                        המקצועות שלי
                    </div><div class="entry-spacer"></div><div class="cart-entry">
                    <div class="course-number">
                    <a href="https://ug3.technion.ac.il/rishum/course/104134">104134</a>
                </div>
                <div class="course-name">
                    אלגברה מודרנית ח                 
                </div>
                <div class="course-points">
                    2.5 נק'
                </div>
                <div class="entry-group">
                    קבוצה 11
                </div><div class="change-group">
                שנה קבוצה ל
                <select name="UPG104134" onchange="showWaitAndSubmit('regCart')" class="change-group-options">
                    <option value=""> </option><option>12</option><option>13</option><option>21</option><option>22</option><option>23</option>
                </select>
                </div><div class="more-actions">
                </div>
                    <div class="clear"></div></div><div class="entry-spacer"></div><div class="cart-entry">
                    <div class="course-number">
                    <a href="https://ug3.technion.ac.il/rishum/course/234118">234118</a>
                </div>
                <div class="course-name">
                    ארגון ותכנות המחשב               
                </div>
                <div class="course-points">
                    3 נק'
                </div>
                <div class="entry-group">
                    קבוצה 22
                </div><div class="change-group">
                שנה קבוצה ל
                <select name="UPG234118" onchange="showWaitAndSubmit('regCart')" class="change-group-options">
                    <option value=""> </option><option>11</option><option>12</option><option>13</option><option>14</option><option>21</option>
                </select>
                </div><div class="more-actions">
                </div>
                    <div class="clear"></div></div><div>

How can I read the course numbers, which appear in blue in my image?

Here's an example of how a course number appears in the page:

<div class="course-number">
                    <a href="https://ug3.technion.ac.il/rishum/course/104134">104134</a>
                </div>

In this example I want to read 104134.

I would prefer to save all the course numbers in a list or something similar – Jan 19 '21 at 17:41
Note: I care A LOT about performance and want to do it with lxml – Jan 19 '21 at 18:11
I don't know what to replace print(tree.xpath('//a/@href')) with – Jan 19 '21 at 18:26

EDG956 · Answer 1 · 2021-01-19T19:15:50.957

First, I'd advise using BeautifulSoup to parse the HTML. Then, off the top of my head, look for the div tags with that class name, like this:

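import requests
from bs4 import BeautifulSoup

# Download the page; "<your-target>" is a placeholder for the page URL
r = requests.get("<your-target>")

# Parse with the lxml parser (lxml must be installed in the same environment)
soup = BeautifulSoup(r.text, 'lxml')

# Text of the <a> inside every <div class="course-number">
numbers = [i.a.text for i in soup.find_all('div', attrs={"class": "course-number"})]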

I haven't checked this, but even if it doesn't work as written, it should point you toward a solution. Check BeautifulSoup's documentation for more information.

Note that in the list comprehension above, if i has no a tag, i.a is None and i.a.text raises an AttributeError. If you don't trust the structure of the website to stay the same, use a normal for-loop with a try-except, or handle the missing tag in some other way.
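The for-loop variant mentioned above could look roughly like the sketch below. The sample markup is an assumption modelled on the question's HTML, with one div deliberately missing its a tag, and the stdlib "html.parser" is used only to keep the example self-contained ('lxml' should work the same way if it is installed).

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question (an assumption for illustration);
# the second div deliberately has no <a> tag.
html = """
<div class="course-number"><a href="https://ug3.technion.ac.il/rishum/course/104134">104134</a></div>
<div class="course-number">no link here</div>
<div class="course-number"><a href="https://ug3.technion.ac.il/rishum/course/234118">234118</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

numbers = []
for div in soup.find_all("div", attrs={"class": "course-number"}):
    link = div.a  # None when the div contains no <a> tag
    if link is None:
        continue  # skip malformed entries instead of raising AttributeError
    numbers.append(link.get_text(strip=True))

print(numbers)  # ['104134', '234118']
```

Checking for None explicitly is one option; wrapping the access in try/except AttributeError would behave the same for this input.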

Beware that this method collects every div tag with the class course-number. If you want only a subset of those, either apply more filtering or first traverse the HTML tree down to the root of your target content.
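As a minimal sketch of that narrowing step, one could first locate the container and search only inside it. The div with class "cart" comes from the question's HTML; the second block with class "other" and the number 999999 are hypothetical, added only to show that they are excluded.

```python
from bs4 import BeautifulSoup

# "cart" is the container class from the question; the "other" block and
# its number 999999 are hypothetical, added to show what gets filtered out.
html = """
<div class="cart">
  <div class="cart-entry">
    <div class="course-number"><a href="https://ug3.technion.ac.il/rishum/course/104134">104134</a></div>
  </div>
</div>
<div class="other">
  <div class="course-number"><a href="#">999999</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: find the root of the target content.
cart = soup.find("div", class_="cart")

# Step 2: search only inside that subtree (empty list if the cart is missing).
cart_numbers = []
if cart is not None:
    cart_numbers = [a.get_text(strip=True)
                    for a in cart.select("div.course-number > a")]

print(cart_numbers)  # ['104134']
```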

edited Jan 19 '21 at 19:15

answered Jan 19 '21 at 18:11

Hi, I want to use lxml, as mentioned in my comments – Jan 19 '21 at 18:26
Sorry, @martin. I haven't used lxml directly, so I can't come up with a solution for that. It shouldn't be too hard, though. It's worth noting, as in the replies in the link below, that BeautifulSoup now supports lxml as an internal parser, which gives it a performance boost, though still not as good as pure lxml. Go through it; maybe you'll find that BeautifulSoup is good enough for your scenario. https://stackoverflow.com/questions/4967103/beautifulsoup-and-lxml-html-what-to-prefer – EDG956 Jan 19 '21 at 18:38
@martin I hadn't tested my code. I have updated it now and removed the issue. Please note that you need lxml installed in the same Python environment. – EDG956 Jan 19 '21 at 19:02
Can you take a look at https://stackoverflow.com/questions/65797861/pip-in-python-doesnt-install-package-correctly – Jan 19 '21 at 19:07
Your code doesn't work: `numbers = [i.a.text for i in soup.find_all('div', class ="course-number")]` gives SyntaxError: invalid syntax – Jan 19 '21 at 19:08
@martin I hadn't tried the "class" syntax. It's updated now. Regarding the link, you should install it with `pip install bs4`. – EDG956 Jan 19 '21 at 19:17
Thanks, one last thing which you forgot, I have two types of courses one of them is in
– Jan 19 '21 at 19:19
I tried attrs={"form":"regCart","class": "course-number"} but it didn't work – Jan 19 '21 at 19:20
try `soup.find('form', attrs={'name': ''})` – EDG956 Jan 19 '21 at 19:21
It returns nothing – Jan 19 '21 at 19:29
@martin I'm not sure what you're iterating through. The form obtained from `soup.find()`? That gives you the matching HTML subtree, or None if it doesn't find anything. I think the documentation is clear enough to solve the issue from here. As an extra, I think you want to search for the form first and then obtain the numbers from it, so here's a snippet. It may not work, but I'm sure you can find your way around: `form = soup.find('form', ...)` then `numbers = [i.a.text for i in form.find_all...]` – EDG956 Jan 19 '21 at 19:31
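Since the question asked for lxml, here is a hedged sketch of the same idea with XPath, including the "search the form first" step from the last comment. The form element is not shown in the question, so `<form name="regCart">` is an assumption inferred from the onchange handler and the comments; the second form and the number 999999 are hypothetical.

```python
from lxml import html as lxml_html

# Assumption: the cart lives in <form name="regCart">. The form tag is not
# shown in the question, and "otherForm" / 999999 are made up for contrast.
page = """
<html><body>
<form name="regCart">
  <div class="cart-entry">
    <div class="course-number">
      <a href="https://ug3.technion.ac.il/rishum/course/104134">104134</a>
    </div>
  </div>
  <div class="cart-entry">
    <div class="course-number">
      <a href="https://ug3.technion.ac.il/rishum/course/234118">234118</a>
    </div>
  </div>
</form>
<form name="otherForm">
  <div class="course-number"><a href="#">999999</a></div>
</form>
</body></html>
"""

tree = lxml_html.fromstring(page)

# Every course number on the page: the text of <a> directly inside a div
# whose class attribute is exactly "course-number".
all_numbers = [t.strip() for t in
               tree.xpath('//div[@class="course-number"]/a/text()')]

# Only the numbers inside the (assumed) regCart form.
cart_numbers = [t.strip() for t in tree.xpath(
    '//form[@name="regCart"]//div[@class="course-number"]/a/text()')]

print(all_numbers)   # ['104134', '234118', '999999']
print(cart_numbers)  # ['104134', '234118']
```

The first expression is what would take the place of `tree.xpath('//a/@href')` from the question's comments: it selects the link text rather than the href attribute.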
