
I am doing web scraping and have done this so far:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://abcdefgh.in')
print(page.status_code)
soup = BeautifulSoup(page.content, 'html.parser')
all_p = soup.find_all(class_="p-list-sec")
print(all_p)

After doing this, I get something like this when I print all_p:

<div class="p-list-sec">
<ul>
  <li><a href="link1" title="title1">title1</a></li>
  <li><a href="link2" title="title2">title2</a></li>
  <li><a href="link3" title="title3">title3</a></li>
</ul>
</div>

<div class="p-list-sec">
<ul>
  <li><a href="link1" title="title1">title1</a></li>
  <li><a href="link2" title="title2">title2</a></li>
  <li><a href="link3" title="title3">title3</a></li>
</ul>
</div>

<div class="p-list-sec">
<ul>
  <li><a href="link1" title="title1">title1</a></li>
  <li><a href="link2" title="title2">title2</a></li>
  <li><a href="link3" title="title3">title3</a></li>
</ul>
</div>

and so on, up to around 40 such divs.

Now I want to extract all the a href and title values inside the p-list-sec class and store them in a file. I know how to store them in a file, but extracting all the href and title values from all of the p-list-sec divs is what is creating the issue for me. I am using Python 3.9 with the requests and BeautifulSoup libraries on Windows 10, from the command prompt.

Thanks, akhi

ABD
4 Answers



Just in case you want to avoid looping twice, you can also use a BeautifulSoup CSS selector and chain the class and the <a> tag. So take your soup and select like this:

soup.select('.p-list-sec a')

To shape the information the way you want to process it, you can use a single for loop or a list comprehension, all in one line:

[{'url':link['href'], 'title':link['title']} for link in soup.select('.p-list-sec a')]

Output

[{'url': 'link1', 'title': 'title1'},
 {'url': 'link2', 'title': 'title2'},
 {'url': 'link3', 'title': 'title3'},
 {'url': 'link1', 'title': 'title1'},
 {'url': 'link2', 'title': 'title2'},
 {'url': 'link3', 'title': 'title3'},
 {'url': 'link1', 'title': 'title1'},
 {'url': 'link2', 'title': 'title2'},
 {'url': 'link3', 'title': 'title3'}]

To store it in a CSV, feel free to push it into pandas or the csv module.

Pandas:

import pandas as pd

pd.DataFrame([{'url':link['href'], 'title':link['title']} for link in soup.select('.p-list-sec a')]).to_csv('url.csv', index=False)
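to_csv takes care of quoting for you, so titles that happen to contain commas won't break the file.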

CSV:

import csv
data_list = [{'url':link['href'], 'title':link['title']} for link in soup.select('.p-list-sec a')]

keys = data_list[0].keys()

with open('url.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(data_list)
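
Note: passing newline='' to open() keeps the csv module from writing extra blank lines between rows on Windows.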
HedgeHog

In case you don't care about the div name, here is a one-liner (assuming the HTML shown above has been saved to data.html):

import re

with open("data.html", "r") as msg:
    data = msg.readlines()

data = [tuple(re.sub(r'.*href="(.+)" title="(.+)">.*', r'\1 \2', v).split()) for v in [v.strip() for v in data if "href" in v]]

Output:

[('link1', 'title1'), ('link2', 'title2'), ('link3', 'title3'), ('link1', 'title1'), ('link2', 'title2'), ('link3', 'title3'), ('link1', 'title1'), ('link2', 'title2'), ('link3', 'title3')]

Otherwise:

with open("data.html", "r") as msg:
    data = msg.readlines()

div_write = False

wdata = []
odata = []

for line in data:
    # a new section starts: remember its class name
    if '<div class=' in line:
        class_name = line.split('<div class=')[1].split('>')[0].strip()
        div_write = True
    # the section ends: flush what we collected
    if '</div>' in line and div_write:
        odata.append(wdata)
        wdata = []
        div_write = False

    # a link line inside the current section
    if div_write and '<a href' in line:
        href = line.split('href=')[1].split(' title')[0].strip()
        title = line.split('title=')[1].split('>')[0].strip()
        wdata.append(class_name + ' ' + href + ' ' + title)

with open("out.dat", "w") as msg:
    for wdata in odata:
        msg.write("\n".join(wdata)+"\n\n")

With this you save a file in which you keep track of both the information and the section name.

Output:

"p-list-sec" "link1" "tltle1"
"p-list-sec" "link2" "tltle2"
"p-list-sec" "link3" "tltle3"

"p-list-sec" "link1" "tltle1"
"p-list-sec" "link2" "tltle2"
"p-list-sec" "link3" "tltle3"

"p-list-sec" "link1" "tltle1"
"p-list-sec" "link2" "tltle2"
"p-list-sec" "link3" "tltle3"
Synthase
  • I have done it with simple double looping. I don't know much about HTML. – ABD Jan 12 '21 at 16:23
  • Check my edit. You get it in one line of code ;) – Synthase Jan 12 '21 at 16:24
  • I guess it's not fair to downvote a correct answer, so I'll retract the downvote. But parsing it manually (with splits and/or regex) loses the whole point of using a dedicated parser like BeautifulSoup. And also [don't parse HTML with regex](https://stackoverflow.com/a/1732454/4727702) – Yevhen Kuzmovych Jan 12 '21 at 16:25
  • I will try that too. Thanks :) – ABD Jan 12 '21 at 16:25
  • @Yevhen you are totally right. However, and not for general purposes, it is often easier/faster to scrape using plain Python. Libraries are often great, but you need to get into the documentation before being at ease with them. Sometimes it is necessary, sometimes not, and you get opportunities to improve your general programming. Thanks for retracting the downvote though! – Synthase Jan 12 '21 at 16:27

Would this work?

...

for p in all_p:
    for link in p.find_all('a'):
        print(link['href'])
        print(link.text) # or link['title']
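
Since you also want to store the results in a file, here is a minimal sketch building on the same loop (the filename links.txt is just an example):

with open('links.txt', 'w', encoding='utf-8') as f:
    for p in all_p:
        for link in p.find_all('a'):
            # write one "href title" pair per line
            f.write('{} {}\n'.format(link['href'], link['title']))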
Yevhen Kuzmovych

I was able to do it like this:

for p in all_p:
    for b in p.find_all('a'):
        fullLink = str(b.get('href'))
        title = str(b.get('title'))
        href = 'link = {}, title = {}\n'.format(fullLink, title)
        print(href)

It works fine for me. Thanks

ABD