
I am doing web scraping and have done this so far:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://abcdefgh.in')
print(page.status_code)
soup = BeautifulSoup(page.content, 'html.parser')
all_p = soup.find_all(class_="p-list-sec")
print(all_p)

After doing this, I get something like this when I print all_p:

<div class="p-list-sec">
<ul>
  <li><a href="link1" title="title1">title1</a></li>
  <li><a href="link2" title="title2">title2</a></li>
  <li><a href="link3" title="title3">title3</a></li>
</ul>
</div>

<div class="p-list-sec">
<ul>
  <li><a href="link1" title="title1">title1</a></li>
  <li><a href="link2" title="title2">title2</a></li>
  <li><a href="link3" title="title3">title3</a></li>
</ul>
</div>

<div class="p-list-sec">
<ul>
  <li><a href="link1" title="title1">title1</a></li>
  <li><a href="link2" title="title2">title2</a></li>
  <li><a href="link3" title="title3">title3</a></li>
</ul>
</div>

and so on, up to around 40 such divs.

Now I want to extract all the a href and title values inside the p-list-sec class and store them in a file. I know how to store them in a file, but extracting all the href and title values from all of the p-list-sec divs is what is creating the issue for me. I am using Python 3.9 with the requests and BeautifulSoup libraries on Windows 10, from the command prompt.

Thanks, akhi

ABD
4 Answers



Just in case you want to avoid looping twice, you can also use a BeautifulSoup CSS selector and chain the class and the <a> tag. So take your soup and select like this:

soup.select('.p-list-sec a')

To shape the information the way you want to process it, you can use a single for loop or a list comprehension, all in one line:

[{'url':link['href'], 'title':link['title']} for link in soup.select('.p-list-sec a')]

Output

[{'url': 'link1', 'title': 'title1'},
 {'url': 'link2', 'title': 'title2'},
 {'url': 'link3', 'title': 'title3'},
 {'url': 'link1', 'title': 'title1'},
 {'url': 'link2', 'title': 'title2'},
 {'url': 'link3', 'title': 'title3'},
 {'url': 'link1', 'title': 'title1'},
 {'url': 'link2', 'title': 'title2'},
 {'url': 'link3', 'title': 'title3'}]

To store it in a CSV, feel free to push it into pandas or the csv module.

Pandas:

import pandas as pd

pd.DataFrame([{'url':link['href'], 'title':link['title']} for link in soup.select('.p-list-sec a')]).to_csv('url.csv', index=False)
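to_csv takes care of quoting for you, so titles that happen to contain commas won't break the file.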

CSV:

import csv
data_list = [{'url':link['href'], 'title':link['title']} for link in soup.select('.p-list-sec a')]

keys = data_list[0].keys()

with open('url.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(data_list)
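
Note: passing newline='' to open() keeps the csv module from writing extra blank lines between rows on Windows.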
HedgeHog

In case you don't care about the div name, here is a one-liner (assuming the HTML shown above has been saved to data.html):

import re

with open("data.html", "r") as msg:
    data = msg.readlines()

data = [tuple(re.sub(r'.*href="(.+)" title="(.+)">.*', r'\1 \2', v).split()) for v in [v.strip() for v in data if "href" in v]]

Output:

[('link1', 'title1'), ('link2', 'title2'), ('link3', 'title3'), ('link1', 'title1'), ('link2', 'title2'), ('link3', 'title3'), ('link1', 'title1'), ('link2', 'title2'), ('link3', 'title3')]

Otherwise:

with open("data.html", "r") as msg:
    data = msg.readlines()

div_write = False

wdata = []
odata = []

for line in data:
    # a new section starts: remember its class name
    if '<div class=' in line:
        class_name = line.split('<div class=')[1].split('>')[0].strip()
        div_write = True
    # the section ends: flush what we collected
    if '</div>' in line and div_write:
        odata.append(wdata)
        wdata = []
        div_write = False

    # a link line inside the current section
    if div_write and '<a href' in line:
        href = line.split('href=')[1].split(' title')[0].strip()
        title = line.split('title=')[1].split('>')[0].strip()
        wdata.append(class_name + ' ' + href + ' ' + title)

with open("out.dat", "w") as msg:
    for wdata in odata:
        msg.write("\n".join(wdata)+"\n\n")

With this you save a file in which you keep track of both the information and the section name.

Output:

"p-list-sec" "link1" "tltle1"
"p-list-sec" "link2" "tltle2"
"p-list-sec" "link3" "tltle3"

"p-list-sec" "link1" "tltle1"
"p-list-sec" "link2" "tltle2"
"p-list-sec" "link3" "tltle3"

"p-list-sec" "link1" "tltle1"
"p-list-sec" "link2" "tltle2"
"p-list-sec" "link3" "tltle3"
Synthase
  • I have done it with simple double looping. I don't know much about HTML. – ABD Jan 12 '21 at 16:23
  • Check my edit. You get it in one line of code ;) – Synthase Jan 12 '21 at 16:24
  • I guess it's not fair to downvote a correct answer, so I'll retract the downvote. But parsing it manually (with splits and/or regex) loses the whole point of using a dedicated parser like BeautifulSoup. And also [don't parse HTML with regex](https://stackoverflow.com/a/1732454/4727702) – Yevhen Kuzmovych Jan 12 '21 at 16:25
  • I will try that too. Thanks :) – ABD Jan 12 '21 at 16:25
  • @Yevhen you are totally right. However, and not for general purposes, it is often easier/faster to scrape using plain Python. Libraries are often great, but you need to get into the documentation before being at ease with them. Sometimes it is necessary, sometimes not, and you get opportunities to improve your general programming. Thanks for retracting the downvote though! – Synthase Jan 12 '21 at 16:27

Would this work?

...

for p in all_p:
    for link in p.find_all('a'):
        print(link['href'])
        print(link.text) # or link['title']
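
Since you also want to store the results in a file, here is a minimal sketch building on the same loop (the filename links.txt is just an example):

with open('links.txt', 'w', encoding='utf-8') as f:
    for p in all_p:
        for link in p.find_all('a'):
            # write one "href title" pair per line
            f.write('{} {}\n'.format(link['href'], link['title']))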
Yevhen Kuzmovych

I was able to do it like this:

for p in all_p:
    for b in p.find_all('a'):
        fullLink = str(b.get('href'))
        title = str(b.get('title'))
        href = 'link = {}, title = {}\n'.format(fullLink, title)
        print(href)

It works fine for me. Thanks

ABD