
I am an absolute beginner to web scraping using Python and know very little about programming in Python. I am just trying to extract the information of the lawyers in the Tennessee location. On the webpage there are multiple links, within which there are further links, and within those are the various lawyers.

Could you kindly tell me the steps I should follow?

I have gotten as far as extracting the links on the first page, but I only need the links of the cities, whereas I have got all the links with href tags. Now how can I iterate over them and proceed further?

from bs4 import BeautifulSoup as bs
import pandas as pd
import requests

res = requests.get('https://attorneys.superlawyers.com/tennessee/', headers = {'User-agent': 'Super Bot 9000'})
soup = bs(res.content, 'lxml')

links = [item['href'] for item in soup.select('a')]
print(links)

It is printing
['https://www.superlawyers.com', 'https://attorneys.superlawyers.com', 'https://ask.superlawyers.com', 'https://video.superlawyers.com', ....]

All the links are extracted, whereas I only need the links of the cities. Kindly help.
ag2019

3 Answers


Without regex:

# the div with class three_browse_columns wraps the city list
cities = soup.find('div', class_="three_browse_columns")
for city in cities.find_all('a'):
    print(city['href'])
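This assumes the soup object already built in the question's code; from there, find narrows the search to the city-list div before collecting the anchors.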
Jack Fleeting
  • Now I want to extract the links within each of the city links. I have written the following code, but it displays an error: `res = requests.get('https://attorneys.superlawyers.com/tennessee/', headers = {'User-agent': 'Super Bot 9000'}) soup = bs(res.content, 'lxml') links = [item['href'] for item in soup.find_all('a',href=re.compile('https://attorneys.superlawyers.com/tennessee/'))] for l1 in links: links2 = l1.find('div', class_="three_browse_columns") for l2 in links2.find_all('a'): lk1=l2['href'] print(lk1)` Error: `find() takes no keyword arguments` – ag2019 Jun 06 '19 at 20:07
  • @ag2019 - But you are using regex, it seems. Not sure why, but it's a different approach. – Jack Fleeting Jun 06 '19 at 20:14
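The error in the comment above happens because links is a list of plain strings, so l1.find(...) calls str.find(), which takes no keyword arguments. A minimal sketch of the fix, assuming the page structure discussed in this thread: fetch each city URL first, then parse the response before searching it.

import re
import requests
from bs4 import BeautifulSoup as bs

res = requests.get('https://attorneys.superlawyers.com/tennessee/', headers={'User-agent': 'Super Bot 9000'})
soup = bs(res.content, 'lxml')

# these are href strings, not Tag objects
links = [item['href'] for item in soup.find_all('a', href=re.compile('https://attorneys.superlawyers.com/tennessee/'))]

for l1 in links:
    # fetch and parse the city page before calling .find() on it
    res2 = requests.get(l1, headers={'User-agent': 'Super Bot 9000'})
    soup2 = bs(res2.content, 'lxml')
    links2 = soup2.find('div', class_="three_browse_columns")
    if links2:  # guard: not every page may contain this div
        for l2 in links2.find_all('a'):
            print(l2['href'])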

Use the regular expression module re and match the href value of the city links.

from bs4 import BeautifulSoup as bs
import re
import requests

res = requests.get('https://attorneys.superlawyers.com/tennessee/', headers = {'User-agent': 'Super Bot 9000'})
soup = bs(res.content, 'lxml')

links = [item['href'] for item in soup.find_all('a',href=re.compile('https://attorneys.superlawyers.com/tennessee/'))]
print(links)

Output:

['https://attorneys.superlawyers.com/tennessee/alamo/', 'https://attorneys.superlawyers.com/tennessee/bartlett/', 'https://attorneys.superlawyers.com/tennessee/brentwood/', 'https://attorneys.superlawyers.com/tennessee/bristol/', 'https://attorneys.superlawyers.com/tennessee/chattanooga/', 'https://attorneys.superlawyers.com/tennessee/clarksville/', 'https://attorneys.superlawyers.com/tennessee/cleveland/', 'https://attorneys.superlawyers.com/tennessee/clinton/', 'https://attorneys.superlawyers.com/tennessee/columbia/', 'https://attorneys.superlawyers.com/tennessee/cookeville/', 'https://attorneys.superlawyers.com/tennessee/cordova/', 'https://attorneys.superlawyers.com/tennessee/covington/', 'https://attorneys.superlawyers.com/tennessee/dayton/', 'https://attorneys.superlawyers.com/tennessee/dickson/', 'https://attorneys.superlawyers.com/tennessee/dyersburg/', 'https://attorneys.superlawyers.com/tennessee/elizabethton/', 'https://attorneys.superlawyers.com/tennessee/franklin/', 'https://attorneys.superlawyers.com/tennessee/gallatin/', 'https://attorneys.superlawyers.com/tennessee/germantown/', 'https://attorneys.superlawyers.com/tennessee/goodlettsville/', 'https://attorneys.superlawyers.com/tennessee/greeneville/', 'https://attorneys.superlawyers.com/tennessee/henderson/', 'https://attorneys.superlawyers.com/tennessee/hendersonville/', 'https://attorneys.superlawyers.com/tennessee/hixson/', 'https://attorneys.superlawyers.com/tennessee/huntingdon/', 'https://attorneys.superlawyers.com/tennessee/huntsville/', 'https://attorneys.superlawyers.com/tennessee/jacksboro/', 'https://attorneys.superlawyers.com/tennessee/jackson/', 'https://attorneys.superlawyers.com/tennessee/jasper/', 'https://attorneys.superlawyers.com/tennessee/johnson-city/', 'https://attorneys.superlawyers.com/tennessee/kingsport/', 'https://attorneys.superlawyers.com/tennessee/knoxville/', 'https://attorneys.superlawyers.com/tennessee/la-follette/', 'https://attorneys.superlawyers.com/tennessee/lafayette/', 'https://attorneys.superlawyers.com/tennessee/lafollette/', 'https://attorneys.superlawyers.com/tennessee/lawrenceburg/', 'https://attorneys.superlawyers.com/tennessee/lebanon/', 'https://attorneys.superlawyers.com/tennessee/lenoir-city/', 'https://attorneys.superlawyers.com/tennessee/lewisburg/', 'https://attorneys.superlawyers.com/tennessee/lexington/', 'https://attorneys.superlawyers.com/tennessee/madisonville/', 'https://attorneys.superlawyers.com/tennessee/manchester/', 'https://attorneys.superlawyers.com/tennessee/maryville/', 'https://attorneys.superlawyers.com/tennessee/memphis/', 'https://attorneys.superlawyers.com/tennessee/millington/', 'https://attorneys.superlawyers.com/tennessee/morristown/', 'https://attorneys.superlawyers.com/tennessee/murfreesboro/', 'https://attorneys.superlawyers.com/tennessee/nashville/', 'https://attorneys.superlawyers.com/tennessee/paris/', 'https://attorneys.superlawyers.com/tennessee/pleasant-view/', 'https://attorneys.superlawyers.com/tennessee/pulaski/', 'https://attorneys.superlawyers.com/tennessee/rogersville/', 'https://attorneys.superlawyers.com/tennessee/sevierville/', 'https://attorneys.superlawyers.com/tennessee/sewanee/', 'https://attorneys.superlawyers.com/tennessee/shelbyville/', 'https://attorneys.superlawyers.com/tennessee/somerville/', 'https://attorneys.superlawyers.com/tennessee/spring-hill/', 'https://attorneys.superlawyers.com/tennessee/springfield/', 'https://attorneys.superlawyers.com/tennessee/tullahoma/', 
'https://attorneys.superlawyers.com/tennessee/white-house/', 'https://attorneys.superlawyers.com/tennessee/winchester/', 'https://attorneys.superlawyers.com/tennessee/woodlawn/']

If you want to use a CSS selector, use the code below.

from bs4 import BeautifulSoup as bs
import requests
res = requests.get('https://attorneys.superlawyers.com/tennessee/', headers = {'User-agent': 'Super Bot 9000'})
soup = bs(res.content, 'lxml')
links = [item['href'] for item in soup.select('a[href^="https://attorneys.superlawyers.com/tennessee"]')]
print(links)

Output:

Same list of city links as the regex version above.
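A note on the selector: a[href^="https://attorneys.superlawyers.com/tennessee"] is a CSS attribute selector, and ^= means "attribute value starts with", so only anchors whose href begins with the Tennessee prefix are kept. Related variants supported by BeautifulSoup's select, shown for illustration (the /nashville/ suffix is just a hypothetical example):

links = [item['href'] for item in soup.select('a[href*="/tennessee/"]')]   # href contains the substring
links = [item['href'] for item in soup.select('a[href$="/nashville/"]')]   # href ends with the substring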
KunduK

Faster would be to use a parent id and then select the a tags within it:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://attorneys.superlawyers.com/tennessee/')
soup = bs(r.content, 'lxml')
cities = [item['href'] for item in soup.select('#browse_view a')]
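A usage note: #browse_view is an id selector, so only anchors inside that one container are scanned, which is why this is faster than filtering every a tag on the page. The resulting cities is a list of plain href strings, so to go a level deeper you have to request each URL and parse the response, as the comments below discuss.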
QHarr
  • Could you just explain the significance of '#' before browse_view? – ag2019 Jun 07 '19 at 07:07
  • Now within each of these links there are more links; how can they be accessed? For example, within the Alamo region there are three categories of lawyers available. I want those links, and within those three categories are the lawyers' details which I want to fetch. What can be done? If possible suggest. – ag2019 Jun 07 '19 at 07:23
  • I have tried this `import requests from bs4 import BeautifulSoup as bs res = requests.get('https://attorneys.superlawyers.com/tennessee/', headers = {'User-agent': 'Super Bot 9000'}) soup = bs(res.content, 'lxml') cities = [item['href'] for item in soup.select('#browse_view a')] for c1 in cities: categories = [item['href'] for item in c1.select('three_browse_columns a')] print(categories)` But it gives an error `categories = [item['href'] for item in c1.select('three_browse_columns a')] AttributeError: 'str' object has no attribute 'select'`. What can be done? If possible suggest. – ag2019 Jun 07 '19 at 07:29
  • The # is a css id selector https://developer.mozilla.org/en-US/docs/Web/CSS/ID_selectors – QHarr Jun 07 '19 at 09:41
  • cities is a list of hyperlinks to those cities, i.e. strings. So, c1 is a string, not a tag. – QHarr Jun 07 '19 at 09:43
  • And for Alamo 3 types: soup.select('.three_browse_columns:nth-of-type(2) li') – QHarr Jun 07 '19 at 09:48
  • Okay, thanks! I have made this change in my code `cities = [item['href'] for item in soup.select('#browse_view a')] for c in cities: r=requests.get(c) s1=bs(r.content,'lxml') categories = [item['href'] for item in s1.select('.three_browse_columns:nth-of-type(2) a')] print(categories)` But this only prints the category links of one city; the other links in the cities list are not iterated. That is, the loop seems to iterate only once and only returns the links of the last city, woodlawn. What can be done? If possible suggest. – ag2019 Jun 08 '19 at 10:46
  • As I am a beginner, just asking: did you use the CSS selector nth-of-type(2) to identify the blue highlight that bounds the link region? Where can I see it when I click the Inspect option? If possible suggest. – ag2019 Jun 08 '19 at 10:56
  • Right click inspect, select any html in right hand side. Then press Ctrl + F to bring up search box (for html) and enter the css selector there. See help here: https://stackoverflow.com/a/56345695/6241235 – QHarr Jun 08 '19 at 11:43
  • I can't read code in comments. Please consider opening a new question and drop me a link to it here. – QHarr Jun 08 '19 at 11:54
  • Okay, thanks for explaining the CSS selector. I have posted a new question with my query. If possible see it here: [link](https://stackoverflow.com/questions/56506892/fetching-lawyer-details-from-multiple-links-using-bs4-in-python). – ag2019 Jun 08 '19 at 13:34
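Pulling the thread's suggestions together, a minimal sketch of the full crawl; the .three_browse_columns:nth-of-type(2) a selector comes from QHarr's comments and assumes the 2019 page layout, and keeping print inside the loop body is what makes every city, not just the last one, get printed:

import requests
from bs4 import BeautifulSoup as bs

headers = {'User-agent': 'Super Bot 9000'}

res = requests.get('https://attorneys.superlawyers.com/tennessee/', headers=headers)
soup = bs(res.content, 'lxml')

# id selector: only anchors inside the #browse_view container
cities = [item['href'] for item in soup.select('#browse_view a')]

for c in cities:
    r = requests.get(c, headers=headers)
    s1 = bs(r.content, 'lxml')
    # the second three_browse_columns block holds the practice-area categories
    categories = [item['href'] for item in s1.select('.three_browse_columns:nth-of-type(2) a')]
    print(c, categories)  # inside the loop, so every city is printed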