
I have pulled data from an API and am looping through everything, finding key: value pairs that contain a URL. I am creating a separate list of the URLs; what I need to do is follow each link, grab the contents of that page (it will just be a paragraph of text), pull it back into the array/list, and of course loop through the remaining URLs. Do I need to use Selenium or BS4, and how do I loop through and pull the page contents into my array/list?

The JSON looks like this:

{
    "merchandiseData": [
        {
            "clientID": 3003,
            "name": "Yasir Carter",
            "phone": "(758) 564-5345",
            "email": "leo.vivamus@pedenec.net",
            "address": "P.O. Box 881, 2723 Elementum, St.",
            "postalZip": "DX2I 2LD",
            "numberrange": 10,
            "name1": "Harlan Mccarty",
            "constant": ".com",
            "text": "deserunt",
            "url": "https://www.deserunt.com",
            "text": "https://www."
        }
    ]
}
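
One thing to be aware of: the sample object has the key "text" twice. Python's json module accepts duplicate keys but keeps only the last occurrence, so "deserunt" is silently dropped - which matches the text column in the full frame output further down. A minimal demonstration:

import json

# Python's json parser keeps only the last occurrence of a duplicate key
sample = '{"text": "deserunt", "text": "https://www."}'
print(json.loads(sample))  # {'text': 'https://www.'}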

Code thus far:

import requests
import json
import pandas as pd
import sqlalchemy as sq
import time
from datetime import datetime, timedelta
from flatten_json import flatten

# read file
with open('_files/TestFile2.json', 'r') as f:
    file_contents = json.load(f)

allThis = []
for x in file_contents['merchandiseData']:
    holdAllThis = {
        'client_id': x['clientID'],
        'client_description_link': x['url']
    }
    allThis.append(holdAllThis)
    print(holdAllThis['client_id'], holdAllThis['client_description_link'])

print(allThis)
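
For the follow-the-links step itself, here is a minimal sketch with requests and BeautifulSoup, assuming the pages are static HTML (no JavaScript rendering) and the text you want sits in <p> tags; the 'page_text' key is a made-up name for illustration:

import requests
from bs4 import BeautifulSoup

for item in allThis:
    try:
        r = requests.get(item['client_description_link'], timeout=10)
        r.raise_for_status()
        soup = BeautifulSoup(r.text, 'html.parser')
        # pull the paragraph text back into the same list of dicts;
        # 'page_text' is a hypothetical key name
        item['page_text'] = ' '.join(p.get_text(strip=True) for p in soup.find_all('p'))
    except requests.exceptions.RequestException as e:
        item['page_text'] = 'Error: {}'.format(e)

print(allThis)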
  • `allThis` list has client id and url? Is this the sample JSON? If not, how many JSON arrays do you have in `_files/TestFile2.json`? Do you just want to take each of the URLs and extract some text from the UI? – cruisepandey Aug 22 '21 at 05:02
  • If the URLs are different, you should write different code to scrape each URL. Whether to use bs4 or selenium depends on the webpage - *if the page is loaded by JavaScript, use selenium; else bs4 will do.* – Ram Aug 22 '21 at 08:21
  • Hi @cruisepandey. I have attached a sample of the JSON file here: {https://github.com/webdevr712/python_follow_links.git}. When the script above runs, I get the list of clientID and url, and yes, the file is JSON. I want to then take that list, run a Python script against it to crawl each of the URLs, and pull the content back into an array. – user1176783 Aug 22 '21 at 14:47
  • Hi @Ram - not sure I follow. There will be around 10K of these URLs, so I can't have separate code for each. The script would need to loop through them. – user1176783 Aug 22 '21 at 14:48
  • From your JSON data, I see that the URLs are different. How do you plan to scrape all those URLs? BTW, what data are you trying to pull from those URLs? – Ram Aug 22 '21 at 14:57

1 Answer


Maybe using the JSON posted at https://github.com/webdevr712/python_follow_links and pandas:

import pandas as pd
import requests

# function mostly borrowed from https://stackoverflow.com/a/24519419/9192284
def site_response(link):
    try:
        r = requests.get(link, headers=headers)

        # Consider any status other than 2xx an error
        if not r.status_code // 100 == 2:
            return "Error: {}".format(r)

        return r.reason
    except requests.exceptions.RequestException as e:
        # A serious problem happened, like an SSLError or InvalidURL
        return "Error: {}".format(e)

# set headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3"
}

# url for downloading the json file
url = 'https://raw.githubusercontent.com/webdevr712/python_follow_links/main/merchData.json'

# get the json into a dataframe
df = pd.read_json(url)
df = pd.DataFrame(df['merchandiseData'].values.tolist())

# new column to store the response from running the site_response() function for each string in the 'url' column
df['site_response'] = df.apply(lambda x: site_response(x['url']), axis=1)

# print('OK responses:')
# print(df[df['site_response'].str.contains('OK')])

# output
print('\n\nAll responses:')
print(df[['url', 'site_response']])

Output:

All responses:

    url                         site_response
0   https://www.deserunt.com    Error: HTTPSConnectionPool(host='www.deserunt....
1   https://www.aliquip.com     Error: HTTPSConnectionPool(host='www.aliquip.c...
2   https://www.sed.net         Error: <Response [406]>
3   https://www.ad.net          OK
4   https://www.Excepteur.edu   Error: HTTPSConnectionPool(host='www.excepteur...

Full frame output:


  clientID               name           phone  \
0      3003       Yasir Carter  (758) 564-5345   
1      3103  Elaine Mccullough  1-265-168-1287   
2      3203      Vanna Elliott  (113) 485-7272   
3      3303    Adrienne Holden  1-146-431-3745   
4      3403         Freya Vang  (858) 195-4886   

                                       email  \
0                    leo.vivamus@pedenec.net   
1                sodales@enimcondimentum.net   
2                             elit.a@dui.org   
3  lacus.quisque@magnapraesentinterdum.co.uk   
4                  diam.dictum@velmauris.net   

                             address    postalZip  numberrange  \
0  P.O. Box 881, 2723 Elementum, St.     DX2I 2LD           10   
1                      7529 Dui. St.  24768-76452            9   
2           Ap #368-6127 Lacinia Av.         6200            5   
3           Ap #522-3209 Euismod St.        66746            3   
4          P.O. Box 159, 416 Dui Ave       158425            4   

             name1 constant          text                        url  \
0   Harlan Mccarty     .com  https://www.   https://www.deserunt.com   
1  Kaseem Petersen     .com  https://www.    https://www.aliquip.com   
2  Kennan Holloway     .net  https://www.        https://www.sed.net   
3  Octavia Lambert     .net  https://www.         https://www.ad.net   
4    Kitra Maynard     .edu  https://www.  https://www.Excepteur.edu   

                                       site_response  
0  Error: HTTPSConnectionPool(host='www.deserunt....  
1  Error: HTTPSConnectionPool(host='www.aliquip.c...  
2                            Error: <Response [406]>  
3                                                 OK  
4  Error: HTTPSConnectionPool(host='www.excepteur...  

From there you can move on to scraping each site that returns 'OK' and use Selenium (if required - you could check with another function) or BS4 etc.
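
A minimal sketch of that next step, reusing requests and headers from the code above and assuming the reachable pages are static HTML with the text in <p> tags ('page_text' is an illustrative column name):

from bs4 import BeautifulSoup

def page_text(link):
    # fetch the page and pull back the paragraph text,
    # mirroring the error handling in site_response()
    try:
        r = requests.get(link, headers=headers, timeout=10)
        if not r.status_code // 100 == 2:
            return "Error: {}".format(r)
        soup = BeautifulSoup(r.text, 'html.parser')
        return ' '.join(p.get_text(strip=True) for p in soup.find_all('p'))
    except requests.exceptions.RequestException as e:
        return "Error: {}".format(e)

# only scrape the rows that responded 'OK'
ok = df['site_response'] == 'OK'
df.loc[ok, 'page_text'] = df.loc[ok, 'url'].apply(page_text)
print(df.loc[ok, ['url', 'page_text']])

If a page turns out to need JavaScript rendering, that is where a Selenium fallback would slot in, as the comments above suggest.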

– MDR