
I have a script that loads a page and saves a bunch of data-ids from multiple containers. I then want to open new urls, appending those data-ids onto the end of each url. For each url I want to locate all the hrefs, compare them against a list of specific links, and if any of them match, save that link and a few other details to a table.

I have managed to get it to open the url with the appended data-id, but when I try to search for elements on the new page, it either pulls them from the first url that was parsed (if I try to findAll from soup again), or I constantly get this error when I try to run another html.parser:

ResultSet object has no attribute 'findAll'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

Is it not possible to run another parser, or am I just doing something wrong?

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup as soup
from selenium.webdriver.common.action_chains import ActionChains
import time

url = "http://csgo.exchange/id/76561197999004010#x"

driver = webdriver.Firefox()

driver.get(url)
time.sleep(15)  # placeholder wait for the page to finish loading
html = driver.page_source
soup = soup(html, "html.parser")  # rebinds the name 'soup' from the class to this parsed page


containers = soup.findAll("div",{"class":"vItem"})

print(len(containers))
data_ids = [] # Make a list to hold the data-id's

for container in containers:
    test = container.attrs["data-id"]
    data_ids.append(test) # add data-id's to the list
    print(str(test))

for id in data_ids:
    url2 = "http://csgo.exchange/item/" + id
    driver.get(url2)
    time.sleep(2)
    soup2 = soup(html, "html.parser")  # 'soup' is now the first page's parse tree; calling it is find_all shorthand and returns a ResultSet, and it reuses the stale 'html' instead of driver.page_source
    containers2 = soup2.findAll("div",{"class":"bar"})  # ResultSet has no findAll -> the error above
    print(str(containers2))

with open('scraped.txt', 'w', encoding="utf-8") as file:
    for id in data_ids:
        file.write(str(id)+'\n') # write every data-id to a new line
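
The error likely traces to soup = soup(html, "html.parser"): after that line, soup names a parsed page rather than the BeautifulSoup class, and calling a parsed page is BeautifulSoup's shorthand for find_all, which returns a ResultSet; calling findAll on that ResultSet then raises the message above. The loop also re-parses the stale html captured from the first page instead of the new page source. A minimal sketch of the corrected loop, keeping the class under an unshadowed name and reusing the selectors from above:

from bs4 import BeautifulSoup  # unshadowed class name
import time

for id in data_ids:
    driver.get("http://csgo.exchange/item/" + id)
    time.sleep(2)
    page = BeautifulSoup(driver.page_source, "html.parser")  # fresh source for each item page
    containers2 = page.find_all("div", {"class": "bar"})
    print(containers2)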
CodeOrDie
  • The first thing I notice is that the page source for that first URL (http://csgo.exchange/id/76561197999004010#x) doesn't have any divs with a class of vItem. How are you getting any results the first time around? For your question, an example ID or two might be helpful, because then we could go to the URL and view the page source. – C. Peck Mar 10 '19 at 07:37
  • There are about 885 divs with a class of vItem. I'm not having any problems getting the ids. Nor did the person who previously helped me with my last issue. But here are some examples. 15653916980 15653916960 15631554103 – CodeOrDie Mar 10 '19 at 07:40
  • I'm assuming when you went to load the page it didn't fully load. Sometimes the page can hang and other times it opens right up. I plan on making it wait until the element is there before it proceeds in the future, but I don't know how to do that as of right now, so the 15-second sleep is a placeholder. – CodeOrDie Mar 10 '19 at 07:42
  • What I'm really trying to pull is all the hrefs in the flow history on each page, like this: http://csgo.exchange/item/15653916980 Then I want to compare each one of those to a list of links to see if any of them match. – CodeOrDie Mar 10 '19 at 07:47

3 Answers


I'm not sure exactly what you want from each page, but you should add waits. Below, I add waits looking for hrefs in the flow-history section of each page (if present); it should illustrate the idea.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'http://csgo.exchange/id/76561197999004010'
driver = webdriver.Chrome()
driver.get(url)
# Wait up to 10s for elements carrying a data-id, then collect those ids.
ids = [item.get_attribute('data-id') for item in WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[data-id]")))]
results = []
baseURL = 'http://csgo.exchange/item/'
for id in ids:
    url = baseURL + id
    driver.get(url)
    try:
        # Wait up to 10s for hrefs inside the flow-history tab, then collect them.
        flowHistory = [item.get_attribute('href') for item in WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#tab-history-flow [href]")))]
        results.append([id, flowHistory])
    except:
        print(url)  # the wait timed out; this page had no flow-history links within 10s
QHarr
  • Dang. Okay so this is a whole different approach without even using BS4. It's working quite well, other than when it comes to an item that doesn't have any hrefs: it pauses for an extended period of time but does eventually continue. I'm assuming that has to do with the try method, though I don't know why it would hang for so long. Needless to say, I want to eventually filter out a lot of those junk items such as cases, stickers, and medals. – CodeOrDie Mar 10 '19 at 09:34
  • It is the time I give it to wait. You could reduce that from 10 to a lower number. This line: [item.get_attribute('href') for item in WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#tab-history-flow [href]")))] – QHarr Mar 10 '19 at 09:37
  • As far as what I want from each page: it would be the item name that's located in the div class "bar", the link to that item (which is the link it visited in the first place), and then all the hrefs in the flow history, which it's already getting. And then eventually I want to compare those hrefs to a list of links and only save the ones that matched. But I do appreciate the help. I wouldn't have figured this out on my own; I was starting to bang my head against the wall. – CodeOrDie Mar 10 '19 at 09:38
  • Oh I gotcha. It goes quick when it loads but hangs if it's not there. Makes sense. Yeah, I'll just drop that to like 3 seconds. – CodeOrDie Mar 10 '19 at 09:40
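
A minimal sketch of that adjustment: drop the wait to 3 seconds and catch selenium's TimeoutException explicitly, so items with no flow-history links fail fast. The names follow the answer's code above, and the 3-second figure comes from the comment thread:

from selenium.common.exceptions import TimeoutException

for id in ids:
    url = baseURL + id
    driver.get(url)
    try:
        # Give up after 3s instead of 10s on pages with no flow-history links.
        flowHistory = [item.get_attribute('href') for item in WebDriverWait(driver,3).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#tab-history-flow [href]")))]
        results.append([id, flowHistory])
    except TimeoutException:
        print(url)  # nothing appeared within the wait; log the page and move on

The shorter wait trades robustness on slow-loading pages for speed on the junk items mentioned above.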
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'http://csgo.exchange/id/76561197999004010'
profile = webdriver.FirefoxProfile()
profile.set_preference("permissions.default.image", 2) # Block all images to load websites faster.
driver = webdriver.Firefox(firefox_profile=profile)
driver.get(url)
ids = [item.get_attribute('data-id') for item in WebDriverWait(driver,30).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[data-id]")))]
results = []
baseURL = 'http://csgo.exchange/item/'
for id in ids:
    url = baseURL + id
    driver.get(url)
    try:
        pros = ['http://csgo.exchange/profiles/76561198149324950']
        flowHistory = [item.get_attribute('href') for item in WebDriverWait(driver,3).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#tab-history-flow [href]")))]
        if flowHistory in pros:  # BUG: checks whether the entire list is an element of pros, which is never true; test individual hrefs instead
            results.append([url,flowHistory])
            print(results)
    except:
        print()
CodeOrDie
  • As far as checking if the flowHistory hrefs match a list item, am I doing this right? Because the first item it checks should be printing http://csgo.exchange/profiles/76561198149324950, but I'm not getting anything in return. – CodeOrDie Mar 10 '19 at 10:06
  • Okay I got a little further by using if any(elem in flowHistory for elem in pros): but it appends all the hrefs for that item and not just the ones that matched. So I need to figure out how to just get the matched ones to append and I'm good. – CodeOrDie Mar 10 '19 at 10:36
  • Using for string in pros: if string in flowHistory: match = string, it returns the first matched href but it doesn't return more if there are more. I keep getting closer but still not there. – CodeOrDie Mar 10 '19 at 11:09
  • I've been doing chores. Need anything further? – QHarr Mar 10 '19 at 12:20
  • Sorry I was up all night so I've been sleeping. I'll post the code I have now but like I said it's only returning one of the matched results and not all of them. – CodeOrDie Mar 10 '19 at 19:47
  • So, can you give me one url and the expected match count please? – QHarr Mar 10 '19 at 19:54
  • Both of the urls that are located in the list pros should match with the hrefs that are pulled from the first item it opens. As for the expected match count, that's hard to say, because eventually I want to compare it to a list of like 100 different urls or more and they could all match. But in the case right now the expected match count would be 2. http://csgo.exchange/profiles/76561198149324950 http://csgo.exchange/profiles/76561198152970370 – CodeOrDie Mar 10 '19 at 20:00
  • Sorry, I mean: for one test url, what are the expected urls to return (before matching)? It's so I can debug against expected values. – QHarr Mar 10 '19 at 20:02
  • Those are the expected links to be returned from which url? This one? http://csgo.exchange/profiles/76561198149324950 – QHarr Mar 10 '19 at 20:09
  • Thanks. I will have a look. – QHarr Mar 10 '19 at 20:12
  • I make it 151 urls to be returned not your 188. – QHarr Mar 10 '19 at 20:16
  • Hmm I don't know why it would be different. – CodeOrDie Mar 10 '19 at 20:22
  • That is what my code returns if you completely flatten the nested lists. https://pastebin.com/A8wejFxz – QHarr Mar 10 '19 at 20:23
  • If you put my css selector in the find window of dev tools you will see it matches 151. You can cycle through each match. It highlights each icon. – QHarr Mar 10 '19 at 20:23
  • Well it should be returning 151 because that's exactly how many items are in the flow history for that item. – CodeOrDie Mar 10 '19 at 20:29
  • run the pastebin code where I unflatten the list my code returns. If you don't unflatten the len will be 1. – QHarr Mar 10 '19 at 20:30
  • Yeah I'm running it now and it printed 151. But I counted out all the links I got and it was 151 as well so where did you get 188 from? – CodeOrDie Mar 10 '19 at 20:36
  • the list you gave me – QHarr Mar 10 '19 at 20:37
  • I manually counted all of them on justpaste and got 151 plus the first url which was the item itself so 152. But anyways that's not really important. – CodeOrDie Mar 10 '19 at 20:40
  • So are you saying the list has to be unflattened first? I'd still need a method to return all the matches. The method I have now only looks for the first string that matches. – CodeOrDie Mar 10 '19 at 20:41
  • you can use any (https://stackoverflow.com/a/55079255/6241235) with a generator whilst looping the full results list and compare against your pre-determined list. I would probably also use set first to ensure no duplicates in the full result list, then flip back to a list. And yes, you can do the unflattening when adding during your original loop, or re-write to use a loop rather than a list comprehension. – QHarr Mar 10 '19 at 20:43
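
A minimal sketch of that suggestion, assuming results holds [id, flowHistory] pairs as built in the answer above and pros is the pre-determined list:

# Flatten the nested [id, [hrefs]] pairs, dedupe via a set, then flip back to a list.
all_hrefs = list({href for _, hrefs in results for href in hrefs})

# any() with a generator answers whether anything matched at all...
has_match = any(href in pros for href in all_hrefs)
# ...while a comprehension keeps only the hrefs that actually matched.
matched = [href for href in all_hrefs if href in pros]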
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

urls = ['http://csgo.exchange/id/76561197999004010']
profile = webdriver.FirefoxProfile()
profile.set_preference("permissions.default.image", 2) # Block all images to load websites faster.
driver = webdriver.Firefox(firefox_profile=profile)
for url in urls:
    driver.get(url)
ids = [item.get_attribute('data-id') for item in WebDriverWait(driver,30).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[data-id]")))]
results = []
pros = ['http://csgo.exchange/profiles/76561198149324950', 'http://csgo.exchange/profiles/76561198152970370']
baseURL = 'http://csgo.exchange/item/'
for id in ids:
    url = baseURL + id
    driver.get(url)
    try:
        flowHistory = [item.get_attribute('href') for item in WebDriverWait(driver,2).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#tab-history-flow [href]")))]
        match = []
        for string in pros:
            if string in flowHistory:
                match = string
                break  # stops at the first match, so any later matches in pros are missed

        if match:
            results.append([url,match])
            print(results)
    except:
        print()
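
As the comments above note, the break keeps only the first matching profile. A minimal sketch of collecting every match instead, reusing the pros and flowHistory names from the loop above (it replaces the for/if block inside the try):

matches = [p for p in pros if p in flowHistory]  # every pre-listed profile seen in this item's flow history
if matches:
    results.append([url, matches])
    print(results)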
CodeOrDie