
background:

I am learning about web scraping and decided to use Python and Beautiful Soup. This program asks the user for a link and then narrows down their HTML search within the webpage.

problem:

When I ask the user to define their own extension for the soup object (e.g. `.div.div.a`), append it to the whole string, and try to execute it inside a print call, it always returns None. How would I go about running the extension built from the user input and printing the result? For this example, I am scraping a Newegg search for graphics cards.

example link: https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20cards

Keep in mind that in the code below I had already used findAll for div class="item-info", so the extension would be searched within that block.

I have already tried exec() on the string, but this does not seem to work:

isdone = ""
while isdone != "done":
    try:
        route = "container"
        userinput = input("what extensions would you like to search for?\n separate each token with a space \n ex: div div img[\"title\"]\n: ")
        inputRoute = userinput.split(' ')
        for i in range(len(inputRoute)):
            route += "." + inputRoute[i]        
        print("---\n"+route+"\n---")
        print("Current Route ^\n---")
        print("output:\n", exec(route),"\n---")  # actual result if the user had input 'a'
        print(container.a)  # what I actually want to output (if the user only input 'a')
        # add the ability to add extensions, ex: container.div.a.img["foo"] - ignore this, stackoverflow
        isdone = input("are you happy with these extensions? \n type 'done' when happy\n or enter to change extension\n: ") 
    except Exception as e:
        print(e)
        input("Make sure there are no leftover spaces\npress enter to continue")

`#` marks my comments throughout the output. THIS IS CONSOLE OUTPUT:

'what extensions would you like to search for?
 separate each token with a space
 ex: div div img["title"]
: a                 #  <--what I put in the input
---
container.a          #  <-- the assembled route
---
Current Route ^
---
output:              
 None               #  <-- what actually outputs when I use exec()
---
<a class="item-brand" href="https://www.newegg.com/EVGA/BrandStore/ID-1402"> 
<img alt="EVGA" class="lazy-img" data-effect="fadeIn" data-src="//c1.neweggimages.com/Brandimage_70x28//Brand1402.gif" src="//c1.neweggimages.com/WebResource/Themes/2005/Nest/blank.gif" title="EVGA">
</img></a>
are you happy with these extensions?
 type 'done' when happy
 or enter to change extension
:'
Ki Durrer

1 Answer


If container is your BeautifulSoup object, then eval('container.a') will return the first <a> tag. Using eval or exec is probably not a good idea in your case, however; see Why should exec() and eval() be avoided?
I recommend using find_all and its attrs parameter instead, though parsing the input will probably turn out to be a good deal harder than you currently anticipate.
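A minimal sketch of that approach, assuming beautifulsoup4 is installed; the HTML snippet is a made-up stand-in for one of the Newegg "item-info" blocks:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for one "item-info" block from the search results page.
html = """
<div class="item-info">
  <a class="item-brand" href="https://example.com/brand">
    <img class="lazy-img" title="EVGA" src="blank.gif"/>
  </a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", attrs={"class": "item-info"})

# find_all with attrs narrows the search without needing eval/exec.
links = container.find_all("a", attrs={"class": "item-brand"})
print(links[0]["href"])       # https://example.com/brand
print(links[0].img["title"])  # EVGA
```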

VlB
  • @VlB I will try to use find. I was just hoping there was an easy solution for grabbing the data; for instance, page.div.div.img["title"] gives you only the title and none of the HTML tags that follow it, and I'm not sure if you can do that with find_all()/find(). I appreciate the response. I can see how using exec() could be bad, so I will use find_all() instead now, thanks. – Ki Durrer May 28 '19 at 01:31
  • You could look into using `element.get('')` and `element.get_attribute_list('')` on results of `find_all`. Works like Python's dictionary `get`. – VlB May 28 '19 at 10:59
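A small illustration of those two methods, assuming beautifulsoup4 is installed (the `<img>` tag here is made up for the example):

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<img class="lazy-img a" title="EVGA">', "html.parser").img

# .get() works like dict.get(): a missing attribute gives None, not a KeyError.
print(tag.get("title"))                 # EVGA
print(tag.get("missing"))               # None
# .get_attribute_list() always returns a list, handy for multi-valued
# attributes like class.
print(tag.get_attribute_list("class"))  # ['lazy-img', 'a']
```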