
I am currently writing a script that parses an XML web page using BeautifulSoup. An example of the XML file is here. The script outputs the first product URL (taken from each 'loc' tag) that matches every keyword in a user-supplied list. Currently, the script's control flow is as follows:

  • pass the URL into a soup object and beautify it
  • run a for loop for each url tag, and put each loc text into a list (inventory_url)

    for item in soup.find_all('url'):
        inventory_url.append(item.find('loc').text)
    
  • iterate through the list, and output the first element that matches all keywords, where 'keywords' is the inputted list of keywords

        for item in inventory_url:
            if all(kw in item for kw in keywords):
                return item
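
The two loops above can be folded into a single pass that stops at the first hit, so no intermediate list is needed. A minimal sketch, assuming the 'loc' strings are available as any iterable; the name `first_match` and the sample URLs are made up for illustration:

```python
def first_match(urls, keywords):
    """Return the first URL containing every keyword, or None if none does."""
    return next(
        (url for url in urls if all(kw in url for kw in keywords)),
        None,
    )


# Hypothetical sample data, not taken from the real sitemap
sample_urls = [
    "https://example.com/products/red-shirt",
    "https://example.com/products/blue-jacket-large",
    "https://example.com/products/blue-jacket-small",
]
print(first_match(sample_urls, ["blue", "jacket"]))
```

Passing a generator such as `(item.find('loc').text for item in soup.find_all('url'))` as `urls` ends the search at the first hit, although the document itself is still parsed in full beforehand.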
    

I am wondering if there is a way to make the parsing faster. I have looked at SoupStrainer, but when I restrict it to 'loc' tags, it also picks up 'image:loc' tags, which I do not need.
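
One possible way to tell the two kinds of 'loc' apart is to match on the namespace-qualified tag name instead of the bare local name. The sketch below swaps BeautifulSoup for the standard library's `xml.etree.ElementTree.iterparse`. It assumes the file follows the usual sitemap layout, with page URLs in the sitemaps.org namespace and 'image:loc' in a separate image namespace. The sample document and the name `iter_page_locs` are invented for illustration:

```python
import io
import xml.etree.ElementTree as ET

# Assumed namespace of the plain <loc> tags (standard sitemap protocol)
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
PAGE_LOC = f"{{{SITEMAP_NS}}}loc"
URL_TAG = f"{{{SITEMAP_NS}}}url"


def iter_page_locs(source):
    """Yield the text of each sitemap <loc>, skipping <image:loc>."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == PAGE_LOC:
            yield (elem.text or "").strip()
        elif elem.tag == URL_TAG:
            elem.clear()  # drop the contents of <url> entries already handled


# Hypothetical sample standing in for the real file
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.com/products/blue-jacket</loc>
    <image:image>
      <image:loc>https://example.com/img/blue-jacket.jpg</image:loc>
    </image:image>
  </url>
  <url>
    <loc>https://example.com/products/red-shirt</loc>
  </url>
</urlset>
"""

page_locs = list(iter_page_locs(io.BytesIO(SAMPLE)))
print(page_locs)
```

Because `iter_page_locs` is a generator, a first-match search over it can stop reading the input as soon as a URL matches.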

Thank you very much.

JC1
  • Did you use [lxml](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser) to parse the XML? – Maurice Meyer Jan 05 '17 at 20:09
  • Not yet. Thank you for your suggestion – JC1 Jan 05 '17 at 20:10
  • Are you sure that parsing is the bottleneck? If you are getting the web pages from the internet, I suspect that takes more than a factor of 1000 longer than parsing the page. – Jonas Schäfer Jan 05 '17 at 20:10
  • Use `re`? It could be faster and is likely powerful enough. (Note: I'm not too familiar with web scraping or BeautifulSoup, so I'm not sure if you'd run into slowdowns converting to a text stream to run your regex on.) – Aaron Jan 05 '17 at 20:11
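
To check which step actually dominates, each stage can be timed on its own. A rough sketch using `time.perf_counter`; the `timed` helper is hypothetical, and a synthetic in-memory document stands in for the real download:

```python
import time
import xml.etree.ElementTree as ET


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Synthetic stand-in for a downloaded sitemap (10,000 <url> entries)
xml_text = "<urlset>" + "".join(
    f"<url><loc>https://example.com/item-{i}</loc></url>" for i in range(10_000)
) + "</urlset>"

root, parse_seconds = timed(ET.fromstring, xml_text)
print(f"parsed {len(root)} entries in {parse_seconds:.4f} s")
```

Wrapping the download call in `timed` the same way gives a direct comparison between fetching and parsing.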

2 Answers


If you can stream the file as simple text, I presume regex would be pretty fast...

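import re

# Capture the first <loc> inside each <url>; <image:loc> does not match the literal <loc>
pattern = re.compile(r'<url>[\S\s]*?<loc>([\S\s]*?)</loc>[\S\s]*?</url>')

# 'sitemap.xml' is a placeholder path for the downloaded file
with open('sitemap.xml') as file:
    for match in pattern.finditer(file.read()):
        url = match.group(1)  # text of the <loc> tag; do stuff with it here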

The [\S\s]*? is a lazy way to match literally anything until we hit what comes next. The ? is crucial: without it the quantifier is greedy, and a single match would run through to the last </loc> and </url> in the file.
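
To see the difference the lazy quantifier makes, the same pattern can be compared with a greedy variant on a small sample. The XML string below is invented for illustration:

```python
import re

# Hypothetical two-entry sitemap fragment
xml_text = (
    "<url><loc>https://example.com/a</loc>"
    "<image:image><image:loc>https://example.com/a.jpg</image:loc></image:image></url>"
    "<url><loc>https://example.com/b</loc></url>"
)

lazy = re.compile(r'<url>[\S\s]*?<loc>([\S\s]*?)</loc>[\S\s]*?</url>')
greedy = re.compile(r'<url>[\S\s]*<loc>([\S\s]*)</loc>[\S\s]*</url>')

lazy_locs = lazy.findall(xml_text)
greedy_locs = greedy.findall(xml_text)

print(lazy_locs)    # one capture per <url> element
print(greedy_locs)  # a single match spanning the whole string
```

The lazy pattern yields one 'loc' per 'url' entry and never picks up 'image:loc', since the literal `<loc>` does not occur inside `<image:loc>`. The greedy variant collapses everything into one match and loses the first URL.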

Aaron

Have you tried a different parser? https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use

Also see tips in: Speeding up beautifulsoup

Community
meinaart