Making XML parser faster

Question

I am currently writing a script that parses an XML webpage using BeautifulSoup. An example of the XML file is here. The script outputs the first product URL (taken from each 'loc' tag) that matches every keyword in an inputted list. Currently, the script's control flow is the following:

pass the URL into a soup object and beautify it

run a for loop for each url tag, and put each loc text into a list (inventory_url)

```
for item in soup.find_all('url'):
    inventory_url.append(item.find('loc').text)
```

iterate through the list, and output the first element that matches all keywords, where 'keywords' is the inputted list of keywords
```
    for item in inventory_url:
        if all(kw in item for kw in keywords):
            return item
```

I am wondering if there is a way to make the parsing faster. I have looked at SoupStrainer, but when I restrict it to 'loc' tags only, it also picks up 'image:loc' tags, which I do not need.

Thank you very much.

Did you use [lxml](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser) to parse the xml ? — Maurice Meyer, Jan 05 '17 at 20:09
Are you sure that parsing is the bottleneck? If you are getting the webpages from the internet, I suspect that takes more than factor 1000 longer than parsing the page. — Jonas Schäfer, Jan 05 '17 at 20:10
use `re` ? could be faster and likely powerful enough.. (note, I'm not too familiar with webscraping or beautifulsoup so I'm not sure if you'd run into slowdowns converting to a text stream to run your regex on) — Aaron, Jan 05 '17 at 20:11

score 0 · Answer 1 · answered Jan 05 '17 at 20:29

If you can stream the file as simple text, I presume regex would be pretty fast...

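```
import re

# Lazily match each <url> block and capture the text of its <loc> tag
pattern = re.compile(r'<url>[\S\s]*?<loc>([\S\s]*?)</loc>[\S\s]*?</url>')

for match in pattern.finditer(file.read()):  # `file` is an already-open file object
    url = match.group(1)
    # do stuff with url
```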

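`[\S\s]*?` is a lazy way to match literally anything, newlines included, until we hit what comes next. The `?` is crucial: without it the quantifier is greedy and runs on to the last possible `<loc>` or `</url>` in the file instead of stopping at the nearest one.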
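To make the lazy matching concrete, here is a minimal sketch that applies the same pattern to a small in-memory string and then runs the keyword check from the question. The sample markup, the example.com URLs, and the names `SAMPLE`, `LOC_IN_URL` and `first_match` are made up for illustration; they are not from the asker's file.

```python
import re

# Hypothetical sample shaped like a sitemap entry; the URLs are invented.
SAMPLE = """<urlset>
<url>
  <loc>https://example.com/products/red-wool-hat</loc>
  <image:image>
    <image:loc>https://example.com/img/red-wool-hat.jpg</image:loc>
  </image:image>
</url>
<url>
  <loc>https://example.com/products/blue-cotton-scarf</loc>
</url>
</urlset>"""

# Same pattern as above: the literal "<loc>" cannot match "<image:loc>",
# and each lazy "*?" stops at the nearest following tag.
LOC_IN_URL = re.compile(r'<url>[\S\s]*?<loc>([\S\s]*?)</loc>[\S\s]*?</url>')


def first_match(xml_text, keywords):
    """Return the first <loc> text that contains every keyword, else None."""
    for match in LOC_IN_URL.finditer(xml_text):
        loc = match.group(1).strip()
        if all(kw in loc for kw in keywords):
            return loc
    return None


print(first_match(SAMPLE, ["blue", "scarf"]))
```

Because the function returns as soon as one entry matches, it never builds the intermediate `inventory_url` list. Whether that is noticeably faster than BeautifulSoup on the real file would need to be measured.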

score -1 · Answer 2 · edited May 23 '17 at 12:08

Have you tried a different parser? https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use

Also see tips in: Speeding up beautifulsoup
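As one illustration of what a different parser can look like, here is a sketch using the standard library's `xml.etree.ElementTree` rather than one of the BeautifulSoup parsers from the link above. This is a swapped-in technique, not something this answer names. It assumes the file declares the usual sitemap namespace (`http://www.sitemaps.org/schemas/sitemap/0.9`) and an image namespace on the `image:` prefix; the sample data and the names `SITEMAP`, `NS` and `first_matching_loc` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap; the namespaces are assumed, the URLs are invented.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.com/products/red-wool-hat</loc>
    <image:image>
      <image:loc>https://example.com/img/red-wool-hat.jpg</image:loc>
    </image:image>
  </url>
  <url>
    <loc>https://example.com/products/blue-cotton-scarf</loc>
  </url>
</urlset>"""

# Prefix-to-namespace map used in the path below ("sm" is an arbitrary label).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def first_matching_loc(xml_text, keywords):
    """Return the first sitemap <loc> containing every keyword, else None."""
    root = ET.fromstring(xml_text)
    # Only <loc> elements in the sitemap namespace that sit directly under
    # <url> are visited, so <image:loc> is never considered.
    for loc in root.iterfind("sm:url/sm:loc", NS):
        text = (loc.text or "").strip()
        if all(kw in text for kw in keywords):
            return text
    return None


print(first_matching_loc(SITEMAP, ["blue", "scarf"]))
```

Selecting by namespace sidesteps the 'image:loc' problem from the question, since the two tags only share a local name. Any speed difference against BeautifulSoup is untested here.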

answered Jan 05 '17 at 20:11 by meinaart · edited May 23 '17 at 12:08 by Community
