Efficient regex parsing of html

Question

I have a piece of Python code scraping data point values from what seems to be a JavaScript graph on a web page. The data looks like:

...html/javascript...
{'y':765000,...,'x':1248040800000,...},
{'y':1020000,...,'x':1279144800000,...},
{'y':1105000,...,'x':1312754400000,...}
...html/javascript...

where the dots are plotting data I skipped.

To scrape the useful information (the x/y coordinates of the data points), I used regex:

# first, get the raw x matches (raw strings keep \d from being treated as a string escape)
xData = re.findall(r"'x':\d+", htmlContent)
# then read the number out of each match, one by one
xData = [int(re.findall(r"\d+", x)[0]) for x in xData]

Same for the y values. I don't know if this is terribly inefficient, but it does not look pretty or very smart, as I have many redundant calls to re.findall. Is there a way to do it in one pass, i.e. one pass for x and one pass for y?

You should use an [HTML Parser](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). [Don't use regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — styvane, Jul 18 '16 at 15:14
Try [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — m_callens, Jul 18 '16 at 15:15
@SSDMS: I have been using `BeautifulSoup` but this part of the page seems to be more easily done through `regex` — Learning is a mess, Jul 18 '16 at 15:16
Use an HTML parser to fetch just the JavaScript portion you care about, and then use a grouping regular expression to fetch both the x and y values from that particular data structure in a single pass. You don't provide enough information to provide anything more specific. — Feneric, Jul 18 '16 at 15:21
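A minimal sketch of the grouping idea from the last comment, using only the sample lines from the question. It assumes the skipped parts (the "...") contain no other 'x' or 'y' keys followed by digits, and it leaves out the HTML-parser step because the question does not show the surrounding markup. The names html_content, pairs, x_data and y_data are illustrative.

```python
import re

# Sample copied from the question; "..." stands for content the asker skipped.
html_content = """
{'y':765000,...,'x':1248040800000,...},
{'y':1020000,...,'x':1279144800000,...},
{'y':1105000,...,'x':1312754400000,...}
"""

# Two capturing groups: the key ('x' or 'y') and its digits.
# With more than one group, re.findall returns a list of tuples.
pairs = re.findall(r"'([xy])':(\d+)", html_content)

# Split the (key, value) tuples into the two coordinate lists.
x_data = [int(value) for key, value in pairs if key == "x"]
y_data = [int(value) for key, value in pairs if key == "y"]
```

This scans the text once and does not depend on whether 'y' comes before 'x' inside each record.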

score 1 · Accepted Answer · edited Jul 18 '16 at 16:50

You can do it a little more simply by putting a capturing group around the digits:

import re

htmlContent = """
...html/javascript...
{'y':765000,...,'x':1248040800000,...},
{'y':1020000,...,'x':1279144800000,...},
{'y':1105000,...,'x':1312754400000,...}
...html/javascript...
"""
# Get the numbers: findall returns only the text captured by (\d+)
xData = [int(x) for x in re.findall(r"'x':(\d+)", htmlContent)]
print(xData)

answered Jul 18 '16 at 15:19 by Ohumeronen · edited Jul 18 '16 at 16:50 by styvane

Reasons for the downvote? The author wants a solution without BeautifulSoup but with regex instead. – Ohumeronen Jul 18 '16 at 15:30
@pawelty: Thank you, you made my day ;-) – Ohumeronen Jul 18 '16 at 15:40
Thank you, this is what I wanted: extracting all the numbers, stripped of the 'x' part in one pass. – Learning is a mess Jul 18 '16 at 15:53
Glad I could help. By using parentheses (a capturing group) within the regex, you basically just say that you want to keep only what is between those parentheses. – Ohumeronen Jul 18 '16 at 15:58
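To illustrate the point about grouping, here is a small comparison of re.findall with and without a capturing group. The sample string is a simplified, hypothetical record with the skipped fields left out.

```python
import re

# Simplified, hypothetical record (no skipped fields).
sample = "{'y':765000,'x':1248040800000}"

# No group: findall returns each whole match, 'x': prefix included.
whole = re.findall(r"'x':\d+", sample)

# One group: findall returns only the text captured inside the parentheses.
captured = re.findall(r"'x':(\d+)", sample)
```

Here whole is ["'x':1248040800000"] while captured is ["1248040800000"], which is why the second re.findall call in the question becomes unnecessary.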
