Why are these RegExes not scraping whole words/strings?

Question

I am trying to use the Google Regex Scraper extension to scrape some items from the Yelp! website. I want to use this one regex to match both of the US street addresses below without parsing. Sorry for the previous confusion.

6805 Vista Del Mar Ln

1320 E 200 S

\<span\sitemprop\=\"streetAddress\"\>\"?(\d{1,5}\s[NEWS]?\s?\w*\s\w*\s?\w*?\s?\w*?\"?)\<?b?r?\>?\"?\w+?\s?\w+?\"?\<\/span\>

Help anyone?

I recommend not using regex to parse HTML; use an actual parser instead. Regexes like these are not easy to get right and are always difficult to understand. — Bert Peters, Oct 04 '15 at 16:29
Your regex suggests you are parsing HTML with this, but your sample string doesn't have HTML. What language are you running this in, and have you looked at parsers? — chris85, Oct 05 '15 at 03:23
[you cannot parse html with regex](http://stackoverflow.com/a/1732454/4342498) — NathanOliver, Oct 06 '15 at 19:46

score 0 · Accepted Answer · answered Oct 05 '15 at 06:19

Your "question" is lacking a lot of information, but from what I gather you want to read the address inside a <span> tag, with optional " around it, followed by an optional <br> and then something you're not interested in. Your RE seems overly complex, unless there is some syntax checking involved (not mentioned in the question either). How about this:

<span\b.*?>"?(\d{1,5}.*?)"?(?:<br>|<\/span>)

It keeps the only obvious syntax check you have, namely that a street number of 1 to 5 digits is present, but except for that it grabs everything up to either a <br> or a </span>, excluding surrounding quotes. Your test for North, East... doesn't really do anything. And all the other "chopping up" of the RE goes beyond my understanding.
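A quick way to compare the two patterns is to run both over some sample markup. The sketch below uses Python's re module purely for illustration (the question never says which engine the extension uses). The markup is a guess modelled on the question's regex, not copied from Yelp, "Some City" is a placeholder, and the helper name capture is made up.

```python
import re

# The question's pattern and the answer's pattern, copied as Python raw strings.
ORIGINAL = re.compile(
    r'\<span\sitemprop\=\"streetAddress\"\>\"?'
    r'(\d{1,5}\s[NEWS]?\s?\w*\s\w*\s?\w*?\s?\w*?\"?)'
    r'\<?b?r?\>?\"?\w+?\s?\w+?\"?\<\/span\>'
)
SIMPLIFIED = re.compile(r'<span\b.*?>"?(\d{1,5}.*?)"?(?:<br>|<\/span>)')

# Hypothetical markup (assumption): one address alone, one followed by <br>.
samples = [
    '<span itemprop="streetAddress">6805 Vista Del Mar Ln</span>',
    '<span itemprop="streetAddress">1320 E 200 S<br>Some City</span>',
]


def capture(pattern, html):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = pattern.search(html)
    return match.group(1) if match else None


for html in samples:
    print("original:  ", repr(capture(ORIGINAL, html)))
    print("simplified:", repr(capture(SIMPLIFIED, html)))
```

On this sample markup the simplified pattern captures each street line in full. The original pattern requires `\w+?\s?\w+?` to match something between the capture group and `</span>`, so when no `<br>` and second line follow, those words have to come out of the street address itself. That is one plausible reason for the truncated captures.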

But, as the comments say, use an HTML parser to extract the text you want to interpret.
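For comparison, here is a minimal sketch of the parser route using only Python's standard html.parser module. It assumes the address sits in a <span itemprop="streetAddress"> element, as the question's regex implies, and that only the text before the first <br> is wanted. The class and function names are hypothetical, and nested spans inside the address are not handled.

```python
from html.parser import HTMLParser


class StreetAddressParser(HTMLParser):
    """Collects the text of each streetAddress span up to <br> or </span>."""

    def __init__(self):
        super().__init__()
        self.addresses = []
        self._capturing = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("itemprop") == "streetAddress":
            # Start collecting text for a new address.
            self._capturing = True
            self._parts = []
        elif tag == "br" and self._capturing:
            # A line break ends the street line; ignore what follows.
            self._finish()

    def handle_endtag(self, tag):
        if tag == "span" and self._capturing:
            self._finish()

    def handle_data(self, data):
        if self._capturing:
            self._parts.append(data)

    def _finish(self):
        # Drop surrounding whitespace and optional double quotes.
        text = "".join(self._parts).strip().strip('"')
        if text:
            self.addresses.append(text)
        self._capturing = False


def extract_streets(html):
    parser = StreetAddressParser()
    parser.feed(html)
    parser.close()
    return parser.addresses


# Hypothetical page fragment (assumption), not copied from Yelp.
page = (
    '<div><span itemprop="streetAddress">6805 Vista Del Mar Ln</span></div>'
    '<div><span itemprop="streetAddress">"1320 E 200 S"<br>Some City</span></div>'
)
print(extract_streets(page))
```

The parser deals with attribute order, extra attributes and whitespace inside the tag, so the only address-specific logic left is deciding where the street line ends.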

Anyway, gave it a try ;)

Regards
