
I am trying to write a regular expression to extract the address, sale date, and sale price from this string:

<strong id="address">1245 DUPONT ST</strong><br>Toronto : Metro Toronto<br>14 Aug 2015&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$71,000,000&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<font color="#CC0000"></font>

Ideally, I would like to receive the information formatted on five separate rows like this:

1245 DUPONT ST  
Toronto  
Metro Toronto  
14 Aug 2015  
$71,000,000  

I suspect that the solution will involve a positive lookbehind, because the address can always be identified by id="address", but I can't seem to get it working. Any help would be greatly appreciated. Thanks.

DanielAttard

1 Answer


I don't agree with using a regex to parse XML or HTML, and would use an HTML parser instead.
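As a rough sketch of what the parser route could look like, here is one way to do it with Python's standard-library `html.parser`. The question doesn't name a language, so Python is an assumption here, and `TextCollector` and `extract_rows` are hypothetical names made up for this illustration. It collects the text between tags, then splits on the non-breaking spaces and on the " : " separator.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers the text found between tags."""

    def __init__(self):
        # convert_charrefs=True turns &nbsp; into the character '\xa0'
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_rows(html):
    """Return the non-empty text fields in document order."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered
    rows = []
    for chunk in parser.chunks:
        # Non-breaking spaces separate the date from the price
        for piece in chunk.split('\xa0'):
            # " : " separates the city from the region
            rows.extend(p.strip() for p in piece.split(' : ') if p.strip())
    return rows


html = (
    '<strong id="address">1245 DUPONT ST</strong><br>'
    'Toronto : Metro Toronto<br>'
    '14 Aug 2015' + '&nbsp;' * 6 + '$71,000,000' + '&nbsp;' * 6 +
    '<font color="#CC0000"></font>'
)

rows = extract_rows(html)
print("\n".join(rows))
```

This relies on the layout staying exactly as in the sample (text separated by `<br>`, " : " and `&nbsp;`), so treat it as a starting point, not a general solution.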

However, for your specific example, I can come up with this regex, which works on the PCRE engine:

id="address">(.*?)<|<br>(.*?) : (.*?)<br>|(?<=<br>)(.*?)&|(\$[^&]+)


Match information:

MATCH 1
1.  [21-35] `1245 DUPONT ST`
MATCH 2
2.  [48-55] `Toronto`
3.  [58-71] `Metro Toronto`
MATCH 3
4.  [75-86] `14 Aug 2015`
MATCH 4
5.  [122-133] `$71,000,000`
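To get the five separate rows asked for in the question, the non-empty capture groups of each match can be collected in order. The sketch below assumes Python's `re` module in place of PCRE (the pattern uses only a fixed-width lookbehind, which `re` also supports), and `extract_fields` is a hypothetical helper name.

```python
import re

# Same pattern as above; each alternative fills different capture groups
PATTERN = re.compile(
    r'id="address">(.*?)<|<br>(.*?) : (.*?)<br>|(?<=<br>)(.*?)&|(\$[^&]+)'
)


def extract_fields(text):
    """Collect the groups that took part in each match, in order."""
    fields = []
    for match in PATTERN.finditer(text):
        # Groups from the alternatives that did not match are None
        fields.extend(g for g in match.groups() if g is not None)
    return fields


html = (
    '<strong id="address">1245 DUPONT ST</strong><br>'
    'Toronto : Metro Toronto<br>'
    '14 Aug 2015' + '&nbsp;' * 6 + '$71,000,000' + '&nbsp;' * 6 +
    '<font color="#CC0000"></font>'
)

rows = extract_fields(html)
print("\n".join(rows))
# 1245 DUPONT ST
# Toronto
# Metro Toronto
# 14 Aug 2015
# $71,000,000
```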
Federico Piazza
  • Wow, that does indeed work. Amazing stuff @Fede. Why do you say that you disagree with using regex to parse XML? I don't understand anything about what a regex parser is. Perhaps I should read up on that. – DanielAttard Aug 21 '15 at 00:08
  • @DanielAttard you might want to see this question with its accepted answer: https://stackoverflow.com/a/1732454/4464702 He's also trying to regex-match in HTML. – randers Aug 21 '15 at 05:37
  • @DanielAttard you can use regex to parse XHTML **only** if you know what the character set is; in that case regex is not a bad solution. However, if you are working with XML, then XPath or XQuery is the right approach, and if you are working with HTML, an HTML parser is a good choice. Anyway, if you don't want to involve a new framework or library, then a simple regex might do the trick. – Federico Piazza Aug 21 '15 at 15:29