1

I'm trying to parse some HTML content. Here's the HTML:

<font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p>
<font color="green"> *TITLE* </font> Some: Event Name 1:30pm-5:00pm <font color="gold">Stream 4</font><p>
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
<font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16</font><p>
<font color="green"> *TITLE* </font> Event THREE summary 2.45pm-4.45pm <font color="gold">Stream 2</font><p>
<font color="green"> *TITLE* </font> Event with a lot of summary 4:00pm-6:00pm <font color="gold">CHANNEL THREE 3 STREAM http://domain.com/path/to/page-3-online.html</font><p>

To parse this and get the "Event Name", "Event Time" and "Stream Number", I'm doing this:

preg_match_all('/<\/font>\s*([^<]+)\s+(\d+.\d+\s*\w{2}\s*-\s*\d+.\d+\s*\w{2}).*?tream\s*(.*?)\s*<\/font><p>/', $data, $matches);

It returns everything correctly, except that for the streams with an http link, part of the link is returned too, which I don't want. I just want the stream number, or for some streams the name and number, without the link.

Data Needed:

5
4
CHANNEL TWO 2 STREAM
16
2
CHANNEL THREE 3 STREAM

Currently it returns:

5
4
-online.html
16
2
-online.html

Can anyone please help? I'm not a pro at regex and have been trying for the last 2 days. Thanks in advance!

BalusC
D-M
  • I would suggest first extracting the element text as one step and then removing what you don't want. Even better would be to use something like this to actually parse the HTML: http://simplehtmldom.sourceforge.net/ (it's made for that you know...) In either case, once you have the text content of the tag, you can then just regex away the part you don't want. – Brad Peabody Aug 25 '13 at 23:29
  • **Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php or [this SO thread](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. – Andy Lester Aug 26 '13 at 01:04
  • @bgp I know that, but I find it easier to parse and extract the content I need directly with regex, as the other methods are a longer process. I do understand that if the content/structure of the layout changes my code will break, but I have no issue with updating it again. Anyway, thank you for commenting and suggesting that. Maybe I'll use it in my next project. – D-M Aug 26 '13 at 08:10
  • @Andy Lester The above one goes out to you too (SO was not allowing me to tag you in above comment). Thanks for your comment and suggestion as well. – D-M Aug 26 '13 at 08:14
  • Don't vandalize your question please. If you got the answer/solution, just mark it accepted. – BalusC Aug 26 '13 at 18:24

3 Answers

1

If you want to do it with a regex, then based on your data you need this:

preg_match_all('/(?:<\/font> )((?:[^0-9]+(?:[0-9](?!\.|:|[0-9]))?(?:[0-9]{2}(?!\.|:))?)*)([^<]+) <[^>]+>(?:Stream )?([^h<]+)/', $data, $matches);

This will put the names in $matches[1], the times in $matches[2] and the channels in $matches[3].


Explanation of the regex:

  1. (?:<\/font> ) matches (without capturing) a closing font tag followed by a space
  2. ((?:[^0-9]+(?:[0-9](?!\.|:|[0-9]))?(?:[0-9]{2}(?!\.|:))?)*) captures non-digit text plus any one- or two-digit number that is not followed by a dot or colon (checked with a negative lookahead, so the digits that start a time are left out); this repeats as needed and is captured as one group
  3. ([^<]+) captures everything up to the next "<", except the trailing space
  4. <[^>]+> skips the opening tag: everything up to and including the next ">"
  5. (?:Stream )? skips the word "Stream " if it comes first
  6. ([^h<]+) captures everything up to either a lower-case "h" or a "<"
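As a rough sketch of how the steps above play out, the snippet below runs the same pattern over three rows of the sample data. The variable names ($html, $names, $times, $channels) are only illustrative, and the trim step is an addition, since groups 1 and 3 can end with a space.

```php
<?php
// Three rows copied from the sample data in the question.
$html = <<<'HTML'
<font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p>
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
<font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16</font><p>
HTML;

// The pattern from this answer, unchanged.
$pattern = '/(?:<\/font> )((?:[^0-9]+(?:[0-9](?!\.|:|[0-9]))?(?:[0-9]{2}(?!\.|:))?)*)([^<]+) <[^>]+>(?:Stream )?([^h<]+)/';
preg_match_all($pattern, $html, $matches);

// Groups 1 and 3 can carry a trailing space, so trim each column.
$names    = array_map('trim', $matches[1]);
$times    = array_map('trim', $matches[2]);
$channels = array_map('trim', $matches[3]);

print_r($channels);
```

Note that step 6 stops at any lower-case "h", so a channel name containing one would be cut short; that is fine for the sample data but worth keeping in mind for real input.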
vollie
  • Thanks for the solution, It leaves out few of the event names and shows up in place of TIME. Denomales solution seems to be working good. But i appreciate your help :) – D-M Aug 26 '13 at 12:48
  • That's odd, it worked just fine for me (see http://ideone.com/WdYasf). Perhaps something outside the above text broke it. Anyway, you're welcome, and glad you found a solution. – vollie Aug 26 '13 at 17:49
  • Well your solution works mostly, however it leaves out some of the events and adds to the TIME list. (The original content i mean, the above is just a sample data). And the other solution seems to work with sample data but not with the real data, maybe you might have a solution? check out my [comment](http://stackoverflow.com/questions/18434683/php-regex-how-to-ignore-http-link-in-string-and-return-everything-else/18435819?noredirect=1#comment27111665_18435819) – D-M Aug 26 '13 at 17:56
  • you'd need to provide the real data somehow (link to uploaded text / html maybe?), can't determine the issue otherwise – vollie Aug 26 '13 at 17:58
  • Never mind, I finally got it working! I had to use the `addslashes` function and it worked. Thanks again for your time and help! – D-M Aug 26 '13 at 18:18
0

Description

This expression will:

  • find all the font tags which have color="gold"
  • skip over the word Stream if it is the first word
  • capture the interesting text
  • stop capturing when it reaches an http:// link

<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?gold['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>(?:Stream\s*)?\K(?:(?!\s*https?:|<\/font>).)*


Examples

Live Demo hover over the blue blocks to see why they were matched

Sample Text

<font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p>
<font color="green"> *TITLE* </font> Some: Event Name 1:30pm-5:00pm <font color="gold">Stream 4</font><p>
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
<font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16</font><p>
<font color="green"> *TITLE* </font> Event THREE summary 2.45pm-4.45pm <font color="gold">Stream 2</font><p>
<font color="green"> *TITLE* </font> Event with a lot of summary 4:00pm-6:00pm <font color="gold">CHANNEL THREE 3 STREAM http://domain.com/path/to/page-3-online.html</font><p>

Matches

[0] => 5
[1] => 4
[2] => CHANNEL TWO 2 STREAM
[3] => 16
[4] => 2
[5] => CHANNEL THREE 3 STREAM
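
A minimal PHP sketch of using this expression, assuming the `/` delimiter and the `i` flag (the answer does not state which flags the demo used). The pattern sits in a nowdoc so its quotes and backslashes need no extra PHP escaping, and `\K` drops the tag from the reported match, so the wanted text ends up in $matches[0]. The names $html and $pattern are illustrative.

```php
<?php
// Two rows copied from the sample text above.
$html = <<<'HTML'
<font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p>
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
HTML;

// The expression from this answer, kept literal in a nowdoc.
$pattern = <<<'REGEX'
/<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?gold['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>(?:Stream\s*)?\K(?:(?!\s*https?:|<\/font>).)*/i
REGEX;

preg_match_all($pattern, $html, $matches);

// Everything before \K is excluded from the match, so index 0 holds the stream text.
print_r($matches[0]);
```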
animuson
Ro Yo Mi
0

Description

This expression will:

  • Capture the title
  • Capture the Event Name
  • Capture the event time
  • find all the font tags which have color=gold
  • skip over the word Stream if it exists
  • capture the interesting text
  • stop capturing when it reaches an http:// link
  • Trim pesky white space from around the matches
  • Overall the expression allows the font tag attributes to appear anywhere inside the font tag. And the expression will avoid some really difficult edge cases

<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?green['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>\s*(?:Stream\s*)?((?:(?!<\/font>).)*)<\/font>\s*[^<]*?([^<]+)\s+(\d+.\d+\s*\w{2}\s*-\s*\d+.\d+\s*\w{2})[^<]*?<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?gold['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>(?:Stream\s*)?((?:(?!\s*https?:|<\/font>).)*)

Examples

Live Demo

Sample Text

Group 0 gets the entire match
Group 1 gets the title
Group 2 gets the event name
Group 3 gets the event time
Group 4 gets the stream number

<font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p>
<font color="green"> *TITLE* </font> Some: Event Name 1:30pm-5:00pm <font color="gold">Stream 4</font><p>
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
<font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16</font><p>
<font color="green"> *TITLE* </font> Event THREE summary 2.45pm-4.45pm <font color="gold">Stream 2</font><p>
<font color="green"> *TITLE* </font> Event with a lot of summary 4:00pm-6:00pm <font color="gold">CHANNEL THREE 3 STREAM http://domain.com/path/to/page-3-online.html</font><p>

PHP Code Example

<?php
$sourcestring="your source string";
// A nowdoc keeps the pattern literal, so its quotes and backslashes need no extra PHP escaping.
$pattern = <<<'REGEX'
/<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?green['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>\s*(?:Stream\s*)?((?:(?!<\/font>).)*)<\/font>\s*[^<]*?([^<]+)\s+(\d+.\d+\s*\w{2}\s*-\s*\d+.\d+\s*\w{2})[^<]*?<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?gold['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>(?:Stream\s*)?((?:(?!\s*https?:|<\/font>).)*)/imsx
REGEX;
// PREG_SET_ORDER groups the results per match, as in the listing below.
preg_match_all($pattern,$sourcestring,$matches,PREG_SET_ORDER);
echo "<pre>".print_r($matches,true);
?>

Matches

[0][0] = <font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5
[0][1] = *TITLE* 
[0][2] = Some Event Name
[0][3] = 1:15pm-5:00pm
[0][4] = 5

[1][0] = <font color="green"> *TITLE* </font> Some: Event Name 1:30pm-5:00pm <font color="gold">Stream 4
[1][1] = *TITLE* 
[1][2] = Some: Event Name
[1][3] = 1:30pm-5:00pm
[1][4] = 4

[2][0] = <font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM
[2][1] = *TITLE* 
[2][2] = Some, Event Name 1 with num
[2][3] = 1:30pm-7:30pm
[2][4] = CHANNEL TWO 2 STREAM

[3][0] = <font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16
[3][1] = *TITLE* 
[3][2] = Event two
[3][3] = 2.45pm-4.45pm
[3][4] = 16

[4][0] = <font color="green"> *TITLE* </font> Event THREE summary 2.45pm-4.45pm <font color="gold">Stream 2
[4][1] = *TITLE* 
[4][2] = Event THREE summary
[4][3] = 2.45pm-4.45pm
[4][4] = 2

[5][0] = <font color="green"> *TITLE* </font> Event with a lot of summary 4:00pm-6:00pm <font color="gold">CHANNEL THREE 3 STREAM
[5][1] = *TITLE* 
[5][2] = Event with a lot of summary
[5][3] = 4:00pm-6:00pm
[5][4] = CHANNEL THREE 3 STREAM
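
As a usage sketch, the matches can be turned into one trimmed record per event. This assumes PREG_SET_ORDER is passed to preg_match_all so that each row holds groups 1 to 4 of one match. The array keys (title, name, time, stream) and the variable names are illustrative, not part of the original answer.

```php
<?php
// Two rows copied from the sample text above.
$html = <<<'HTML'
<font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p>
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
HTML;

// The expression from this answer, kept literal in a nowdoc.
$pattern = <<<'REGEX'
/<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?green['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>\s*(?:Stream\s*)?((?:(?!<\/font>).)*)<\/font>\s*[^<]*?([^<]+)\s+(\d+.\d+\s*\w{2}\s*-\s*\d+.\d+\s*\w{2})[^<]*?<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?gold['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>(?:Stream\s*)?((?:(?!\s*https?:|<\/font>).)*)/imsx
REGEX;

preg_match_all($pattern, $html, $rows, PREG_SET_ORDER);

// Build one associative record per event, trimming stray whitespace.
$events = [];
foreach ($rows as $row) {
    $events[] = [
        'title'  => trim($row[1]),
        'name'   => trim($row[2]),
        'time'   => trim($row[3]),
        'stream' => trim($row[4]),
    ];
}

print_r($events);
```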
animuson
Ro Yo Mi
  • Wow, This is perfect. Thank you! But i think this has to be escaped as I'm unable to use it in PHP. Had to add to a text file and then fetch from that. How can i escape this online? any idea? – D-M Aug 26 '13 at 12:11
  • @D-M I added a php code example to this solution, have a look. – Ro Yo Mi Aug 26 '13 at 12:14
  • Ok that error is gone now. (Removed some extra slashes and it gone) but it doesn't seem to work. :| Trying here.. – D-M Aug 26 '13 at 12:56
  • Nevermind, It works now finally! Had to use `addslashes` function and it did the work. Thanks again :) – D-M Aug 26 '13 at 18:17
  • Is there a way to group the results date wise? – D-M Sep 06 '13 at 12:02
  • I meant, On the page from which i get this data also has date and then the schedule, right now i get all 3 days schedule in one list and just the time, so basically the data seems to be incomplete due to no date (It's like after the time is over there is another event at that same time but that is of next day's event). – D-M Sep 06 '13 at 12:05
  • I'm not sure I understand what you're asking. It might be best to open a new question and include the details there. – Ro Yo Mi Sep 06 '13 at 13:56
  • Okay thanks, I've just posted a new question. If possible, please check it out [here](http://stackoverflow.com/q/18664311/2716387) – D-M Sep 06 '13 at 18:33