[php]how to extract a single simple text from a long html source

Question

I have HTML like this:

......whatever very long html.....

<span class="title">hello world!</span>

......whatever very long html......

It is a very long HTML document, and I only want the content 'hello world!' from it. I got the HTML with:

$result = file_get_contents($url , false, $context);

Many people use Simple HTML DOM Parser, but I think in this case a regex would be more efficient.

How should I do it? Any suggestions would be a great help.

Thanks in advance!

If you have problems constructing regular expressions, you can always fall back on the simpler explode() function (split() is deprecated as of PHP 5.3). – Waygood, Aug 07 '12 at 09:37
Parsing HTML with regex is generally considered a bad idea. Possible related question and... if the related answer is not **technical**, it can describe the _world implosion_ you might summon upon us by parsing HTML with RegEx: http://stackoverflow.com/a/1732454/1428773 — Whisperity, Aug 07 '12 at 09:46

Billy Moon · Accepted Answer · 2012-08-07T09:44:38.243

Stick with the DOM parser, it is better. Having said that, you could use a regex like this:

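// where the html is stored in `$html`
// preg_match() returns 1 only when the pattern matched, so guard the lookup
$whatYouWant = null;
if (preg_match('/<span class="title">(.+?)<\/span>/', $html, $m) === 1) {
    $whatYouWant = $m[1];
}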

preg_match() fills $m with the results: element 0 is the entire matched string, and each following element holds the text captured by a pair of parentheses in the regex. The regex is very simple in this case, being almost a direct string match for what you want, with the closing span tag's slash escaped. The captured part just means any character except a newline (.) one or more times (+) un-greedily (?), so the match stops at the first closing span tag.
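To make the un-greedy part concrete, here is a small sketch that compares the lazy `(.+?)` with a greedy `(.+)`. The input string is a made-up sample with two title spans, not the asker's page.

```php
<?php
// Hypothetical input with two title spans on one line.
$html = '<span class="title">first</span><span class="title">second</span>';

// Lazy (.+?): the capture stops at the nearest closing tag.
preg_match('/<span class="title">(.+?)<\/span>/', $html, $lazy);

// Greedy (.+): the capture runs on to the last closing tag on the line.
preg_match('/<span class="title">(.+)<\/span>/', $html, $greedy);

echo $lazy[0], PHP_EOL;   // whole match: <span class="title">first</span>
echo $lazy[1], PHP_EOL;   // first capture group: first
echo $greedy[1], PHP_EOL; // first</span><span class="title">second
```

The greedy version swallows both spans, which is why the lazy quantifier is the safer choice when the same tag can appear more than once.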

score 0 · Answer 2 · edited Aug 07 '12 at 09:48

No, I really don't think regEx or similar functions would be either more effective or easier.

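If you use Simple HTML DOM, you can quickly get the data you are looking for like this:

//Load the library (the path is an assumption; adjust it to your install)
include 'simple_html_dom.php';
//Get your file
$html = file_get_html('myfile.html');
//Use jQuery style selectors; index 0 returns the first match instead of an array
$spanValue = $html->find('span.title', 0)->plaintext;

echo($spanValue);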

With preg_match you could do it like this:

// the s modifier lets . match newlines, so the title may span several lines
preg_match('/<span class="title">(.*?)<\/span>/s', $data, $matches);

or this, if there are multiple spans with the class "title":

preg_match_all('/<span class="title">(.*?)<\/span>/s', $data, $matches);
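As a rough illustration of what preg_match_all() hands back, the sketch below runs it on a made-up two-span input. The pattern in this sketch uses `.*?` with the `s` modifier so that a title broken across lines is still captured; the sample string and variable contents are assumptions for demonstration only.

```php
<?php
// Hypothetical input: two title spans, the second one spanning two lines.
$data = "<span class=\"title\">one</span>\n<span class=\"title\">two\nlines</span>";

// Returns the number of full matches found (or false on error).
$count = preg_match_all('/<span class="title">(.*?)<\/span>/s', $data, $matches);

// With the default PREG_PATTERN_ORDER:
//   $matches[0] lists every full match,
//   $matches[1] lists what the first capture group caught in each match.
foreach ($matches[1] as $title) {
    echo $title, PHP_EOL;
}
```

Here `$count` is 2 and `$matches[1]` holds the two captured titles in document order.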

edited Aug 07 '12 at 09:48 by Nick · answered Aug 07 '12 at 09:43 by Marcus Olsson

From personal experience, I found regex to be way better with very long HTML strings. Simple HTML DOM is too slow and hits the memory limit [in this context]. – Prasanth Aug 07 '12 at 09:56
Sometimes, yes – it is absolutely an alternative. However, there is potentially a world of headaches using regular expressions – what if there are multiple spans inside the span in the example for instance? What if the HTML is malformed? – Marcus Olsson Aug 07 '12 at 10:52
When I used this library to get links with a class starting with something, I found that it can be done like in jQuery using `^=`. Then, when I searched the library file for `^=`, I found that it was indeed using regex! I was shocked. Eventually, I used regex because I had hit the memory limit using the library. Now, hearing you say that regex wouldn't be more effective than Simple HTML DOM, I felt you were partly incorrect because the library uses regex! I also realize other parts of the library are good. But that's a different matter. – Prasanth Aug 07 '12 at 11:13
Okay, I'll buy that =) But, as I said, it will very seldom be easier to use "pure" regex, because you can't count on the HTML to be absolutely correct; Simple HTML DOM will help you out with this. But yes, Simple HTML DOM has some major memory issues. – Marcus Olsson Aug 07 '12 at 11:21
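For the nested-span and malformed-HTML concerns raised above, one option nobody in the thread mentions is PHP's built-in DOM extension (DOMDocument plus DOMXPath). The following is only a sketch under that assumption: the sample markup is invented, and whether it uses less memory than Simple HTML DOM on a given page is not something this example measures.

```php
<?php
// Hypothetical, deliberately sloppy input: a nested <b> tag and an unclosed <p>.
$html = '<html><body><p>intro<span class="title">hello <b>world</b>!</span></body></html>';

$doc = new DOMDocument();
// Collect parser warnings about bad markup instead of emitting them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Match spans whose class attribute contains the token "title".
$query = '//span[contains(concat(" ", normalize-space(@class), " "), " title ")]';
$nodes = $xpath->query($query);

// textContent flattens nested tags into plain text.
$title = $nodes->length > 0 ? $nodes->item(0)->textContent : null;
echo $title, PHP_EOL;
```

The nested `<b>` does not break the lookup here because the parser builds a tree first and the text is read from the matching node, rather than from the raw string.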
