-1

I need help developing a regular expression to extract some data from HTML. The HTML looks like this:

<h5>Work Experience</h5>
  <p><span id="organization">Company Name 1</span></p>
  Designation 1
    <p>Date 1
  </p>
    <ul>
      <li>Some text 1</li>
    </ul>
  <p><span id="organization">Company Name 2</span></p>
  Designation 2
    <p>Date 2
  </p>
    <ul>
      <li>Some text 2</li>
    </ul>
  <p><span id="organization">Company Name 3</span></p>
  Designation 3
    <p>Date 3
  </p>
    <ul>
      <li>Some text 3</li>
    </ul></div>

I've tried the following regular expression:

|<h5>Work Experience<\/h5>\s*<p>(.*)<\/p>(.*)<p>(.*)<\/p>\s*<ul>(.*)<\/ul>\s*<\/div>|Uis

I need all the company names, designations, and dates.

Please help me. Thanks in advance.

  • You may want to use `explode()` instead. Read how it works here: http://php.net/manual/en/function.explode.php – node_modules Apr 13 '16 at 07:42
  • You may need to use this approach http://stackoverflow.com/questions/8144061/using-php-to-get-dom-element – hmd Apr 13 '16 at 07:49

5 Answers

3

Don't use regexes to parse HTML (see this famous answer for a detailed explanation of why). Use a parser such as DOM instead, which makes things much easier. For the example above, you can extract the company names like this:

$doc = new DOMDocument();
$doc->loadHTML($html); // $html should contain the HTML source

// Get all spans from the document
$spans = $doc->getElementsByTagName('span');

// Loop over the spans
foreach ($spans as $span) {
    // Check if the span has an id attribute with "organization" as value
    if ($span->hasAttribute('id') && $span->getAttribute('id') === 'organization') {
        echo $span->nodeValue; // This will echo the company name
    }
}

You can see a full working example and its output here: https://3v4l.org/XdrQ1
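The loop above only prints the company names. As a rough sketch of how the designation and date might be pulled out as well, one option is `DOMXPath` with sibling queries relative to the paragraph that wraps each span. The sample markup (including the wrapping `<div>`), the `$jobs` array, and its keys are assumptions made for illustration, not part of the original answer:

```php
<?php
// Sample markup modelled on the question; the wrapping <div> is an assumption.
$html = '<div>
<h5>Work Experience</h5>
  <p><span id="organization">Company Name 1</span></p>
  Designation 1
    <p>Date 1
  </p>
    <ul><li>Some text 1</li></ul>
  <p><span id="organization">Company Name 2</span></p>
  Designation 2
    <p>Date 2
  </p>
    <ul><li>Some text 2</li></ul>
</div>';

// Collect parser warnings (e.g. about the duplicate ids) instead of printing them
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$jobs = []; // hypothetical result structure

foreach ($xpath->query('//span[@id="organization"]') as $span) {
    $p = $span->parentNode; // the <p> that wraps the company name

    $jobs[] = [
        'company' => trim($span->textContent),
        // First non-blank text node that follows the company paragraph
        'designation' => $xpath->evaluate(
            'normalize-space(following-sibling::text()[normalize-space()][1])',
            $p
        ),
        // First <p> that follows the company paragraph
        'date' => $xpath->evaluate('normalize-space(following-sibling::p[1])', $p),
    ];
}

print_r($jobs);
```

This relies on the designation being a bare text node directly after the company paragraph and the date being the next `<p>`; if the real markup differs, the two sibling expressions would need adjusting.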

Oldskool
3

Here is another suggestion to use a parser instead. Consider this example with SimpleXML and XPath queries. Additionally, IDs must be unique within a document, so a class is the better choice:

<?php
$html = '
<div>
    <h5>Work Experience</h5>
    <p><span class="organization">Company Name 1</span></p>
    Designation 1
    <p>Date 1</p>
    <ul>
      <li>Some text 1</li>
    </ul>
    <p><span class="organization">Company Name 2</span></p>
    Designation 2
    <p>Date 2</p>
    <ul>
      <li>Some text 2</li>
    </ul>
</div>';

$xml = simplexml_load_string($html);
$spans = $xml->xpath("//span[@class='organization']");

foreach ($spans as $span) {
    // Do something useful here, e.g. print the company name
    echo $span, PHP_EOL;
}
?>

Hint:

As @Oldskool pointed out, you may not be able to change the original (invalid) HTML. In that case, query on the id attribute instead:

$spans = $xml->xpath("//span[@id='organization']");
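One caveat worth spelling out: `simplexml_load_string()` expects well-formed XML, and real-world HTML (like the question's snippet with its unmatched `</div>`) often is not. A possible workaround, sketched here under the assumption that the more lenient `DOMDocument::loadHTML()` accepts the markup, is to parse with DOM first and hand the result to SimpleXML via `simplexml_import_dom()`. The sample string and the `$companies` array are made up for illustration:

```php
<?php
// Sample markup with a stray closing </div>, similar to the question's snippet
$html = '<h5>Work Experience</h5>
<p><span id="organization">Company Name 1</span></p>
Designation 1
<p>Date 1</p>
<p><span id="organization">Company Name 2</span></p>
Designation 2
<p>Date 2</p></div>';

// Collect parser warnings (stray tag, duplicate ids) instead of printing them
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Convert the DOM tree into a SimpleXMLElement and query it as before
$xml = simplexml_import_dom($doc);

$companies = []; // hypothetical result list
foreach ($xml->xpath("//span[@id='organization']") as $span) {
    $companies[] = (string) $span;
}

print_r($companies);
```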
Jan
2

Try this pattern, which uses named groups for the company name, designation, and date:

<span id="organization">(?<company_name>[^<]+)<\/span><\/p>\n\s*(?<designation>[^\n]+)\n\s*<p>(?<date>[^\n]+)

Regex demo

Output:

MATCH 1
company_name    [54-68] `Company Name 1`
designation [82-95] `Designation 1`
date    [103-109]   `Date 1`
MATCH 2
company_name    [192-206]   `Company Name 2`
designation [220-233]   `Designation 2`
date    [241-247]   `Date 2`
MATCH 3
company_name    [330-344]   `Company Name 3`
designation [358-371]   `Designation 3`
date    [379-385]   `Date 3`
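For completeness, here is a minimal sketch of how that pattern could be applied in PHP with `preg_match_all()`. The sample string, the `$rows` array, and the use of `trim()` are assumptions added for illustration, and the pattern as written expects `\n` line endings:

```php
<?php
// Sample input modelled on the question (LF line endings assumed)
$html = <<<'HTML'
<h5>Work Experience</h5>
  <p><span id="organization">Company Name 1</span></p>
  Designation 1
    <p>Date 1
  </p>
  <p><span id="organization">Company Name 2</span></p>
  Designation 2
    <p>Date 2
  </p>
HTML;

// The pattern from the answer, wrapped in / delimiters
$pattern = '/<span id="organization">(?<company_name>[^<]+)<\/span><\/p>\n\s*(?<designation>[^\n]+)\n\s*<p>(?<date>[^\n]+)/';

// PREG_SET_ORDER groups the captures per match rather than per group
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$rows = []; // hypothetical result structure
foreach ($matches as $m) {
    $rows[] = [
        'company'     => trim($m['company_name']),
        'designation' => trim($m['designation']),
        'date'        => trim($m['date']),
    ];
}

print_r($rows);
```

Note that this only works while the markup keeps exactly this line layout; any change in whitespace or tag order would require adapting the pattern.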
Tim007
2

I suggest using SimpleXML instead of a regex for this case as this enables you to use specific selectors for parsing the DOM.

Moreover, IDs in the DOM should be unique.

More information about SimpleXML: http://en.php.net/SimpleXML

Andreas
  • `SimpleXML` or `DOMDocument` may not perform well on a huge HTML string, since `loadHTML` could be killed after exceeding the `maximum execution time` – Tim007 Apr 13 '16 at 08:17
1

Here is my demo. It splits the string into one block per company with explode and then loops through the blocks:

<?php
$html = '<div>
    <h5>Work Experience</h5>
    <p><span class="organization">Company Name 1</span></p>
    Designation 1
    <p>Date 1</p>
    <ul>
      <li>Some text 1</li>
    </ul>
    <p><span class="organization">Company Name 2</span></p>
    Designation 2
    <p>Date 2</p>
    <ul>
      <li>Some text 2</li>
    </ul>
</div>';

// Each company's markup ends with </ul>, so split the string there
$companyBlocks = explode('</ul>', $html);

foreach ($companyBlocks as $block) {
    // The chunk after the final </ul> contains no company, so skip it
    if (strpos($block, 'organization">') === false) {
        continue;
    }

    // Company name: the text between organization"> and </span>
    $company = explode('organization">', $block);
    $company = explode('</span>', $company[1]);
    echo 'Company: ' . trim($company[0]) . '<br>';

    // Designation: the text between </span></p> and the next <p>
    $rest = explode('</span></p>', $block);
    $parts = explode('<p>', $rest[1]);
    echo 'Designation: ' . trim($parts[0]) . '<br>';

    // Date: the content of the <p> that follows the designation
    $date = explode('</p>', $parts[1]);
    echo 'Date: ' . trim($date[0]) . '<br>';
}
Ben Rhys-Lewis