
This question is a continuation of my previous question:

Check tag & get the value inside tag using PHP

I have text like this:

<ORGANIZATION>Head of Pekalongan Regency</ORGANIZATION>, Dra. Hj.. Siti Qomariyah , MA and her staff were greeted by <ORGANIZATION>Rector of IPB</ORGANIZATION> Prof. Dr. Ir. H. Herry Suhardiyanto , M.Sc. and <ORGANIZATION>officials of IPB</ORGANIZATION> in the guest room.

Using the code from the answer to my previous question, with PREG_OFFSET_CAPTURE added:

function get_text_between_tags($string, $tagname) {
    $pattern = "/<$tagname\b[^>]*>(.*?)<\/$tagname>/is";
    preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE);
    if(!empty($matches[1]))
        return $matches[1];
    return array();
}

I get this output:

Array (
[0] => Array ( [0] => Head of Pekalongan Regency [1] => 14 )
[1] => Array ( [0] => Rector of IPB [1] => 131 )
[2] => Array ( [0] => officials of IPB [1] => 222 ) )

14, 131, and 222 are the character offsets at which the pattern matched. Can I get the word index instead? I mean output like this:

Array (
[0] => Array ( [0] => Head of Pekalongan Regency [1] => 0 )
[1] => Array ( [0] => Rector of IPB [1] => 15 )
[2] => Array ( [0] => officials of IPB [1] => 27 ) )

Is there a way to do this other than PREG_OFFSET_CAPTURE, or do I need more code? I have no idea. Thanks for the help. :)

andrefadila
  • No, there is no built-in support for getting the word index instead. If that's really important (you didn't elaborate *why*, so I'm assuming it's not), you have to invest some work. Given the string indexes you already have, you can compare those against a list of word positions acquired with a second `preg_match_all('/\w+/'` call. (Though that requires first replacing the tags with spaces.) – mario May 10 '13 at 01:43
  • Oh okay, sorry that my question wasn't clear. Actually, my problem is checking phrases like "red apple" or "blue apple". Both phrases contain "apple", but with just `preg_match_all('/\w+/'` we can't tell whether the red or the blue "apple" comes first. – andrefadila May 10 '13 at 02:01
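
The approach suggested in the first comment can be sketched roughly as follows. This is not code from the thread: the function name `get_text_with_word_index` is hypothetical, and the sketch assumes that a "word" is a run of `\w` characters, so punctuation standing alone (such as the lone "," in the sample text) is not counted and "M.Sc." counts as two words.

```php
<?php
// Hypothetical helper (not from the original thread): returns
// array(text, wordIndex) pairs for every phrase wrapped in <$tagname>.
function get_text_with_word_index($string, $tagname) {
    $tag = preg_quote($tagname, '/');
    $pattern = "/<$tag\b[^>]*>(.*?)<\/$tag>/is";
    preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE);

    $result = array();
    foreach ($matches[1] as $match) {
        list($text, $offset) = $match;
        // Take everything before the phrase and replace the tags with spaces,
        // so tag names are not counted as words.
        $prefix = preg_replace("/<\/?$tag\b[^>]*>/i", ' ', substr($string, 0, $offset));
        // preg_match_all() returns the number of matches, i.e. the number of
        // words that precede the phrase (assumption: a word is a \w+ run).
        $wordIndex = preg_match_all('/\w+/', $prefix, $unused);
        $result[] = array($text, $wordIndex);
    }
    return $result;
}
```

Called on the text from the question with `'ORGANIZATION'` as the tag name, this should give the word indexes 0, 15 and 27 that the question asks for. A different definition of "word" (for example, splitting on whitespace only) would give different numbers.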

1 Answer


This will work, but will need a little bit of finishing up:

<?php

$raw = '<ORGANIZATION>Head of Pekalongan Regency</ORGANIZATION>, Dra. Hj.. Siti Qomariyah , MA and her staff were greeted by <ORGANIZATION>Rector of IPB</ORGANIZATION> Prof. Dr. Ir. H. Herry Suhardiyanto , M.Sc. and <ORGANIZATION>officials of IPB</ORGANIZATION> in the guest room.';

$result = getExploded($raw,'<ORGANIZATION>','</ORGANIZATION>');

echo '<pre>';
print_r($result);
echo '</pre>';

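function getExploded($data, $tagStart, $tagEnd) {
    $result = array();
    // Every chunk after the first one starts with the text of a tagged phrase.
    $tmpData = explode($tagStart,$data);
    $wordCount = 0;
    foreach($tmpData as $k => $v) {
        // Keep only the text in front of the closing tag.
        $tmp = explode($tagEnd,$v);
        $result[$k][0] = $tmp[0];
        $result[$k][1] = $wordCount;
        // Advance the counter by the number of spaces in this chunk.
        $wordCount = $wordCount + (count(explode(' ',$v)) - 1);
    }
    return $result;
}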

?>

And the result is:

Array
(
    [0] => Array
        (
            [0] => 
            [1] => 0
        )

    [1] => Array
        (
            [0] => Head of Pekalongan Regency
            [1] => 0
        )

    [2] => Array
        (
            [0] => Rector of IPB
            [1] => 16
        )

    [3] => Array
        (
            [0] => officials of IPB
            [1] => 28
        )

)
Tigger
  • Yeah, great, it works, thanks. What is in array index 0? – andrefadila May 10 '13 at 02:31
  • Array index 0 is "blank" because the `explode()` is performed on the first `$tagStart` found, which just happens to also be the first part of the `$raw` data, so nothing comes before it. Shift the first `<ORGANIZATION>` along or put some text in front of it and see what happens. You should simply be able to drop index 0 to fix the issue. – Tigger May 10 '13 at 03:08
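
Dropping index 0, as suggested in the last comment, could look something like the sketch below. The name `getExplodedTagged` is hypothetical. The chunk in front of the first start tag is still counted, so any text placed before the first tag shifts the later indexes, but it is no longer stored in the result. The counting rule is unchanged from the answer, so a standalone "," still counts as a word and the sample text still yields 0, 16 and 28.

```php
<?php
// Hypothetical variant of getExploded() (not from the original answer):
// the chunk before the first start tag is counted but not stored,
// so the result has no blank entry at index 0.
function getExplodedTagged($data, $tagStart, $tagEnd) {
    $result = array();
    $wordCount = 0;
    foreach (explode($tagStart, $data) as $k => $chunk) {
        if ($k > 0) {
            // Chunks after the first begin with the tagged phrase.
            $parts = explode($tagEnd, $chunk);
            $result[] = array($parts[0], $wordCount);
        }
        // Same rule as the answer: advance by the number of spaces in the chunk.
        $wordCount += count(explode(' ', $chunk)) - 1;
    }
    return $result;
}
```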