I have a Text.xml file containing some text with bibliographic references in it. It looks like this:

Text.xml

<p>…blabla S.King (1987). Bla bla bla J.Doe (2001) blabla bla J.Martin (1995) blabla…</p>

I also have a Reference.txt file with a list of bibliographic references and an ID for each one. It looks like this:

Reference.txt

b1#S.King (1987)
b2#J.Doe (2001)
b3#J.Martin (1995)

I would like to find every bibliographic reference from Reference.txt in Text.xml and wrap it in a tag carrying its ID. The goal is a TextWithReference.xml file that looks like this:

TextWithReference.xml

<p>…blabla <ref type="biblio" target="b1">S.King (1987)</ref>. Bla bla bla <ref type="biblio" target="b2">J.Doe (2001)</ref> blabla bla <ref type="biblio" target="b3">J.Martin (1995)</ref> blabla…</p>

To do this, I use a PHP script:

Search&Replace.php

<?php
$handle = fopen("Reference.txt","r");
while(!feof($handle))
{
    $ligne = fgets($handle,1024);
    $tabRef[] = $ligne;
}   
fclose($handle);

$handleXML = fopen("Text.xml","r");
$fp = fopen("TextWithReference.xml", "w");
while(!feof($handleXML))
{
    $ligneXML = fgets($handleXML,2048);
        for($i=0;$i<sizeof($tabRef);$i++)
        {
            $tabSearch = explode('/#/',$tabRef[$i]);
            $xmlID = $tabSearch[0];
            $searchString = trim($tabSearch[1]);
            if(preg_match('/$searchString/',$ligneXML))
            {
                $ligneXML = preg_replace('/($searchString)/','/<ref type=\"biblio\" target=\"#$xmlID\">\\0</ref>/',$ligneXML);
            }

        }
    fwrite($fp, $ligneXML);
}
fclose($handleXML);
fclose($fp);

?>

The problem is that this PHP script just copies Text.xml to TextWithReference.xml without identifying the bibliographic references or adding the tags…

Many thanks for your help!

  • If your Text.xml really is well-formed XML, I think the fastest way (clean, and light on memory) is to use the XMLReader/XMLWriter combo to create TextWithReference.xml. – Casimir et Hippolyte Feb 11 '16 at 01:19
  • Can you provide a url for your two xml files? – fusion3k Feb 11 '16 at 01:23
  • You should trim and explode the search strings when you're creating `$tabRef`, not for every line in the XML file. – Barmar Feb 11 '16 at 01:25
  • Yes, Text.xml is well-formed XML. Do you mean using a regex directly in the XML editor, or using XSLT? – Andrew Green Feb 11 '16 at 01:26
  • No, XMLReader is a built-in PHP class designed to parse an XML file node by node (an opening tag, a comment, a text node...), and XMLWriter writes an XML file node by node too. XSLT, why not, but it isn't very handy and is particularly slow with PHP. – Casimir et Hippolyte Feb 11 '16 at 01:30

1 Answer

There are a number of problems with your code.

  1. The search strings contain characters that are special in regular expressions, such as parentheses and dots. You need to escape these if you want to match them literally. The preg_quote function does this; pass the pattern delimiter as its second argument so that the delimiter is escaped too.

  2. Your file-reading loops are not correct. while (!feof()) is not the correct way to read through a file, because the EOF flag isn't set until after you read at the end of the file. So you'll go through the loops an extra time. The proper way to write this is while ($ligne = fgets()).

  3. You have single quotes around the strings where you're trying to substitute $searchString and $xmlID. Variables are only substituted inside double quotes. See What is the difference between single-quoted and double-quoted strings in PHP?

  4. You don't need to put / delimiters around the replacement string in preg_replace.

  5. It's inefficient to explode, trim and escape the lines from the Reference.txt every time you're processing a line in Text.xml. Do it once when you're reading Reference.txt.

  6. In the replacement string, use $0 to insert the matched text from the source. The \0 form still works, but $n is the form the PHP manual prefers.

  7. You don't need parentheses around the search string in the regexp, since you're not using the $1 capture group in the replacement. And since it's around the whole regexp, it's the same as $0.
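
Points 1 and 3 are the ones that stop the original script from matching anything, and they can be checked in isolation. The snippet below is only a small sketch: the variable names and the sample line are illustrative, taken from the first entry of Reference.txt, and it is not part of the rewritten script.

```php
<?php
// Illustrative values only; they mirror the first line of Reference.txt.
$searchString = 'S.King (1987)';
$line = 'blabla S.King (1987). Bla bla';

// Single quotes: no interpolation, so the pattern is the literal text
// /$searchString/ (an end-of-line anchor followed by "searchString").
$literalPattern = '/$searchString/';

// Double quotes interpolate the variable, but the parentheses now form a
// capture group, so the pattern expects "1987" right after the space.
$unescapedPattern = "/$searchString/";

// preg_quote() escapes ".", "(" and ")"; the second argument also escapes
// the "/" delimiter if it ever appears in a reference.
$escapedPattern = '/' . preg_quote($searchString, '/') . '/';

$matchesLiteral   = preg_match($literalPattern, $line);
$matchesUnescaped = preg_match($unescapedPattern, $line);
$matchesEscaped   = preg_match($escapedPattern, $line);

var_dump($matchesLiteral, $matchesUnescaped, $matchesEscaped);
// prints int(0), int(0), int(1)
```

Only the quoted, interpolated pattern matches the reference as it appears in the text.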

Here's the working rewrite:

<?php
$handle = fopen("Reference.txt","r");
$tabRef = array();
while($ligne = trim(fgets($handle,1024))) {
    list($xmlID, $searchString) = explode('#', $ligne);
    // Escape regex metacharacters once, including the "/" delimiter used below.
    $tabRef[] = array($xmlID, preg_quote($searchString, '/'));
}   
fclose($handle);

$handleXML = fopen("Text.xml","r");
$fp = fopen("TextWithReference.xml", "w");
while($ligneXML = fgets($handleXML,2048)) {
    foreach ($tabRef as $tabSearch) {
        $xmlID = $tabSearch[0];
        $searchString = $tabSearch[1];
        if(preg_match("/$searchString/",$ligneXML)) {
            $ligneXML = preg_replace("/$searchString/","<ref type=\"biblio\" target=\"#$xmlID\">$0</ref>",$ligneXML);
        }
    }
    fwrite($fp, $ligneXML);
}
fclose($handleXML);
fclose($fp);

?>

Another improvement takes advantage of the ability to pass arrays as the search and replacement arguments to preg_replace, instead of using a loop. When reading Reference.txt, build the regexp and replacement strings there, and put each into its own array.

<?php
$handle = fopen("Reference.txt","r");
$search = array();
$replacement = array();
while($ligne = trim(fgets($handle,1024))) {
    list($xmlID, $searchString) = explode('#', $ligne);
    // Escape regex metacharacters, including the "/" delimiter added around the pattern.
    $search[] = "/" . preg_quote($searchString, '/') . "/";
    $replacement[] = "<ref type=\"biblio\" target=\"#$xmlID\">$0</ref>";
}   
fclose($handle);

$handleXML = fopen("Text.xml","r");
$fp = fopen("TextWithReference.xml", "w");
while($ligneXML = fgets($handleXML,2048)) {
    $ligneXML = preg_replace($search,$replacement,$ligneXML);
    fwrite($fp, $ligneXML);
}
fclose($handleXML);
fclose($fp);

?>
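
To try the array form of preg_replace without creating any files, the same logic can be run on in-memory strings. This is a minimal sketch under assumptions: the function name tagReferences and the sample input are hypothetical, and, unlike the scripts above, it also skips blank or malformed reference lines.

```php
<?php
// Hypothetical helper for experimenting; not part of the original answer.
function tagReferences(array $referenceLines, string $text): string
{
    $search = [];
    $replacement = [];
    foreach ($referenceLines as $line) {
        $line = trim($line);
        // Skip blank lines and lines without an "ID#reference" separator.
        if ($line === '' || strpos($line, '#') === false) {
            continue;
        }
        // Limit of 2 keeps any further "#" inside the reference text.
        list($xmlID, $searchString) = explode('#', $line, 2);
        $search[] = '/' . preg_quote($searchString, '/') . '/';
        // "$0" is not a PHP variable, so it reaches preg_replace unchanged.
        $replacement[] = "<ref type=\"biblio\" target=\"#$xmlID\">$0</ref>";
    }
    // preg_replace applies each pattern/replacement pair in order;
    // fall back to the untouched text if it reports an error (null).
    return preg_replace($search, $replacement, $text) ?? $text;
}

$tagged = tagReferences(
    ["b1#S.King (1987)\n", "\n", "b2#J.Doe (2001)\n"],
    '<p>blabla S.King (1987). Bla J.Doe (2001) bla</p>'
);
echo $tagged, "\n";
```

Because the patterns are applied one after another to the same string, a reference that is a substring of another one could end up tagged twice; the sample data here does not have that problem.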
Barmar
  • Many, many thanks! It works like a charm! And your script is much faster! I'm very grateful for all your detailed explanations; they will be very useful to me! Thanks a lot!!! – Andrew Green Feb 11 '16 at 02:06