preg_match() + regex does not work in TXT file

Question

Example 1:

I have a PDF document and used the online PDF Parser (www.pdfparser.org) to extract all of its content as text. I saved that content to a TXT file manually and filtered some data from it with regular expressions; everything worked normally.

Example 2:

To automate the process, I downloaded the PDF Parser API and wrote a script that does the following:

1) Converts the PDF to text using the parseFile() API method.
2) Saves the content to a TXT file.
3) Tries to filter this TXT file using regular expressions.

Example 1 -> It worked and returned:

array (size = 2)
   'mora_dia' =>
     array (size = 1)
       0 => string 'R$ 3.44' (length = 7)
   'multa' =>
     array (size = 1)
       0 => string 'R$ 17,21' (length = 8)

Example 2 -> It did not work.

array (size = 2)
   'mora_dia' =>
     array (size = 0)
       empty
   'multa' =>
     array (size = 0)
       empty

The data in the two TXT files looks identical, so why does the second example not work? (I also tried this without saving to a TXT file, but it did not work either.)

Below are the codes of my two examples:

Example 1:

$data = file_get_contents('exemplo_01.txt');

$regex = [
    'mora_dia' => '/R\$ [0-9]{1,}\.[0-9]{1,}/i',
    'multa'    => '/R\$ [0-9]{1,}\,[0-9]{1,}/i'
];

foreach($regex as $title => $ex)
{
    preg_match($ex, $data, $matches[$title]);
}

var_dump($matches);

Example 2:

$parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($PDFFile);
    $pages = $pdf->getPages();

    foreach ($pages as $page) {
        $PDFParse = $page->getText();
    }

    $txtName = __DIR__ . '/files/Txt/' . md5(uniqid(rand(), true)) . '.txt';
    $file  = fopen($txtName, 'w+');
    fwrite($file, $PDFParse);
    fclose($file);

    $dataTxt = file_get_contents($txtName);

    $regex = [
        'mora_dia' => '/R\$ [0-9]{1,}\.[0-9]{1,}/i',
        'multa'    => '/R\$ [0-9]{1,}\,[0-9]{1,}/i'
    ];

    foreach($regex as $title => $ex)
    {
        preg_match($ex, $dataTxt, $matches[$title]);
    }

How did you verify that the two produced text files are identical? Did you inspect them with a hex editor, or check their md5sum? There may be a difference in trailing line break, for example. Did you try `$dataTxt = trim($dataTxt);`? — Michael Berkowski, Dec 21 '14 at 22:27
@MichaelBerkowski This is the text from the first example -> http://pastebin.com/txNtnERG | This is the text from the second example -> http://pastebin.com/H7D5xJBH — , Dec 21 '14 at 22:37
These differ in the type of whitespace between `R$` and the number. Your copy/paste action might have caused that, but example2 has 0xA0 instead of a regular space (0x20). Apparently A0 is a non-breaking space (http://www.fileformat.info/info/unicode/char/a0/index.htm) — Michael Berkowski, Dec 21 '14 at 22:56
In fact, it looks like all the spaces in example 2 are non-breaking 0xA0. — Michael Berkowski, Dec 21 '14 at 22:58

score 0 · Answer 1 · answered Dec 21 '14 at 22:34

 $PDFParse ='';
 foreach ($pages as $page) {
     $PDFParse = $PDFParse.$page->getText();
 }

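This initializes $PDFParse as a string and appends the text of every page instead of keeping only the last one. Also try calling fflush($file) after fwrite().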

Andrii

The result remains the same. I opened the TXT file saved before the change, and it contains the same data as the first example. – Dec 21 '14 at 22:39

Michael Berkowski · Accepted Answer · 2014-12-21T23:35:53.523

Manually copying and pasting the output text seems to have changed its contents. Based on the pastebin output, the direct-to-file version contains non-breaking space characters rather than regular spaces. The non-breaking space has hex code 0xA0 (decimal 160), as opposed to a regular space, hex 0x20 (decimal 32).

In fact, it looks as though all the space characters in the direct to file example are non-breaking 0xA0 spaces.

To make your regular expression accommodate either type of space, place the hex code in a [] character class along with the regular space character ' ', as in [ \xA0], to match either one. You will also need the /u flag to work with Unicode.

$regex = [
    'mora_dia' => '/R\$[ \xA0][0-9]{1,}\.[0-9]{1,}/iu',
    'multa'    => '/R\$[ \xA0][0-9]{1,},[0-9]{1,}/iu'
];

(note, the , comma does not require backslash-escaping)

This works correctly, using your raw pastebin as input:

$str = file_get_contents('http://pastebin.com/raw.php?i=H7D5xJBH');
preg_match('/R\$[ \xa0][0-9]{1,}\.[0-9]{1,}/ui', $str, $matches);
var_dump($matches);

// Prints:
array(1) {
  [0] =>
  string(8) "R$ 3.44"
}
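
The same check can be reproduced without the pastebin URL. The following is a minimal sketch, assuming the generated text is UTF-8, where the non-breaking space U+00A0 is stored as the two bytes C2 A0. The sample strings and variable names are made up for illustration.

```php
<?php
// Hypothetical sample strings; "\xC2\xA0" is the UTF-8 encoding of U+00A0.
$withRegularSpace = 'Mora/dia: R$ 3.44';
$withNbsp         = 'Mora/dia: R$' . "\xC2\xA0" . '3.44';

// The character class accepts a regular space or U+00A0. With /u, PCRE
// treats pattern and subject as UTF-8, so \xA0 means the code point U+00A0.
$pattern = '/R\$[ \xA0][0-9]+\.[0-9]+/u';

$resultA = preg_match($pattern, $withRegularSpace, $matchA);
$resultB = preg_match($pattern, $withNbsp, $matchB);

var_dump($resultA, $resultB);       // int(1) int(1)
echo bin2hex($matchB[0]), PHP_EOL;  // 5224c2a0332e3434
```

The hex dump of the second match shows the c2 a0 pair where the copied-and-pasted text would have a single 20.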

A different solution might be to replace the non-breaking spaces with regular spaces in the entire text before applying your original regular expression:

// Replace all non-breaking spaces with regular spaces in the
// text string read from the file...
// In UTF-8, the non-breaking space U+00A0 is encoded as the two
// bytes C2 A0, and both are needed to replace this successfully.
$dataTxt = str_replace("\xC2\xA0", " ", $dataTxt);
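
As an end-to-end illustration of the normalization approach, here is a small sketch. It assumes the text is UTF-8, so each non-breaking space is the byte pair C2 A0; if the file were Latin-1 instead, the byte to replace would be a single A0. The input string is a made-up sample.

```php
<?php
// Hypothetical input: UTF-8 text containing a non-breaking space.
$dataTxt = 'Multa: R$' . "\xC2\xA0" . '17,21';

// Replace every UTF-8 non-breaking space (bytes C2 A0) with an ASCII space.
$normalized = str_replace("\xC2\xA0", ' ', $dataTxt);

// The original pattern, with a plain space, now matches.
preg_match('/R\$ [0-9]+,[0-9]+/', $normalized, $matches);
var_dump($matches[0]); // string(8) "R$ 17,21"
```

Normalizing once up front keeps the individual patterns simple, at the cost of altering the text before matching.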

Whenever you have input that you expect to be identical and that looks identical, be sure to inspect it with a tool capable of displaying each character's hex code. In this case, I copied your samples from pastebin into files and inspected them with Vim, where I have set up hex and ASCII display for the character under the cursor.
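
If no hex editor is at hand, a similar inspection can be done from PHP itself. This is only a sketch: hexAround() is a hypothetical helper written for this example, not part of PHP or PDF Parser, and the two strings stand in for the copied and the generated text.

```php
<?php
// Hypothetical helper: return the bytes starting at the first occurrence
// of $needle as space-separated hex pairs.
function hexAround(string $text, string $needle, int $length = 8): string
{
    $pos = strpos($text, $needle);
    if ($pos === false) {
        return '';
    }
    // bin2hex() yields two hex digits per byte; split them into pairs.
    return implode(' ', str_split(bin2hex(substr($text, $pos, $length)), 2));
}

$copied    = 'R$ 3.44';                       // regular space (0x20)
$generated = 'R$' . "\xC2\xA0" . '3.44';      // UTF-8 non-breaking space

echo hexAround($copied, 'R$'), PHP_EOL;    // 52 24 20 33 2e 34 34
echo hexAround($generated, 'R$'), PHP_EOL; // 52 24 c2 a0 33 2e 34 34
```

Comparing the two dumps side by side makes the differing whitespace bytes visible even though both strings render the same on screen.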

Thanks, but unfortunately it is still not working. It fails only for the file that was generated automatically; the file that was copied and pasted was already working. — , Dec 21 '14 at 23:27
I ran into this problem as well, and this answer is extremely close. It's not A0, it's actually line feed or 0a. By using bin2hex, followed by str_replace on 0a, then hex2bin on the value my regex worked. Alternatively see: https://stackoverflow.com/questions/10757671/how-to-remove-line-breaks-no-characters-from-the-string — Chris, Feb 18 '19 at 05:46
