Example 1:

I have a PDF document and used the online PDF Parser (www.pdfparser.org) to extract all of its content as text. I manually saved that content to a TXT file and tried to filter some data with regular expressions; everything worked normally.


Example 2:

To automate the process, I downloaded the PDF Parser API and wrote a script that does the following:

1) Converts the PDF to text using the parseFile() API method.
2) Saves the content to a TXT file.
3) Tries to filter this TXT file using regular expressions.


Example 1 -> It worked and returned:

array (size = 2)
   'mora_dia' =>
     array (size = 1)
       0 => string 'R$ 3.44' (length = 7)
   'multa' =>
     array (size = 1)
       0 => string 'R$ 17,21' (length = 8)

Example 2 -> It did not work.

array (size = 2)
   'mora_dia' =>
     array (size = 0)
       empty
   'multa' =>
     array (size = 0)
       empty

The data in the two TXT files looks identical, so why does the second example not work? (I also tried doing this without saving to a TXT file, but that did not work either.)

Below are the codes of my two examples:

Example 1:

$data = file_get_contents('exemplo_01.txt');

$regex = [
    'mora_dia' => '/R\$ [0-9]{1,}\.[0-9]{1,}/i',
    'multa'    => '/R\$ [0-9]{1,}\,[0-9]{1,}/i'
];

foreach($regex as $title => $ex)
{
    preg_match($ex, $data, $matches[$title]);
}

var_dump($matches);

Example 2:

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile($PDFFile);
$pages = $pdf->getPages();

foreach ($pages as $page) {
    $PDFParse = $page->getText();
}

$txtName = __DIR__ . '/files/Txt/' . md5(uniqid(rand(), true)) . '.txt';
$file  = fopen($txtName, 'w+');
fwrite($file, $PDFParse);
fclose($file);

$dataTxt = file_get_contents($txtName);

$regex = [
    'mora_dia' => '/R\$ [0-9]{1,}\.[0-9]{1,}/i',
    'multa'    => '/R\$ [0-9]{1,}\,[0-9]{1,}/i'
];

foreach($regex as $title => $ex)
{
    preg_match($ex, $dataTxt, $matches[$title]);
}
  • How did you verify that the two produced text files are identical? Did you inspect them with a hex editor, or check their md5sum? There may be a difference in trailing line break, for example. Did you try `$dataTxt = trim($dataTxt);`? – Michael Berkowski Dec 21 '14 at 22:27
  • @MichaelBerkowski This is the text from the first example -> http://pastebin.com/txNtnERG | This is the text from the second example -> http://pastebin.com/H7D5xJBH –  Dec 21 '14 at 22:37
  • These differ in the type of whitespace between `R$` and the number. Your copy/paste action might have caused that, but example2 has 0xA0 instead of a regular space (0x20). Apparently A0 is a non-breaking space (http://www.fileformat.info/info/unicode/char/a0/index.htm) – Michael Berkowski Dec 21 '14 at 22:56
  • In fact, it looks like all the spaces in example 2 are non-breaking 0xA0. – Michael Berkowski Dec 21 '14 at 22:58

2 Answers

 $PDFParse ='';
 foreach ($pages as $page) {
     $PDFParse = $PDFParse.$page->getText();
 }

Build $PDFParse as a single string, as above, and try calling fflush($file) after fwrite().
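
To illustrate why the loop matters, here is a minimal sketch that uses a plain array of strings in place of the parsed pages. The `$pageTexts` array and its contents are made up for the example; in the question these strings would come from `$page->getText()`.

```php
<?php
// Stand-in for the text of each page (hypothetical sample data).
$pageTexts = ["Header text\n", "Body text\n", "R\$ 3.44\n"];

// Plain assignment: each iteration overwrites the previous page,
// so only the last page survives the loop.
$lastOnly = '';
foreach ($pageTexts as $text) {
    $lastOnly = $text;
}

// Concatenation: every page is appended to the result.
$allPages = '';
foreach ($pageTexts as $text) {
    $allPages .= $text;
}

var_dump($lastOnly);
var_dump($allPages);
```

If the values you are looking for are on the last page, both versions will appear to behave the same, which may be why this change alone does not alter the result.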

Andrii
  • The result remains the same. I opened the TXT file saved before the change, and it contains the same data as the first example. –  Dec 21 '14 at 22:39

Copying and pasting the output text manually seems to have changed its contents. Based on the pastebin output, the direct-to-file version contains non-breaking space characters rather than regular spaces. The non-breaking space is code point U+00A0 (decimal 160), as opposed to the regular space, U+0020 (decimal 32).

In fact, it looks as though all the space characters in the direct to file example are non-breaking 0xA0 spaces.

To adapt your regular expression to accommodate either type of space, place the hex escape in a [] character class along with the regular space character ' ', as in [ \xA0], to match either type. You also need the /u flag so that the pattern and the subject are treated as UTF-8; with it, \xA0 matches the code point U+00A0 rather than a single raw byte.

$regex = [
    'mora_dia' => '/R\$[ \xA0][0-9]{1,}\.[0-9]{1,}/iu',
    'multa'    => '/R\$[ \xA0][0-9]{1,},[0-9]{1,}/iu'
];

(Note: the comma does not require backslash-escaping.)

This works correctly, using your raw pastebin as input:

$str = file_get_contents('http://pastebin.com/raw.php?i=H7D5xJBH');
preg_match('/R\$[ \xa0][0-9]{1,}\.[0-9]{1,}/ui', $str, $matches);
var_dump($matches);

// Prints:
array(1) {
  [0] =>
  string(8) "R$ 3.44"
}
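
As a small self-contained check that does not depend on the pastebin URL, the sketch below runs the same pattern with and without the /u flag. It assumes the text is UTF-8, so the non-breaking space is the byte pair C2 A0; the sample string is made up for the example.

```php
<?php
// Sample subject: "R$", a UTF-8 non-breaking space (bytes C2 A0), then "3.44".
$subject = "R\$\xC2\xA03.44";

$pattern = 'R\$[ \xA0][0-9]{1,}\.[0-9]{1,}';

// Without /u, \xA0 stands for the single byte A0, but the byte that
// follows "$" in the subject is C2, so the character class fails.
$withoutU = preg_match('/' . $pattern . '/i', $subject, $byteMatch);

// With /u, the subject is decoded as UTF-8 and \xA0 stands for the
// code point U+00A0, which is what follows "$".
$withU = preg_match('/' . $pattern . '/iu', $subject, $utf8Match);

var_dump($withoutU, $withU);
var_dump(bin2hex($utf8Match[0]));
```

The hex dump of the match makes the two-byte space visible, which also explains why the matched string reports a length of 8 rather than 7.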

A different solution might be to replace the non-breaking spaces with regular spaces in the entire text before applying your original regular expression:

// Replace all non-breaking spaces with regular spaces in the
// text string read from the file...
// In UTF-8, the non-breaking space U+00A0 is encoded as the two bytes C2 A0,
// and both bytes are needed to replace it successfully.
$dataTxt = str_replace("\xC2\xA0", " ", $dataTxt);
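
The normalize-then-match approach can be sketched end to end as follows. This assumes the text is UTF-8, where the non-breaking space is the byte pair C2 A0; the sample string stands in for the parsed PDF text and is invented for the example.

```php
<?php
// Sample text with UTF-8 non-breaking spaces (bytes C2 A0) after each "R$".
$dataTxt = "Mora: R\$\xC2\xA03.44 Multa: R\$\xC2\xA017,21";

// Turn every non-breaking space into a regular space.
$normalized = str_replace("\xC2\xA0", ' ', $dataTxt);

// The original patterns, which expect a regular space after "R$".
$regex = [
    'mora_dia' => '/R\$ [0-9]{1,}\.[0-9]{1,}/i',
    'multa'    => '/R\$ [0-9]{1,},[0-9]{1,}/i',
];

$matches = [];
foreach ($regex as $title => $ex) {
    preg_match($ex, $normalized, $matches[$title]);
}

var_dump($matches);
```

If the file turns out to use a different encoding, the byte sequence passed to str_replace would need to change accordingly, so it is worth checking the actual bytes first.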

Whenever you have input that you expect to be identical and that appears visually identical, be sure to inspect it with a tool capable of displaying each character's hex code. In this case, I copied your samples from pastebin into files and inspected them with Vim, where I have set up hex and ASCII display for the character under the cursor.
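
If a hex-capable editor is not at hand, the same inspection can be approximated in PHP with bin2hex. The helper below, `hexAfterMarker`, is a hypothetical name introduced for this sketch, and the two sample strings are made up to mimic the pasted and the parsed text.

```php
<?php
// Hypothetical helper: for every occurrence of $marker in $text, return
// the hex dump of the $length bytes that immediately follow it.
function hexAfterMarker(string $text, string $marker, int $length = 3): array
{
    $found = [];
    $offset = 0;
    while (($pos = strpos($text, $marker, $offset)) !== false) {
        $start = $pos + strlen($marker);
        $found[] = bin2hex(substr($text, $start, $length));
        $offset = $start;
    }
    return $found;
}

$pasted = 'R$ 3.44';            // regular space (byte 20)
$parsed = "R\$\xC2\xA03.44";    // UTF-8 non-breaking space (bytes C2 A0)

var_dump(hexAfterMarker($pasted, 'R$')); // bytes 20 33 2e
var_dump(hexAfterMarker($parsed, 'R$')); // bytes c2 a0 33
```

Running it against both files read with file_get_contents shows at a glance whether the bytes after each `R$` differ, even when the text looks the same on screen.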

Michael Berkowski
  • Thanks, but unfortunately it is still not working. It fails only for the file that was generated automatically; the file that was copied and pasted already worked. –  Dec 21 '14 at 23:27
  • I ran into this problem as well, and this answer is extremely close. It's not A0, it's actually line feed or 0a. By using bin2hex, followed by str_replace on 0a, then hex2bin on the value my regex worked. Alternatively see: https://stackoverflow.com/questions/10757671/how-to-remove-line-breaks-no-characters-from-the-string – Chris Feb 18 '19 at 05:46