I have a file that looks like this (yes the line breaks are right):

39                                              9
30 30 30 31 34 30 30 32 33 32 36 30 31 38 0D 0A 00014002326018..
39 30 30 30 31 34 30 30 32 33 32 36 30 35 34 0D 900014002326054.
0A                                              .
39 30 30 30 31 34 30 30 32 33 32 36 30 39 31 0D 900014002326091.
0A                                              .
39 30 30 30 31 34 30 30 32 33 32 36 31 36 33 0D 900014002326163.
0A                                              .
39                                              9
30 30 30 31 34 30 30 32 33                      000140023
32 36 32 30 30 0D 0A                            26200..
39                                              9
30 30 30 31 34 30 30 32 33 32 36 32 30 30 0D 0A 00014002326200..
39 30 30 30 31 34 30 30 32 33 32 36 31 32 32 0D 900014002326122.
0A                                              .
39                                              9
30 30 30 31 34 30 30 32 33                      000140023
32 36 31 35 34 0D 0A                            26154..
39 30 30 30 31 34 30 30 32 33                   9000140023
32 36 31 33 31 0D 0A                            26131..
39                                              9
30 30 30 31 34 30 30 32 33                      000140023
32 36 31 30 34 0D 0A                            26104..
39 30 30 30 31 34 30 30 32 33 32 36 30 39 30 0D 900014002326090.
0A                                              .
39 30 30 30 31 34 30 30 32 33 32 36 31 39 37 0D 900014002326197.
0A                                              .
39                                              9
30 30 30 31 34 30 30 32 33 32 36 32 30 38 0D 0A 00014002326208..
39 30 30 30 31 34 30 30 32 33                   9000140023
32 36 31 31 35 0D 0A                            26115..
39                                              9
30 30 30 31 34 30 30 32 33                      000140023
32 36 31 36 34 0D 0A                            26164..
39                                              9
30 30 30 31 34 30 30 32 33                      000140023
32 36 30 31 36 0D 0A 39 30 30 30 31 34 30 30 32 26016..900014002
33                                              3
32 36 32 34 36 0D 0A                            26246..
39                                              9
30 30 30 31 34 30 30 32 33                      000140023
32 36 32 34 36 0D 0A                            26246..
39                                              9
30 30 30 31 34 30 30 32 33                      000140023
32 36 30 37 39 0D 0A                            26079..
39                                              9
30 30 30 31 34 30 30 32 33                      000140023
32 36 31 32 30 0D 0A                            26120..
39                                              9
30 30 30 31 34 30 30 32 33 32 36 32 32 38 0D 0A 00014002326228..
39 30 30 30 31 34 30 30 32 33                   9000140023
32 36 31 38 36 0D 0A                            26186..

I have this code that grabs the EID tags (the numbers that start with 9000), but I can't figure out how to make it match tags that are split across multiple lines.

$data = file_get_contents('tags.txt');

$pattern = "/(\d{15})/i";

preg_match_all($pattern, $data, $tags);
$count = 0;
foreach ( $tags[0] as $tag ){

    echo $tag . '<br />';
    $count++;
}

echo "<br />" . $count . " total head scanned";

For example, the first and second lines together should return 900014002326018, but the code ignores both lines.

I am not good with regular expressions, so if you could explain how the pattern works, that would be awesome: I'd like to learn and stop needing help with simple regex.

EDIT: The whole number is 15 digits starting with 9000

Toby Joiner
  • Which ones are you expecting it to grab? Are the line breaks significant? I know you said they are right, but do they end the number? Do the periods? What is a "complete" number? It isn't really clear what you are looking for. – OtherDevOpsGene Nov 29 '13 at 19:24
  • The "9" on the first line has nothing that ties it to the rest of the number on the second line, so how would you expect a regex to match it? – revo Nov 29 '13 at 19:25
  • Sorry, I thought I added this: The whole number is 15 digits starting with 9000. I don't know that they will always end in .. – Toby Joiner Nov 29 '13 at 19:43
  • The comment by @revo above is really the reason that the solution presented by squeamish ossifrage below should be strongly looked at. You need to take the first steps of just discarding useless data, then this problem becomes much easier to solve. – Mike Brant Nov 29 '13 at 20:09

3 Answers

You can do this:

$result = preg_replace('~\R?(?:[0-9A-F]{2}\h+)+~', '', $data);
$result = explode('..', rtrim($result, '.'));

pattern details:

\R?            # optional newline character
(?:            # open a non-capturing group
  [0-9A-F]{2}  # two hexadecimal characters
  \h+          # horizontal white characters (spaces or tabs)
)+             # repeat the non-capturing group one or more times

After this replacement, the only content left to remove is the pairs of dots (the ASCII rendering of CR LF). Once the trailing dots are trimmed, you can explode the string on '..' to get an array of tags.
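To see the two steps at work, here is a minimal sketch on a short sample. The sample data, the sprintf() padding used to build the 48-character hex column, and the final preg_grep() sanity check are assumptions added for illustration, not part of the original file or answer:

```php
<?php
// Hypothetical four-line sample in the same layout as the question's dump:
// a 48-character hex column followed by the ASCII column.
$data = implode("\n", [
    sprintf('%-48s%s', '39', '9'),
    sprintf('%-48s%s', '30 30 30 31 34 30 30 32 33 32 36 30 31 38 0D 0A', '00014002326018..'),
    sprintf('%-48s%s', '39 30 30 30 31 34 30 30 32 33 32 36 30 35 34 0D', '900014002326054.'),
    sprintf('%-48s%s', '0A', '.'),
]);

// Strip each line's hex column together with the newline in front of it.
$result = preg_replace('~\R?(?:[0-9A-F]{2}\h+)+~', '', $data);

// Drop the trailing dots, then split on the ".." that stands for CR LF.
$result = explode('..', rtrim($result, '.'));

// Assumed sanity check: keep only 15-digit strings that start with 9000.
$result = array_values(preg_grep('/^9000\d{11}$/', $result));

print_r($result);
```

The sanity check is optional, but it would catch a tag that was truncated or glued to its neighbour.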

Another way

Since you know that there are always 48 characters before the digits (and dots), you can use this pattern too:

$result = preg_replace('~(?:^|\R).{48}~', '', $data);

Another way, without regex

The idea is to read the file line by line and, since the length before the content is always the same (i.e. 16*3 characters -> 48 characters), extract the substring that holds the digits and append it to the temporary $data variable.

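// Helps PHP recognise old Mac (\r) line endings
// (note: this setting is deprecated as of PHP 8.1).
ini_set("auto_detect_line_endings", true);
$data = '';
$handle = @fopen("tags.txt", "r");
if ($handle) {
    // Read one line at a time (each dump line is well under 128 bytes).
    while (($buffer = fgets($handle, 128)) !== false) {
        // Skip the 48-character hex column and drop the trailing newline.
        $data .= substr($buffer, 48, -1);
    }
    if (!feof($handle)) {
        echo "Error: fgets() has failed\n";
    }
    fclose($handle);
} else {
    echo "Error opening the file\n";
}

// Trim the trailing dots, then split on the ".." that separates the tags.
$result = explode('..', rtrim($data, '.'));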

Note: if the file is in Windows format (with \r\n line endings), you must change the third parameter of the substr() function to -2. If you are interested in how to detect the newline type, you can take a look at this post.
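One way to sidestep the -1 versus -2 question is to strip the line ending with rtrim() before taking the substring. A minimal sketch, assuming the same 48-character hex column; the in-memory stream stands in for tags.txt and is only there to keep the example self-contained:

```php
<?php
// Stand-in for tags.txt (an assumption for this sketch): two dump lines
// with Windows line endings, written to an in-memory stream.
$handle = fopen('php://memory', 'w+');
fwrite($handle, sprintf('%-48s%s', '39', '9') . "\r\n");
fwrite($handle, sprintf('%-48s%s', '30 30 30 31 34 30 30 32 33 32 36 30 31 38 0D 0A', '00014002326018..') . "\r\n");
rewind($handle);

$data = '';
while (($buffer = fgets($handle)) !== false) {
    // rtrim() removes "\n" or "\r\n", so the same code handles both formats.
    $data .= substr(rtrim($buffer, "\r\n"), 48);
}
fclose($handle);

// Trim the trailing dots, then split on the ".." that separates the tags.
$result = explode('..', rtrim($data, '.'));
print_r($result);
```

This variant does not handle files that use a bare \r as the line ending, since fgets() only splits on \n by default.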

Casimir et Hippolyte

I don't think it's even possible to do this with a single regex, but your code will be far more legible and maintainable if you approach this one step at a time.

This works, and it shouldn't be too hard to figure out how it works:

$eid_tag_src = <<<END_EID_TAGS
39                                              9
30 30 30 31 34 30 30 32 33 32 36 30 31 38 0D 0A 00014002326018..
39 30 30 30 31 34 30 30 32 33 32 36 30 35 34 0D 900014002326054.
  :
 etc.
  :
39 30 30 30 31 34 30 30 32 33                   9000140023
32 36 31 38 36 0D 0A                            26186..
END_EID_TAGS;

/* Remove hex data from first 48 characters of each line */
$eid_tag_src = preg_replace('/^.{48}/m','',$eid_tag_src);

/* Remove all white space */
$eid_tag_src = preg_replace('/\s+/','',$eid_tag_src);

/* Replace dots (CRLF) with spaces */
$eid_tag_src = str_replace('..',' ',$eid_tag_src);

/* Convert to array of EID tags */
$eid_tags = explode(' ',trim($eid_tag_src));

print_r($eid_tags);

Here's the output:

Array
(
    [0] => 900014002326018
    [1] => 900014002326054
    [2] => 900014002326091
    [3] => 900014002326163
    [4] => 900014002326200
    [5] => 900014002326200
    [6] => 900014002326122
    [7] => 900014002326154
    [8] => 900014002326131
    [9] => 900014002326104
    [10] => 900014002326090
    [11] => 900014002326197
    [12] => 900014002326208
    [13] => 900014002326115
    [14] => 900014002326164
    [15] => 900014002326016
    [16] => 900014002326246
    [17] => 900014002326246
    [18] => 900014002326079
    [19] => 900014002326120
    [20] => 900014002326228
    [21] => 900014002326186
)
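To make each step visible, the same four calls can be traced on a short sample. This is only an illustrative sketch: the two-tag sample and the $step1 to $step3 variable names are hypothetical, and sprintf() is used just to pad the hex column to 48 characters:

```php
<?php
// Hypothetical two-tag sample; the second tag is split across two lines
// the same way the question's dump splits them.
$src = implode("\n", [
    sprintf('%-48s%s', '39 30 30 30 31 34 30 30 32 33 32 36 30 35 34 0D', '900014002326054.'),
    sprintf('%-48s%s', '0A', '.'),
    sprintf('%-48s%s', '39 30 30 30 31 34 30 30 32 33', '9000140023'),
    sprintf('%-48s%s', '32 36 31 38 36 0D 0A', '26186..'),
]);

$step1 = preg_replace('/^.{48}/m', '', $src); // only the ASCII column is left
$step2 = preg_replace('/\s+/', '', $step1);   // line breaks removed, pieces joined
$step3 = str_replace('..', ' ', $step2);      // each ".." (CR LF) becomes a space
$eid_tags = explode(' ', trim($step3));       // split into individual tags

var_dump($step1, $step2, $step3);
print_r($eid_tags);
```

The /m modifier is what lets ^ match at the start of every line rather than only at the start of the string.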
r3mainer
  • I like this approach, though it could be argued that most steps could be achieved by simple string manipulation rather than regex. – Mike Brant Nov 29 '13 at 19:53
  • @MikeBrant That's true. I'll see if I can cut out some of the preg_replace calls – r3mainer Nov 29 '13 at 19:55
  • OK, I used str_replace to replace the dots. The other two calls would involve too much faffing around – r3mainer Nov 29 '13 at 20:01
  • @squamishossifrage I would think the first regex (to eliminate first 48 characters of each line) would be the easiest one to perform via string manipulation, particularly if the file were read in one line at a time (rather than entire file at once as OP shows). – Mike Brant Nov 29 '13 at 20:08

Here's an approach that captures the digits directly (without replacing anything):

RegEx: /(?:^.{48}|\.)([0-9]+\.?)/m - explained demo

In plain English: capture a run of digits, plus an optional trailing dot, IF it is preceded either by the first 48 characters of a line OR by a dot (the special case where a new tag starts in the middle of a line).
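Both branches of the alternation can be seen on a short sample in which a new tag starts in the middle of a line. A minimal sketch; the sample lines are rebuilt from the question's dump with sprintf() padding, and the $pieces, $joined and $list names are hypothetical:

```php
<?php
// Sample with the special case: the third line ends one tag ("26016..")
// and starts the next ("900014002") in the same ASCII column.
$data = implode("\n", [
    sprintf('%-48s%s', '39', '9'),
    sprintf('%-48s%s', '30 30 30 31 34 30 30 32 33', '000140023'),
    sprintf('%-48s%s', '32 36 30 31 36 0D 0A 39 30 30 30 31 34 30 30 32', '26016..900014002'),
    sprintf('%-48s%s', '33', '3'),
    sprintf('%-48s%s', '32 36 32 34 36 0D 0A', '26246..'),
]);

preg_match_all('/(?:^.{48}|\.)([0-9]+\.?)/m', $data, $tags);

// Group 1 holds the digit runs; a run that ends a tag keeps one trailing dot.
$pieces = $tags[1];

// Joining the pieces leaves exactly one dot after every complete tag.
$joined = implode('', $pieces);
$list = explode('.', rtrim($joined, '.'));

print_r($pieces);
print_r($list);
```

The "900014002" piece is the one captured through the \. branch; every other piece comes from the ^.{48} branch.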

And your code could look like this:

$pattern = '/(?:^.{48}|\.)([0-9]+\.?)/m'; 

preg_match_all($pattern, $data, $tags);

//join all the bits belonging to the number
$data=implode("", $tags[1]); 

//count the dots to have a correct count of the numbers grabbed
//since each number was grabbed with an ending dot initially
$count=substr_count($data, ".");

//replace the dots with an HTML <br> tag (avoiding a split and a foreach loop)
$tags=str_replace('.', "<br>", $data); 

print $tags . "<br>" . $count . " total scanned";

See the code live at http://3v4l.org/Z4EhI

CSᵠ