I have the data below in plain-text format, and I am trying to split it into a specific format so that I can save each value in a database table:

G28585 alphabounce+ 20 $55.00 $55.00 $1,100.00 FTWWHT/CYBEMT/ECRTIN Size 9- 11 11- 12 12- 13 14 Qty 2 6 2 2 1 4 3 D97028 adizero 8.0 100 $66.00 $66.00 $6,600.00 FTWWHT/POWRED/ACTRED Size 9 9- 10 10- 11 11- 12 12- 13 13- 14 Qty 5 5 10 8 15 10 17 9 14 4 3 D97031 adizero 8.0 SK 68 $68.75 $68.75 $4,675.00 FTWWHT/POWRED/ACTRED Size 10 10- 11 11- 12 12- 13 13- 14 15 Qty 3 4 4 3 5 3 25 4 15 2 F97396 Freak Carbon Low 37 $49.50 $49.50 $1,831.50 FTWWHT/POWRED/ACTRED Size 9 9- 10 10- 11 11- 12 12- 13 14 Qty 2 1 3 4 3 2 6 3 7 6

I am trying to save it in the format below. What I can't work out is how to separate the text, since it has no delimiter:

G28585 alphabounce+ 20 $55.00 $55.00 $1,100.00 FTWWHT/CYBEMT/ECRTIN Size 9- 11 11- 12 12- 13 14 Qty 2 6 2 2 1 4 3

............

D97028 adizero 8.0 100 $66.00 $66.00 $6,600.00 FTWWHT/POWRED/ACTRED Size 9 9- 10 10- 11 11- 12 12- 13 13- 14 Qty 5 5 10 8 15 10 17 9 14 4 3

Please guide me; any help will be appreciated. The text was extracted from a PDF.

Mohanish
  • What logic do you want to use to separate it? – dWinder Dec 20 '18 at 08:47
  • I am afraid that if you don't know how to separate things, then we certainly can't tell you. You must at least have some rules, a pattern, something. – arkascha Dec 20 '18 at 08:48
  • Could you provide some code showing what you have tried so far, and describe a specific problem? – mbuechmann Dec 20 '18 at 08:52
  • The break point would appear to be the `G28585`/`D97028` fields. As that format, i.e. a capital letter and 5 digits, appears to be unique in the data you show, a regex could be built? – RiggsFolly Dec 20 '18 at 08:52
  • Yes @RiggsFolly, a break point is needed, but the first field will not always be a capital letter and 5 digits, so I can't use that to separate. – Mohanish Dec 20 '18 at 08:58
  • Have you looked at the string closely? Maybe it has tabs or some other invisible character? – Andreas Dec 20 '18 at 09:01
  • Maybe the regex pattern `/(?=[A-Z]\d+)/` https://regex101.com/r/Xt2hXr/1/ – Mohammad Dec 20 '18 at 09:04
  • Yes @Andreas, I tried to find one, but there is nothing except spaces. All I get after the PDF-to-text conversion is plain text... – Mohanish Dec 20 '18 at 09:04
  • Maybe you can work with the PDF instead? There are PDF libraries for PHP. – Andreas Dec 20 '18 at 09:05
  • I used PdfParser (https://pdfparser.org/demo) to get this text. – Mohanish Dec 20 '18 at 09:07
  • @PrabowoMurti do you really think that? Why would you create a PHP program to do something he already did in the question? – Andreas Dec 20 '18 at 09:07
  • @Mohanish I have never used that parser (or any PDF parser), but I would look at the code in the parser to see if you can spot where it goes from one product to the next. I bet there is some code you can find where it returns to the left, or where it reads the black line between two products. – Andreas Dec 20 '18 at 09:10

3 Answers

You can do it like this if the product code always consists of one capital letter followed by 5 digits:

$str = "G28585 alphabounce+ 20 $55.00 $55.00 $1,100.00 FTWWHT/CYBEMT/ECRTIN Size 9- 11 11- 12 12- 13 14 Qty 2 6 2 2 1 4 3 D97028 adizero 8.0 100 $66.00 $66.00 $6,600.00 FTWWHT/POWRED/ACTRED Size 9 9- 10 10- 11 11- 12 12- 13 13- 14 Qty 5 5 10 8 15 10 17 9 14 4 3 D97031 adizero 8.0 SK 68 $68.75 $68.75 $4,675.00 FTWWHT/POWRED/ACTRED Size 10 10- 11 11- 12 12- 13 13- 14 15 Qty 3 4 4 3 5 3 25 4 15 2 F97396 Freak Carbon Low 37 $49.50 $49.50 $1,831.50 FTWWHT/POWRED/ACTRED Size 9 9- 10 10- 11 11- 12 12- 13 14 Qty 2 1 3 4 3 2 6 3 7 6";

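// Split before each product code: a zero-width lookahead for one capital letter and exactly 5 digits.
$splitted = preg_split('/(?=\b[A-Z][0-9]{5}\b)/', $str, -1, PREG_SPLIT_NO_EMPTY);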

print_r($splitted);

http://sandbox.onlinephpfunctions.com/code/7ad702ae20c9884c8c2348052610ad979f43d876
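
Once the text is split into records, each record still has to be broken into fields before it can be stored. A minimal sketch of that step, assuming every record contains the literal keywords `Size` and `Qty` exactly as in the sample; `splitRecord` is a hypothetical helper name, not part of any library:

```php
<?php

// Hypothetical helper: break one record into the part before "Size",
// the size list, and the quantity list.
function splitRecord(string $record): array
{
    // Assumes " Size " and " Qty " each occur once, in that order.
    if (!preg_match('/^(.*?) Size (.*?) Qty (.*)$/', trim($record), $m)) {
        return []; // the record does not have the expected shape
    }

    return [
        'header'     => $m[1],
        'sizes'      => $m[2],
        'quantities' => array_map('intval', preg_split('/\s+/', trim($m[3]))),
    ];
}

$record = 'G28585 alphabounce+ 20 $55.00 $55.00 $1,100.00 FTWWHT/CYBEMT/ECRTIN Size 9- 11 11- 12 12- 13 14 Qty 2 6 2 2 1 4 3 ';
$fields = splitRecord($record);

print_r($fields);
```

In the sample data, the quantities of each record add up to the number that precedes the prices (20 for `G28585`), so `array_sum($fields['quantities'])` can serve as a cheap sanity check on the split.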

Shibon

You could place a delimiter in front of the product code (which appears to follow a unique pattern) and then split on that. Here I'm using a newline as the delimiter:

<?php

$txt = "G28585 alphabounce+ 20 $55.00 $55.00 $1,100.00 FTWWHT/CYBEMT/ECRTIN Size 9- 11 11- 12 12- 13 14 Qty 2 6 2 2 1 4 3 D97028 adizero 8.0 100 $66.00 $66.00 $6,600.00 FTWWHT/POWRED/ACTRED Size 9 9- 10 10- 11 11- 12 12- 13 13- 14 Qty 5 5 10 8 15 10 17 9 14 4 3 D97031 adizero 8.0 SK 68 $68.75 $68.75 $4,675.00 FTWWHT/POWRED/ACTRED Size 10 10- 11 11- 12 12- 13 13- 14 15 Qty 3 4 4 3 5 3 25 4 15 2 F97396 Freak Carbon Low 37 $49.50 $49.50 $1,831.50 FTWWHT/POWRED/ACTRED Size 9 9- 10 10- 11 11- 12 12- 13 14 Qty 2 1 3 4 3 2 6 3 7 6";

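var_export(
    preg_split(
        '/\R/',                     // split on the inserted newlines
        trim(
            preg_replace(
                '/([A-Z][0-9]+)/',  // product code: a capital letter followed by digits
                "\n$1",             // put a newline in front of each code
                $txt
            )
        )
    )
);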

Output:

array (
  0 => 'G28585 alphabounce+ 20 $55.00 $55.00 $1,100.00 FTWWHT/CYBEMT/ECRTIN Size 9- 11 11- 12 12- 13 14 Qty 2 6 2 2 1 4 3 ',
  1 => 'D97028 adizero 8.0 100 $66.00 $66.00 $6,600.00 FTWWHT/POWRED/ACTRED Size 9 9- 10 10- 11 11- 12 12- 13 13- 14 Qty 5 5 10 8 15 10 17 9 14 4 3 ',
  2 => 'D97031 adizero 8.0 SK 68 $68.75 $68.75 $4,675.00 FTWWHT/POWRED/ACTRED Size 10 10- 11 11- 12 12- 13 13- 14 15 Qty 3 4 4 3 5 3 25 4 15 2 ',
  3 => 'F97396 Freak Carbon Low 37 $49.50 $49.50 $1,831.50 FTWWHT/POWRED/ACTRED Size 9 9- 10 10- 11 11- 12 12- 13 14 Qty 2 1 3 4 3 2 6 3 7 6',
)
Progrock
  • Thank you @Progrock for this solution, but it does not work for some text, such as: 12FXPK2SDUJW9Y SOUTH DAKOTA, U OF SDAKO GAMEMODE POL 11 $37.50 $37.50 $412.50 PWRD MEL/WHT Size S M L X Qty 1 6 3 1 – Mohanish Dec 20 '18 at 09:41
  • Using a positive lookahead as @Shibon does, you can do this in one pass. – Progrock Dec 20 '18 at 09:53
  • @Mohanish Oh, a different data sample. You'll have to adapt the regex to suit. – Progrock Dec 20 '18 at 09:54
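
If the style code has no fixed shape, as in the `12FXPK2SDUJW9Y` sample above, one option is to anchor on the end of a record instead of its start. A rough sketch, assuming every record ends with `Qty` followed by space-separated whole numbers, and that the next style code is never purely numeric (otherwise it could not be told apart from a quantity):

```php
<?php

$txt = 'G28585 alphabounce+ 20 $55.00 $55.00 $1,100.00 FTWWHT/CYBEMT/ECRTIN Size 9- 11 11- 12 12- 13 14 Qty 2 6 2 2 1 4 3 12FXPK2SDUJW9Y SOUTH DAKOTA, U OF SDAKO GAMEMODE POL 11 $37.50 $37.50 $412.50 PWRD MEL/WHT Size S M L X Qty 1 6 3 1';

// A record runs from its first non-space character up to "Qty" plus the
// run of standalone numbers after it. (?!\S) requires each number to be
// followed by whitespace or the end of the string, so the leading "12" of
// "12FXPK2SDUJW9Y" is not mistaken for a quantity.
preg_match_all('/\S.*? Qty(?: \d+(?!\S))+/s', $txt, $m);
$records = $m[0];

print_r($records);
```

Because the pattern never looks at the style code itself, it does not matter whether a record starts with `G28585` or `12FXPK2SDUJW9Y`.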

Another solution, using a regular expression with named groups:

$raw = 'G28585 SOUTH DAKOTA, U OF SDAKO GAMEMODE POL 20 $55.00 $55.00 $1,100.00 FTWWHT/CYBEMT/ECRTIN Size 9- 11 11- 12 12- 13 14 Qty 2 6 2 2 1 4 3 D97028 adizero 8.0 100 $66.00 $66.00 $6,600.00 FTWWHT/POWRED/ACTRED Size 9 9- 10 10- 11 11- 12 12- 13 13- 14 Qty 5 5 10 8 15 10 17 9 14 4 3 D97031 adizero 8.0 SK 68 $68.75 $68.75 $4,675.00 FTWWHT/POWRED/ACTRED Size 10 10- 11 11- 12 12- 13 13- 14 15 Qty 3 4 4 3 5 3 25 4 15 2 F97396 Freak Carbon Low 37 $49.50 $49.50 $1,831.50 FTWWHT/POWRED/ACTRED Size 9 9- 10 10- 11 11- 12 12- 13 14 Qty 2 1 3 4 3 2 6 3 7 6';

$pattern ='/(?<style>\w+) (?<descr>[a-zA-Z, 0-9.+]+) (?<prices>[$0-9., ]+) (?<color>\w+\/\w+\/\w+) Size (?<size>[0-9 -]+) Qty (?<quantity>[0-9 ]+)/';

preg_match_all($pattern, $raw, $matches);
// Drop the numeric keys (0 to 6) so that only the named groups remain.
$matches = array_diff_key($matches, range(0, 6));
print_r($matches);

Output:

Array
(
    [style] => Array
        (
            [0] => G28585
            [1] => D97028
            [2] => D97031
            [3] => F97396
        )

    [descr] => Array
        (
            [0] => SOUTH DAKOTA, U OF SDAKO GAMEMODE POL 20
            [1] => adizero 8.0 100
            [2] => adizero 8.0 SK 68
            [3] => Freak Carbon Low 37
        )

    [prices] => Array
        (
            [0] => $55.00 $55.00 $1,100.00
            [1] => $66.00 $66.00 $6,600.00
            [2] => $68.75 $68.75 $4,675.00
            [3] => $49.50 $49.50 $1,831.50
        )

    [color] => Array
        (
            [0] => FTWWHT/CYBEMT/ECRTIN
            [1] => FTWWHT/POWRED/ACTRED
            [2] => FTWWHT/POWRED/ACTRED
            [3] => FTWWHT/POWRED/ACTRED
        )

    [size] => Array
        (
            [0] => 9- 11 11- 12 12- 13 14
            [1] => 9 9- 10 10- 11 11- 12 12- 13 13- 14
            [2] => 10 10- 11 11- 12 12- 13 13- 14 15
            [3] => 9 9- 10 10- 11 11- 12 12- 13 14
        )

    [quantity] => Array
        (
            [0] => 2 6 2 2 1 4 3 
            [1] => 5 5 10 8 15 10 17 9 14 4 3 
            [2] => 3 4 4 3 5 3 25 4 15 2 
            [3] => 2 1 3 4 3 2 6 3 7 6
        )

)
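
For database inserts it is usually more convenient to have one array per product rather than one array per field. A sketch of that variation, using the same pattern with the `PREG_SET_ORDER` flag; the shortened `$raw` sample and the `$records` variable name are only for illustration:

```php
<?php

$raw = 'D97028 adizero 8.0 100 $66.00 $66.00 $6,600.00 FTWWHT/POWRED/ACTRED Size 9 9- 10 10- 11 11- 12 12- 13 13- 14 Qty 5 5 10 8 15 10 17 9 14 4 3 F97396 Freak Carbon Low 37 $49.50 $49.50 $1,831.50 FTWWHT/POWRED/ACTRED Size 9 9- 10 10- 11 11- 12 12- 13 14 Qty 2 1 3 4 3 2 6 3 7 6';

$pattern = '/(?<style>\w+) (?<descr>[a-zA-Z, 0-9.+]+) (?<prices>[$0-9., ]+) (?<color>\w+\/\w+\/\w+) Size (?<size>[0-9 -]+) Qty (?<quantity>[0-9 ]+)/';

// PREG_SET_ORDER returns one array per match instead of one array per group.
preg_match_all($pattern, $raw, $rows, PREG_SET_ORDER);

$records = [];
foreach ($rows as $row) {
    // Keep only the named groups; the numeric keys duplicate them.
    $record = array_filter($row, 'is_string', ARRAY_FILTER_USE_KEY);

    // Turn the space-separated lists into arrays.
    $record['prices']   = explode(' ', $record['prices']);
    $record['quantity'] = array_map('intval', explode(' ', trim($record['quantity'])));

    $records[] = $record;
}

print_r($records);
```

Each element of `$records` then maps directly onto one row. As in the output above, `descr` still carries the order total at its end (for example `adizero 8.0 100`).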
buildok