
I am trying to parse some files line by line and split each line into columns. Two consecutive columns contain words, and columns are separated by two or more spaces. Because a column's value can itself contain single spaces, I am having trouble separating these two.

Examples of lines:

2236        ARGEMIRO PATROCINIO                                   ARGEMIRO                 I       I          UBC            3,8462

1150721     ZACHARY F CONDON                                      ZACH CONDON               I       I          FINTAGE        8,3333

50300       COMERCIAL FONOGRAFICA RGE LTDA.                                                 PF      LI         ABRAMUS       25,0000

(fixed)

obs.: it's not showing all the spaces between '2236', 'ARGEMIRO PATROCINIO', 'ARGEMIRO', 'I', 'I', 'UBC' and '3,8462'

I am using this regex:

(\d+)\s+([\.a-zA-Z\s,'À-úÀ-ÿ()\?\-\/\d]+)\s{2,}([\.a-zA-Z\s,'À-úÀ-ÿ()\?\-\/\d]+)\s{2,}(I|PF|MA)\s{2,}(I|PF|PL|LI|MA|CV|MJ)\s{2,}(\w+)\s{2,}(\d+,\d{4})

but unfortunately "ARGEMIRO PATROCINIO" is captured together with the second column's "ARGEMIRO", "ZACHARY F CONDON" together with "ZACH CONDON", and so on.

So,

  1. How can I fix this regex so that it separates these two "columns"?
  2. What would another regex look like that captures anything between runs of two or more spaces across these 7 columns?

Thank you!

Casimir et Hippolyte
  • `preg_split('/\s+/',..` ? – splash58 May 01 '17 at 20:52
  • I think `preg_split` looks a much tidier solution than [this "fix"](https://regex101.com/r/pOJYmb/1). – Wiktor Stribiżew May 01 '17 at 21:05
  • @WiktorStribiżew why does adding those two "?" make the regex work? preg_split does a good job, but this regex maintains the column structure, so I can detect what type of data I am reading. Can you explain? Perhaps as an answer? – Guilherme Sampaio May 01 '17 at 23:45
  • I suppose you are also using the `/U` modifier, right? It inverts the greediness. Thus, when you use `*?` or `+?` with `/U`, they are actually *greedy*. – Wiktor Stribiżew May 01 '17 at 23:48

3 Answers

1

I'm not actually seeing double spaces in the data you pasted, but you describe it that way. You can do this to split wherever there are two or more consecutive whitespace characters:

preg_split("/[\s]{2,}/", $data);

DEMO: http://www.phpliveregex.com/p/jWZ (click "preg_split" on the right)
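As a rough sketch of how this split behaves, here it is applied to two lines modelled on the question. The variable names and the exact spacing are illustrative assumptions, not taken from the original files. Note that a line with an empty column simply comes back with fewer elements:

```php
<?php
// Sample lines modelled on the question; the exact spacing is an assumption.
$full  = "2236        ARGEMIRO PATROCINIO        ARGEMIRO        I       I          UBC            3,8462";
$short = "50300       COMERCIAL FONOGRAFICA RGE LTDA.                    PF      LI         ABRAMUS       25,0000";

// Split on runs of two or more whitespace characters.
$fullColumns  = preg_split('/\s{2,}/', trim($full));
$shortColumns = preg_split('/\s{2,}/', trim($short));

echo count($fullColumns), PHP_EOL;   // 7 columns
echo count($shortColumns), PHP_EOL;  // 6 columns: the empty column leaves no trace
```

Because the split only sees separators, it cannot tell which column was empty, so the position of each value is not guaranteed.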

Jeremy Harris
  • Yes, when I put it into the code markup it collapses to a single space. preg_split seems to do the job, but it doesn't maintain the columns that have no data, like this one: `163587 WELLINGTON POMPEU DO NASCIMENTO MA MA 1,1857` – Guilherme Sampaio May 01 '17 at 23:27
  • There is no benefit to wrapping `\s` in square brackets (the character class is completely unnecessary). Just use `/\s{2,}/`. – mickmackusa Jun 11 '22 at 05:31
0

You should understand how greediness works. When a subpattern is lazily quantified, the engine first skips it and tries the subsequent patterns. Only if no match is found does the engine go back to the lazily quantified pattern, let it match a single additional character, and test the subsequent subpatterns again. The mechanism is similar to backtracking, but it goes forward.
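To make the difference concrete, here is a minimal sketch on a toy string. The string and variable names are made up for illustration. It also shows the `/U` modifier swapping the meaning of `+?`:

```php
<?php
// Toy input: three "columns" separated by two spaces each (illustrative only).
$text = 'AB  CD  EF';

// Greedy: (.+) grabs as much as it can, then gives characters back.
preg_match('/^(.+)\s{2,}(.+)$/', $text, $greedy);

// Lazy: (.+?) takes one character at a time until the rest of the pattern fits.
preg_match('/^(.+?)\s{2,}(.+)$/', $text, $lazy);

// With /U the greediness is swapped, so +? behaves greedily here.
preg_match('/^(.+?)\s{2,}(.+)$/U', $text, $swapped);

var_dump($greedy[1]);  // "AB  CD"
var_dump($lazy[1]);    // "AB"
var_dump($swapped[1]); // "AB  CD"
```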

So, what you can do is make the second and third column patterns lazy. (I assume you are using the /U greediness-swapping modifier; my advice is not to use it, so that the pattern stays as clear as possible):

(\d+)\s+([-.a-zA-Z\s,'À-úÀ-ÿ()?\/\d]+?)\s{2,}([-.a-zA-Z\s,'À-úÀ-ÿ()?\/\d]+?)\s{2,}(I|PF|MA)\s{2,}(I|PF|PL|LI|MA|CV|MJ)\s{2,}(\w+)\s{2,}(\d+,\d{4})

Add anchors (^ at the start and $ at the end) and /m modifier if you need to match full lines only.

See the regex demo.

See the ([-.a-zA-Z\s,'À-úÀ-ÿ()?\/\d]+?) patterns: they have the +? lazy quantifier, which matches one or more characters, as few as possible.

Note I made some cosmetic changes, too: . does not need to be escaped in a character class, and -, when placed at the start of a character class, never needs to be escaped to denote a literal -.
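A possible way to try the lazy pattern from PHP is sketched below. The sample line's spacing, the variable names, the added `^`/`$` anchors and the `/u` flag are assumptions made for this illustration. The `/u` flag is there so that ranges such as `À-ÿ` are read as Unicode code points:

```php
<?php
// The lazy pattern, anchored. A nowdoc avoids escaping the single quote in the class.
$pattern = <<<'REGEX'
/^(\d+)\s+([-.a-zA-Z\s,'À-úÀ-ÿ()?\/\d]+?)\s{2,}([-.a-zA-Z\s,'À-úÀ-ÿ()?\/\d]+?)\s{2,}(I|PF|MA)\s{2,}(I|PF|PL|LI|MA|CV|MJ)\s{2,}(\w+)\s{2,}(\d+,\d{4})$/u
REGEX;

// Sample line modelled on the question; the exact spacing is an assumption.
$line = "1150721     ZACHARY F CONDON          ZACH CONDON       I       I          FINTAGE        8,3333";

$matched = preg_match($pattern, $line, $m);
$columns = array_slice($m, 1); // drop the full match, keep the 7 capture groups

print_r($columns);
```

Each capture group then lines up with one column, so `$columns[1]` and `$columns[2]` hold the two names separately.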

Wiktor Stribiżew
0

Normally, I would say this regex is what you need:

/(\d+)\s{2,}([.a-zA-Z,'À-úÀ-ÿ()?\-\/\d]+(?:\s?[.a-zA-Z,'À-úÀ-ÿ()?\-\/\d])*)\s{2,}([.a-zA-Z,'À-úÀ-ÿ()?\-\/\d]+(?:\s?[.a-zA-Z,'À-úÀ-ÿ()?\-\/\d])*)\s{2,}(I|PF|MA)\s{2,}(I|PF|PL|LI|MA|CV|MJ)\s{2,}(\w+)\s{2,}(\d+,\d{4})/

But since the last record has only 6 columns, this regex won't match it: https://regex101.com/r/YynbpP/1

My suggestion is you rethink which columns could be optional.
Then adjust the regex accordingly.

For example, group 2 and 3 are identical in structure.
If you expect the second one is optional, the proper regex is this:

/(\d+)\s{2,}([.a-zA-Z,'À-úÀ-ÿ()?\-\/\d]+(?:\s?[.a-zA-Z,'À-úÀ-ÿ()?\-\/\d])*)(?|\s{2,}((?:[.a-zA-Z,'À-úÀ-ÿ()?\-\/\d]+(?:\s?[.a-zA-Z,'À-úÀ-ÿ()?\-\/\d])*))|())\s{2,}(I|PF|MA)\s{2,}(I|PF|PL|LI|MA|CV|MJ)\s{2,}(\w+)\s{2,}(\d+,\d{4})/

https://regex101.com/r/ohtTfO/2

This maintains the column structure.

Note that if the 3rd column entry is missing, the line most likely does not
contain an extra \s{2,} separator for it, so you can't just make the whole thing optional,
because that would turn column 3 into a null instead of an empty string.

To fix that I just used a branch reset
(?|\s{2,}(data)|()) which always matches column 3
and makes it an empty string if it's not there...
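The difference can be sketched with a simplified, hypothetical three-column layout (id, optional name, amount) rather than the full pattern. The patterns, the sample lines and the use of the `PREG_UNMATCHED_AS_NULL` flag (PHP 7.2+) are assumptions chosen to make the null versus empty-string distinction visible:

```php
<?php
// Hypothetical simplified layout: id, optional name, amount.
$optional    = '/^(\d+)(?:\s{2,}([A-Za-z]+(?: [A-Za-z]+)*))?\s{2,}(\d+,\d{4})$/';
$branchReset = '/^(\d+)(?|\s{2,}([A-Za-z]+(?: [A-Za-z]+)*)|())\s{2,}(\d+,\d{4})$/';

$missing = "50300    25,0000";                    // the name column is missing
$present = "2236    ARGEMIRO PATROCINIO    3,8462";

// PREG_UNMATCHED_AS_NULL reports groups that took no part in the match as null.
preg_match($optional, $missing, $a, PREG_UNMATCHED_AS_NULL);
preg_match($branchReset, $missing, $b, PREG_UNMATCHED_AS_NULL);
preg_match($branchReset, $present, $c, PREG_UNMATCHED_AS_NULL);

var_dump($a[2]); // NULL: the optional group was skipped entirely
var_dump($b[2]); // string(0) "": the empty alternative matched
var_dump($c[2]); // string(19) "ARGEMIRO PATROCINIO"
```

In both alternatives of the branch reset the capture keeps the same group number, so the later columns do not shift.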

Formatted (for ease of use)

 ( \d+ )                                  # (1)
 \s{2,} 
 (                                        # (2 start)
      [.a-zA-Z,'À-úÀ-ÿ()?\-/\d]+ 
      (?:
           \s? 
           [.a-zA-Z,'À-úÀ-ÿ()?\-/\d] 
      )*
 )                                        # (2 end)
 (?|
      \s{2,} 
      (                                        # (3 start)
           (?:
                [.a-zA-Z,'À-úÀ-ÿ()?\-/\d]+ 
                (?:
                     \s? 
                     [.a-zA-Z,'À-úÀ-ÿ()?\-/\d] 
                )*
           )
      )                                        # (3 end)
   |  ( )                                      # (3)
 )
 \s{2,} 
 ( I | PF | MA )                          # (4)
 \s{2,} 
 ( I | PF | PL | LI | MA | CV | MJ )      # (5)
 \s{2,} 
 ( \w+ )                                  # (6)
 \s{2,} 
 ( \d+ , \d{4} )                          # (7)