
I'm in a pickle where I have the following set of lines:

John Smith
John Smith +1
John Smith (drink)
John Smith              (    drink      )         
John Smith, drink
John Smith   ,    drink
John Smith   +1   ,    drink
John Smith +1 (drink)
John Smith +1, drink
John Smith +1 drink

What I need to do is get them into an array like

'array' => 
    'name' => 'John Smith',
    'plus' => '',
    'comment' => ''
,
'array' =>
    'name' => 'John Smith',
    'plus' => '+1',
    'comment' => ''
,
'array' => 
    'name' => 'John Smith',
    'plus' => '',
    'comment' => 'drink'

and so on ... which seems like I'd need some Google-level regexes here. So far I explode the entire .txt file by \n, foreach over the lines, and then explode each line by space, but then I just find myself in the middle of a hell of a mess. So if anyone has any better ideas on how to do this, I would kill for that knowledge. Any help is appreciated. By any I mean any kind at all.

Codefoe
  • What should the ones with lots of spaces look like? And the ones with ',' in them? – ellak Mar 01 '13 at 13:12
  • Clients are idiots so the lines may look extremely different. But all I need is the name, the optional plus and a comment, so that I can later validate that the name has at least 2 names and doesn't include any weird characters. Commas are useless so they can be stripped, because not all lines have commas, and the closest I could come to anything was exploding by space, which still didn't get me anywhere. – Codefoe Mar 01 '13 at 13:14
  • I wouldn't explode by spaces unless you are sure "comment" will have none of them. – Voitcus Mar 01 '13 at 13:17
  • Am not. That's the thing, a ton of spaces and no earthly idea on how to make it all make some sense. – Codefoe Mar 01 '13 at 13:19
  • What about O'Connors or Mc Gregor? – Toto Mar 01 '13 at 13:44
  • Clients may or may not be idiots, but where is the data coming from? How are they generating it? Are different clients providing different formats, or are all those possible variations coming from a single client? Do you really have no control at all over the data format? – SDC Mar 01 '13 at 13:51
  • by the way, re "validating that the name has at least two names and no weird characters" -- I feel I should point you toward [this answer](http://stackoverflow.com/questions/3853346/how-to-validate-human-names-in-cakephp/3853820#3853820) – SDC Mar 01 '13 at 14:05
  • Is the second word either always a number or preceded by a comma or paren? – Explosion Pills Mar 01 '13 at 14:05

2 Answers


Let me present you with a very brittle solution that works with your example strings:

^ *+([A-Za-z ]*[A-Za-z]) *+(\+\d+)?+ *+(?|,?+ *+\( *+(.*\S) *\) *|,?+ *+(.*\S) *)?$

Name will be in capturing group 1. Number (sign included) will be in capturing group 2. Comment will be in capturing group 3.

Currently, the assumption is that the name can contain only spaces and letters of the English alphabet.

Another assumption is that only the space character (ASCII 32) is recognized as a spacing character.

Demo (Please ignore the flags, they are for demonstration purpose only).
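
For anyone who wants to try the pattern from PHP, here is a minimal sketch rather than a definitive implementation. It assumes `/` as the delimiter (the pattern contains no `/`, so nothing needs escaping), and the function name `parseGuestLine` and the returned keys are hypothetical, chosen to mirror the array shape asked for in the question. Note that PHP drops trailing groups that did not participate in the match from `$matches`, hence the `??` defaults.

```php
<?php
// Hypothetical helper: the name and the return shape are assumptions,
// not part of the original answer.
function parseGuestLine(string $line): ?array
{
    // The pattern from above, wrapped in "/" delimiters.
    // (?|...) is a branch-reset group, so the comment lands in group 3
    // whichever alternative matches.
    $pattern = '/^ *+([A-Za-z ]*[A-Za-z]) *+(\+\d+)?+ *+(?|,?+ *+\( *+(.*\S) *\) *|,?+ *+(.*\S) *)?$/';

    if (preg_match($pattern, $line, $m) !== 1) {
        return null; // the line does not fit the expected format
    }

    return [
        'name'    => $m[1],
        // Unmatched trailing groups are absent from $m, so default to ''.
        'plus'    => $m[2] ?? '',
        'comment' => $m[3] ?? '',
    ];
}
```

Called as `parseGuestLine('John Smith +1 (drink)')`, this should give name `John Smith`, plus `+1` and comment `drink`. A line that breaks the assumptions above (a name like O'Connor, for instance) returns `null`, so it can be set aside for manual review.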

nhahtdh
  • `$match = preg_match('^ *+([A-Za-z ]*[A-Za-z]) *+(\+\d+)?+ *+(?|,?+ *+\( *+(.*\S) *\) *|,?+ *+(.*\S) *)?$/', $content, $matches);` results in an error: `preg_match() No ending delimiter '^'` – Codefoe Mar 01 '13 at 14:16
  • @Codefoe - for the above regex in preg_match, add a `/` at each end as a delimiter. – SDC Mar 01 '13 at 14:26

Another brittle regex for the road that works with your sample:

$lines = array
(
"John Smith",
"John Smith +1",
"John Smith (drink)",
"John Smith              (    drink      )",
"John Smith, drink",
"John Smith   ,    drink",
"John Smith   +1   ,    drink",
"John Smith +1 (drink)",
"John Smith +1, drink",
"John Smith +1 drink"
);

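foreach ($lines as $line)
{
    // Named groups: "name" is one or two words, "plus" is an optional +N,
    // "comment" is a single optional word, possibly wrapped in parentheses.
    preg_match('/^(?<name>\w+(?:\s+\w+)?)(?:[\s,]+(?<plus>\+\d+))?(?:[\s,\(]+(?<comment>\w+)[\s\)]*)?$/', $line, $matches);
    // Groups that did not match at the end of the pattern are missing from $matches.
    var_dump($matches);
}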
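
To get from the `var_dump` output to the array shape asked for in the question, something like the following might do. It is only a sketch under stated assumptions: the function name `parseGuestList` is hypothetical, the input is taken to be the whole file contents as one string, and lines that don't match are silently skipped (collecting them for review may be the better choice in practice).

```php
<?php
// Hypothetical helper: the name, the input format and the skip-on-failure
// behaviour are assumptions for illustration.
function parseGuestList(string $text): array
{
    $pattern = '/^(?<name>\w+(?:\s+\w+)?)(?:[\s,]+(?<plus>\+\d+))?(?:[\s,\(]+(?<comment>\w+)[\s\)]*)?$/';
    $rows = [];

    // \R matches any newline sequence, so \n and \r\n files both work.
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // ignore blank lines
        }
        if (preg_match($pattern, $line, $m) !== 1) {
            continue; // line doesn't fit the expected format; skipped here
        }
        $rows[] = [
            'name'    => $m['name'],
            // Trailing groups that did not match are absent, so default to ''.
            'plus'    => $m['plus'] ?? '',
            'comment' => $m['comment'] ?? '',
        ];
    }

    return $rows;
}
```

With a file on disk this could be used as `parseGuestList(file_get_contents($path))`, where `$path` is whatever your .txt file is called.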
Timothée Groleau