-1

I need to parse a csv file to grab some infos from each row ( Company code, Company description, Country), I'm using preg_match in PHP to parse the file but I got in trouble with some rows.

Below some rows of the csv file

"ASTA","Aerospace Technologies of Australia Pty Ltd (Australia)"
"ATAC"," American Tactical Aircraft Consultants (United States)"
"ATEC"," ATEC vos (Czech Republic)"
"ATG","Aviation Technology Group Inc (United States)"
"ATLAS","Atlas Aircraft Corporation of South Africa (Pty) Ltd (South Africa)"
"ATR","GIE Avions de Transport Régional (France/Italy)"
"AUSTER","Auster Aircraft Ltd (United Kingdom)"
"AUSTFLIGHT","Austflight ULA Pty Ltd (Australia)"
"AUSTRALIAN AEROSPACE","Australian Aerospace Pty Ltd (Australia)"
"AUSTRALITE","Australite Inc (United States)"
"AUTOGYRO","AutoGyro Europe GmbH (Germany)"
"AVANTAGE","OOO Samoletstroitelynyi Kompaniya Avantazh (Russia)"
"AVCRAFT","AvCraft Aviation LLC (United States)"
"AVEKO","Aveko sro (Czech Republic)"
"AVIA (1)","Azionari Vercellese Industrie Aeronautiche (Italy)"
"AVIA (2)","Avia-Zavody Jirího Dimitrova (Czech Republic)"

The PHP preg_match code is the following

preg_match('#^(.+?)\s\((.+?)\)$#',$string,$matches);

The code works fine with rows like the following one:

"ASSO AEREI","Asso Aerei Srl (Italy)"

In the example above I succesfully get the three datas into matches array...but with the following row

"ATLAS","Atlas Aircraft Corporation of South Africa (Pty) Ltd (South Africa)"

I get, as Company Description:

Atlas Aircraft Corporation of South Africa

and as Country:

Pty) Ltd (South Africa

They should be, instead:

Atlas Aircraft Corporation of South Africa (Pty) Ltd

and

South Africa

One more issue that is getting me crazy is: when the rows doesn't include a country, like the following row

"AERFER-AERMACCHI","see AERFER and AERMACCHI"

I get an empty Company description array.

Any help to fix the regex pattern? Yhanks a lot for any help

Federico
  • 319
  • 2
  • 14

4 Answers4

2
$csv = <<<'EOD'
"ASTA","Aerospace Technologies of Australia Pty Ltd (Australia)"
"ATAC"," American Tactical Aircraft Consultants (United States)"
"ATEC"," ATEC vos (Czech Republic)"
"ATG","Aviation Technology Group Inc (United States)"
"ATLAS","Atlas Aircraft Corporation of South Africa (Pty) Ltd (South Africa)"
"ATR","GIE Avions de Transport Régional (France/Italy)"
"AUSTER","Auster Aircraft Ltd (United Kingdom)"
"AUSTFLIGHT","Austflight ULA Pty Ltd (Australia)"
"AUSTRALIAN AEROSPACE","Australian Aerospace Pty Ltd (Australia)"
"AUSTRALITE","Australite Inc (United States)"
"AUTOGYRO","AutoGyro Europe GmbH (Germany)"
"AVANTAGE","OOO Samoletstroitelynyi Kompaniya Avantazh (Russia)"
"AVCRAFT","AvCraft Aviation LLC (United States)"
"AVEKO","Aveko sro (Czech Republic)"
"AVIA (1)","Azionari Vercellese Industrie Aeronautiche (Italy)"
"AVIA (2)","Avia-Zavody Jirího Dimitrova (Czech Republic)"
"AERFER-AERMACCHI","see AERFER and AERMACCHI"
EOD;

$url = 'data:text/plain,' . urlencode($csv);

if ( false !== $handle = fopen($url, "r") ) {
    while ( false !== $data = fgetcsv($handle) ) {
        if ( preg_match('~(\S.*?)(?|\h*\(([^)]*)\)|())\h*$~', $data[1], $m) )
            printf("%-70s\t%s\n", $m[1], $m[2]);

    }
}

demo

The pattern explained:

Two important things from your question:

  • the country can be optional
  • the description can also contains parenthesis

That is why I used a non-greedy quantifier for the description part (\S.*?). This way even if the country name is present the description subpattern is forced to stop at the opening parenthesis (only if this one is at the end of the string).

The \S at the begining is only here to trim the description part on the left. That's also why the pattern doesn't with the ^ anchor. One of the \h* will trim it on the right (due to the non-greedy quantifier too).

About the country part: instead of using an optional non capturing group like (?:\h*\(([^)]*)\))?, I choosed to use a branch reset group (?|... (...) ... | ... (...) ...) to ensure that the capture group 2 exists, even if the country isn't present. In this kind of group the capture groups have the same number in each branch:

(?|
    \h* \( ([^)]*) \) # the country name is present and captured in group 2
  |   # OR
    () # the capture group 2 contains an empty string
)
Casimir et Hippolyte
  • 88,009
  • 5
  • 94
  • 125
  • Thanks a lot Casimir, works like a charm. I have lot to learn about regex so your answer help me to better understand how regex works. I just got in trouble with a data in the CSV file that is not correct, The row is the following one: "SG AVIATION","SG Aviation (Italy))". As you can see there is two parentheses at the end of the field...but I can leave with it...I can manually edit the row....if there is no easy way to handle this exception. Thanks again – Federico Jun 23 '19 at 15:52
  • 1
    @Federico: What you can do to easily solve the problem is to add the `+` quantifier after the literal closing parenthesis (the one escaped) in the pattern. – Casimir et Hippolyte Jun 23 '19 at 16:12
  • Sorry for getting back to this discussion...but I get in trouble with another situation: in the country description code there could be a figure in parentheses...I need to keep it. For example in the following row: "SAAB-FAIRCHILD","see SAAB and FAIRCHILD (1)" How could I edit the pattern to keep the (1) in the string? Thanks a lot. By the way...regex are really a pain...even if interesting – Federico Aug 13 '19 at 13:49
  • fixed...using: ~(\S.*?)(?|\h*\(([^)0-9]*)\)+|())\h*$~ – Federico Aug 14 '19 at 09:16
  • 1
    @Federico: well done, you can also enforce the first character to be a letter: `~(\S.*?)(?|\h*\((\pL[^)]*)\)|())\h*$` – Casimir et Hippolyte Aug 15 '19 at 19:34
  • thanks a lot for the invaluable help and hints....I'm learning a lot reading your feedback. – Federico Aug 17 '19 at 07:06
1

It is better to use fgetcsv() function instead of preg_match.

$file = fopen("contacts.csv","r");
print_r(fgetcsv($file));
fclose($file);

You can find reference of this function here fgetcsv()

Nitesh Garg
  • 179
  • 3
  • but the data he is trying to extract is from inside one of the CSV fields so that doesn't seem to be the problem – Tomerikoo Jun 22 '19 at 15:40
  • You are right Tomerikoo. Sorry if my first post was not so clear...I already use fgetcsv to split each row data. – Federico Jun 28 '19 at 09:50
1

This regex captured all the options:

"/^(.*?)(\(([^(]*?)\))?$/"

I tried the following code:

$matches=array();
$re = "/^(.*?)(\(([^(]*?)\))?$/";
preg_match($re, $string, $matches);
foreach( $matches as $match ){
    echo $match."\n";
}

And when run with:

$string = "Atlas Aircraft Corporation of South Africa (Pty) Ltd (South Africa)";

Output is:

Atlas Aircraft Corporation of South Africa (Pty) Ltd (South Africa)
Atlas Aircraft Corporation of South Africa (Pty) Ltd 
(South Africa)
South Africa

When run with

$string = "see AERFER and AERMACCHI"

Output is:

see AERFER and AERMACCHI
see AERFER and AERMACCHI

So you get company description in $matches[1] and Country in $matches[3]

Tomerikoo
  • 18,379
  • 16
  • 47
  • 61
0

My guess is that this expression might work:

(.*)\s*\((.*?)\)|(.*)

which collects our desired data with () using

(.*)\s*\((.*?)\)

and others without, using

(.*)

Demo

Test

$re = '/(.*)\s*\((.*?)\)|(.*)/m';
$str = 'Aerospace Technologies of Australia Pty Ltd (Australia)
American Tactical Aircraft Consultants (United States)
ATEC vos (Czech Republic)
Aviation Technology Group Inc (United States)
Atlas Aircraft Corporation of South Africa (Pty) Ltd (South Africa)
GIE Avions de Transport Régional (France/Italy)
Auster Aircraft Ltd (United Kingdom)
Austflight ULA Pty Ltd (Australia)
Australian Aerospace Pty Ltd (Australia)
Australite Inc (United States)
AutoGyro Europe GmbH (Germany)
OOO Samoletstroitelynyi Kompaniya Avantazh (Russia)
AvCraft Aviation LLC (United States)
Aveko sro (Czech Republic)
Azionari Vercellese Industrie Aeronautiche (Italy)
Avia-Zavody Jirího Dimitrova (Czech Republic)
see AERFER and AERMACCHI';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
var_dump($matches);
Emma
  • 27,428
  • 11
  • 44
  • 69