CSV file pattern matching matches separators also

Question

I am using the regex (?:^|;)\s*(?:(?:(?=")"([^"].*?)")|(?:(?!")(.*?)))(?=;|$), which is slightly adapted from a solution to another question.

My problem is that it also matches the separating semicolon, which I then have to remove manually. That is bad style.

String separator = ";";
String patternString = "(?:^|" + separator + ")\\s*(?:(?:(?=\")\"([^\"].*?)\")|(?:(?!\")(.*?)))(?=" + separator + "|$)";
pattern = Pattern.compile(patternString);

Matcher match = pattern.matcher(line);
// Find all cells and add them to row.
while (match.find())
{
    String cell = match.group();
    // HACK: something is wrong with the pattern to match
    if (cell.indexOf(";") == 0) cell = cell.substring(1);
    cell = unescapeCsv(cell);
    // do something with cell
}

I am trying to match the data below. The pattern has to handle columns that are quoted and contain semicolons as part of the data (see the last columns):

Name;inst_type;Position;Currency;cftype;amount;tenor;fwterm;compoundtype;resetterm;histrates;sdate;edate;fwdate;callput;capfloor;strike;vola;volavalue;ulspot;ulspotvalue;divcurve;cleanflag;fixrate;compfr;dayc;refindex;spread;floatfactor;paystart;payend;annuity;amortizingtype;cgm;resrate;nextrate;isprorated;rolldate;rollday;islongstub;isarrear;payrule;paydays;paycal;resrule;resdays;rescal;isfixcoupon;spreadcurve;cds_spread_value;recovrate;payoutrate;payrec;creditspread;disc;"C""F"
Bond1;;1;EUR;2;100;6M;;;;;01.09.2007;01.09.2010;;;;;;;;;;;0,0625;1;1;;;;0;100;;;1;;;1;01.06.2008;;1;;;;;;;;;;;;;;;IR-EUR;"2;01062008;100;0;0.0625;;01092007;01062008";"2;01122008;100;0;0.0625;;01062008;01122008";"2;01062009;100;0;0.0625;;01122008;01062009";"2;01122009;100;0;0.0625;;01062009;01122009";"2;01092010;100;0;0.0625;;01122009;01092010";"1;01092010;100";
Bond2;;1;EUR;2;100;6M;;;;;01.09.2007;01.09.2010;;;;;;;;;;;0,0625;1;1;;;;0;100;;;-1;;;1;;;;;;;;;;;;;;;;;;IR-EUR;"2;01092010;100;0;0.0625;;01032010;01092010";"2;01032010;100;0;0.0625;;01092009;01032010";"2;01092009;100;0;0.0625;;01032009;01092009";"2;01032009;100;0;0.0625;;01092008;01032009";"2;01092008;100;0;0.0625;;01032008;01092008";"2;01032008;100;0;0.0625;;01092007;01032008";"1;01092010;100";

Why are you doing this to yourself? Just use a CSV parsing library. Or write a simple tokenizer which doesn't use regex. — Andy Turner, May 19 '17 at 09:49
You used a capturing group in your regex. All you need to do now is replace `String cell = match.group()` with `String cell = match.group(1)`. This will give you group 1 instead of the full match — Gawil, May 19 '17 at 10:05
Well I just saw you had 2 groups, so it will be a bit more difficult... — Gawil, May 19 '17 at 10:16
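
To illustrate the two-group point from the comments: in the asker's pattern, group 1 captures a quoted cell (without the enclosing quotes) and group 2 captures an unquoted cell, and only one of them takes part in any given match. A minimal sketch, assuming `;` as the separator; the class and method names are hypothetical and not from the original post:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoGroupCsvDemo {
    // The asker's pattern with ';' as separator.
    // Group 1: content of a quoted cell, group 2: an unquoted cell.
    private static final Pattern CELL = Pattern.compile(
        "(?:^|;)\\s*(?:(?:(?=\")\"([^\"].*?)\")|(?:(?!\")(.*?)))(?=;|$)");

    static List<String> splitCells(String line) {
        List<String> cells = new ArrayList<>();
        Matcher m = CELL.matcher(line);
        while (m.find()) {
            // A group that did not participate in the match is null.
            String quoted = m.group(1);
            cells.add(quoted != null ? quoted : m.group(2));
        }
        return cells;
    }

    public static void main(String[] args) {
        for (String cell : splitCells("x;\"2;01\";y")) {
            System.out.println("[" + cell + "]");
        }
    }
}
```

This avoids stripping the semicolon by hand, but the separator is still consumed as part of each match. As far as I can tell, that means a line that starts with an empty cell (for example `;a`) yields only that empty cell, so treat this as an illustration of the group selection rather than a complete fix.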

Stephane Janicaud · Accepted Answer · 2017-05-19T11:01:57.660

2

Try this one, it should work: (?<=^|;)".*?"(?=;|$)|(?<=^|;)[^;]*(?=;|$)

Explanation

(?<=^|;)".*?"(?=;|$) A quoted cell: any character, zero or more times (as few as possible), between double quotes, preceded by the start anchor or a semicolon and followed by a semicolon or the end anchor

| OR

(?<=^|;)[^;]*(?=;|$) An unquoted cell: any character except a semicolon, zero or more times, preceded by the start anchor or a semicolon and followed by a semicolon or the end anchor

Using * instead of + is important in order to match empty columns.
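
Here is a minimal Java sketch of how the pattern could be used. Because the separators sit in lookbehind and lookahead assertions, `match.group()` returns only the cell. The class and method names are made up for illustration, and the separator is assumed to be `;`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookbehindCsvDemo {
    // Quoted cell first, unquoted cell second; separators are only asserted, never consumed.
    private static final Pattern CELL = Pattern.compile(
        "(?<=^|;)\".*?\"(?=;|$)|(?<=^|;)[^;]*(?=;|$)");

    static List<String> splitCells(String line) {
        List<String> cells = new ArrayList<>();
        Matcher m = CELL.matcher(line);
        while (m.find()) {
            // The full match is the cell itself, so no semicolon has to be stripped.
            cells.add(m.group());
        }
        return cells;
    }

    public static void main(String[] args) {
        for (String cell : splitCells("Bond1;;1;\"2;01062008;100\";\"C\"\"F\"")) {
            System.out.println("[" + cell + "]");
        }
    }
}
```

Quoted cells come back with their enclosing quotes and doubled quotes intact, so they still need to be unescaped afterwards. One limitation to be aware of: a quoted cell that contains an escaped quote directly followed by a semicolon (such as `"a"";b"`) is cut at that point, because the lazy `.*?` stops at the first quote that is followed by a separator.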

edited May 19 '17 at 11:01

answered May 19 '17 at 10:55

There is still a problem, `"3;01022007;100;0;;IR-EUR;01082006;01022007;01082006;6M;0;1";"3;01022007;100;0;;IR-EUR;01082006;01022007;01082006;6M;0;2";"3;01022007;100;0;;IR-EUR;01082006;01022007;01082006;6M;0;3"` only the first of these is matched – Adder May 19 '17 at 11:52
Cut and paste your string into the regex101 editor; it contains hidden characters that break the pattern: https://regex101.com/r/SSLZRN/1/ – Stephane Janicaud May 19 '17 at 12:12
Thanks it works - the error is elsewhere in the data representation – Adder May 19 '17 at 12:39

score 0 · Answer 2 · answered May 19 '17 at 10:05

Just use a normal CSV parser.
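
No specific library is named here. Where adding a dependency is not an option, the non-regex tokenizer suggested in the question comments might look roughly like the sketch below. It is a hand-written illustration, not a replacement for a tested CSV library: it assumes one record per line, double quotes as the quote character, and a doubled quote as the escape for a literal quote. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class SemicolonCsvTokenizer {
    // Splits one line into cells, honouring quoted cells that contain the separator.
    static List<String> split(String line, char separator) {
        List<String> cells = new ArrayList<>();
        StringBuilder cell = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        cell.append('"'); // doubled quote stands for one literal quote
                        i++;
                    } else {
                        inQuotes = false; // closing quote
                    }
                } else {
                    cell.append(c); // separators inside quotes are plain data
                }
            } else if (c == '"') {
                inQuotes = true; // opening quote
            } else if (c == separator) {
                cells.add(cell.toString());
                cell.setLength(0);
            } else {
                cell.append(c);
            }
        }
        cells.add(cell.toString()); // last cell, possibly empty
        return cells;
    }

    public static void main(String[] args) {
        for (String cell : split("Bond1;;1;\"2;01062008;100\";\"C\"\"F\"", ';')) {
            System.out.println("[" + cell + "]");
        }
    }
}
```

Unlike the regex approaches, this returns cells already unquoted and unescaped, so no separate unescaping step is needed afterwards.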

grundic
