
I have a string with some special characters. The aim is to retrieve a String[] of the comma-separated values. The double quote (") is a special character: between quotes, a value may contain \n and ,.

For example, the main string:
Alpha,Beta,Gama,"23-5-2013,TOM",TOTO,"Julie, KameL
Titi",God," timmy, tomy,tony,
tini".

You can see that there are \n characters inside the "".

Can anyone help me parse this?

Thanks

More explanation

From the main string I need to separate out these values:

Alpha
Beta
Gama
23-5-2013,TOM
TOTO
Julie,KameL,Titi
God
timmy, tomy,tony,tini

The problem: in Julie,KameL,Titi there is a line break (\n) between KameL and Titi. Similarly, in timmy, tomy,tony,tini there is a line break (\n) between tony and tini.


Now this text is in a file (reading it line by line is compulsory):

Alpha,Beta Charli,Delta,Delta Echo ,Frank George,Henry
1234-5,"Ida, John
 ", 25/11/1964, 15/12/1964,"40,000,000.00",0.0975,2,"King, Lincoln 
 ",Mary / New York,123456
12543-01,"Ocean, Peter

Output (I want to remove the lines that contain only "):

Alpha
Beta Charli
Delta
Delta Echo
Frank George
Henry
1234-5
Ida
John
"
25/11/1964
15/12/1964
40,000,000.00
0.0975
2
King
Lincoln
"
Mary / New York
123456
12543-01
Ocean
Peter
GameBuilder
  • So "Ida" and "John" should appear on two different lines? I thought you need: "1234-5" on one line. Then "Ida, John" on the second line and "25/11/1964" on the third (without the quotes) because the quotes embrace Ida and John into one single string. – 1000ml May 18 '13 at 22:39

4 Answers


Parsing CSV is a whole lot harder than one would imagine at first sight, which is why your best option is to use a well-designed and tested library to do that work for you. Two such libraries are opencsv and supercsv, and there are many others. Have a look at both and use the one that best fits your requirements and style.

fvu

Description

Consider the following PowerShell example of a universal regex, tested on a Java parser, which requires no extra processing to reassemble the data parts. The first capture group matches an optional opening quote, and the backreference \1 then requires the same quote at the end of the match, so you're assured of capturing the entire value between, but not including, the quotes. Commas are not captured unless they are embedded in a quote-delimited substring.

(?:^|,\s{0,})(["]?)\s{0,}((?:.|\n|\r)*?)\1(?=[,]\s{0,}|$)

Example

$Matches = @()
$String = 'Alpha,Beta,Gama,"23-5-2013,TOM",TOTO,"Julie, KameL\n
Titi",God,"timmy, \n
tomy,tony,tini"'
$Regex = '(?:^|,\s{0,})(["]?)\s{0,}((?:.|\n|\r)*?)\1(?=[,]\s{0,}|$)'

Write-Host start with 
write-host $String
Write-Host
Write-Host found
([regex]"(?i)(?m)$Regex").matches($String) | foreach {
    write-host "key at $($_.Groups[1].Index) = '$($_.Groups[1].Value)'`t= value at $($_.Groups[2].Index) = '$($_.Groups[2].Value)'"
    } # next match

Yields

start with
Alpha,Beta,Gama,"23-5-2013,TOM",TOTO,"Julie, KameL\n
Titi",God,"timmy, \n
tomy,tony,tini"

found
key at 0 = ''   = value at 0 = 'Alpha'
key at 6 = ''   = value at 6 = 'Beta'
key at 11 = ''  = value at 11 = 'Gama'
key at 16 = '"' = value at 17 = '23-5-2013,TOM'
key at 32 = ''  = value at 32 = 'TOTO'
key at 37 = '"' = value at 38 = 'Julie, KameL\n
Titi'
key at 60 = ''  = value at 60 = 'God'
key at 64 = '"' = value at 65 = 'timmy, \n
tomy,tony,tini'

Summary

[Image: diagram of the regular expression]

  • (?: start non capture group
  • ^ require start of string
  • | or
  • ,\s{0,} a comma followed by any number of white space
  • ) close the non capture group
  • ( start capture group 1
  • ["]? consume a quote if it exists; I like doing it this way in case you want to include characters other than a quote
  • ) close capture group 1
  • \s{0,} consume any spaces if they exist, this means you don't need to trim the value later
  • ( start capture group 2
  • (?:.|\n|\r)*? capture all characters including a new line, non greedy
  • ) close capture group 2
  • \1 if there was a quote it would be stored in group 1, so if there was one then require it here
  • (?= start zero-width lookahead assertion
  • [,]\s{0,} must have a comma followed by optional whitespace
  • | or
  • $ end of the string
  • ) close the zero-width lookahead assertion
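
In Java, the same expression can be driven with Matcher.find(), taking group 2 of each match as the value. The following is a minimal sketch under a few assumptions: the class name QuotedFieldRegexDemo and the helper splitFields are made up for illustration, the pattern is compiled without the (?i)(?m) flags used in the PowerShell call so that ^ and $ anchor to the whole input, and escaped quotes inside values are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedFieldRegexDemo {
    // Same expression as above, escaped for a Java string literal.
    private static final Pattern FIELD = Pattern.compile(
            "(?:^|,\\s{0,})([\"]?)\\s{0,}((?:.|\\n|\\r)*?)\\1(?=[,]\\s{0,}|$)");

    // Hypothetical helper: collects group 2 (the value without quotes) of every match.
    static List<String> splitFields(String input) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            fields.add(m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        String source = "Alpha,Beta,Gama,\"23-5-2013,TOM\",TOTO,\"Julie, KameL\n"
                + "Titi\",God,\" timmy, tomy,tony,\n"
                + "tini\"";
        // Show embedded line breaks as a visible \n so each value prints on one line.
        for (String field : splitFields(source)) {
            System.out.println("[" + field.replace("\n", "\\n") + "]");
        }
    }
}
```

Because the value group is lazy and alternates over single characters, very long inputs may backtrack heavily, so this is probably best kept to short records.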
Ro Yo Mi
  • How did you generate that diagram? – mattalxndr Sep 17 '15 at 23:17
  • there are a few mistakes in this: the `\s{0,}` in your lookahead is meaningless, you are consuming leading spaces inside a quoted string, and other inconsistent handling of whitespaces. but the worst problem is no consideration of escaped backslashes... which is why most developers would advise using a csv parser instead of a regex. – Patrick Parker Aug 27 '18 at 13:18
  • I'm not going to debate the merits of using Regex for parsing CSV. Requesters generally have their reasons for using Regex over other more capable parsing solutions. I see you also posted a Regex based solution, so your feeling here is moot. The `{0,}` construct says zero or more times and is replicated by the `*`. Including support for escaped backslashes was not required by OP. – Ro Yo Mi Aug 28 '18 at 16:43
  • I am aware of what the `{0,}` construct means, but my point was that including optional trailing space for a lookahead does nothing in particular, so it was extraneous. The comment about moot feelings seems a little off topic. For what it's worth, I posted what I consider a "decent" regex, just to save people searching for one, but I still agree with [the answer by fvu](https://stackoverflow.com/a/16533149/7098259). I'll update my answer to make that clearer. – Patrick Parker Aug 29 '18 at 17:05

Try this:

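import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sample input: quoted values may contain commas and line breaks.
String source = "Alpha,Beta,Gama,\"23-5-2013,TOM\",TOTO,\"Julie, KameL\n"
              + "Titi\",God,\" timmy, tomy,tony,\n"
              + "tini\".";

// Group 2: unquoted value. Group 3: content of a quoted value. A trailing comma is optional.
Pattern p = Pattern.compile("(([^\"][^,]*)|\"([^\"]*)\"),?");
Matcher m = p.matcher(source);

while (m.find())
{
    // Print whichever group matched, with embedded line breaks removed.
    if (m.group(2) != null)
        System.out.println( m.group(2).replace("\n", "") );
    else if (m.group(3) != null)
        System.out.println( m.group(3).replace("\n", "") );
}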

If it matches a string without quotes, the result is returned in group 2. Strings with quotes are returned in group 3, hence the distinction in the while block. You might find a prettier way.

Output:
Alpha
Beta
Gama
23-5-2013,TOM
TOTO
Julie, KameLTiti
God
timmy, tomy,tony,tini
.
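
The pattern above needs the whole record in one string. If the file has to be read line by line, as the edit to the question requires, one option is to keep joining physical lines until the double quotes balance, and only then run the matcher on the joined record. A minimal sketch, assuming that every double quote in the file either opens or closes a quoted value; LogicalRecordReader and readRecords are illustrative names, not from any library:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LogicalRecordReader {
    // Joins physical lines into logical records: a record is complete
    // once the number of double quotes seen so far is even.
    static List<String> readRecords(BufferedReader reader) {
        List<String> records = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        boolean insideQuotes = false;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (insideQuotes) {
                    pending.append('\n'); // restore the line break that readLine() dropped
                }
                pending.append(line);
                for (int i = 0; i < line.length(); i++) {
                    if (line.charAt(i) == '"') {
                        insideQuotes = !insideQuotes;
                    }
                }
                if (!insideQuotes) {
                    records.add(pending.toString());
                    pending.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (pending.length() > 0) {
            records.add(pending.toString()); // unterminated quote at end of input
        }
        return records;
    }

    public static void main(String[] args) {
        String file = "a,b\n1,\"x, y\n z\",2\n3,4\n";
        for (String record : readRecords(new BufferedReader(new StringReader(file)))) {
            System.out.println("record: " + record.replace("\n", "\\n"));
        }
    }
}
```

Each returned record can then be passed as the source string to the loop above, which already removes the embedded line breaks.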

1000ml
  • Yes, this works. Can you explain `("(([^\"][^,]*)|\"([^\"]*)\"),?")` to me? – GameBuilder May 14 '13 at 11:44
  • This works fine for the given input. However, the input is in a file and I have to read the file line by line (this is compulsory). I read the file line by line using BufferedReader and treat each line as the source in your code. Because of the line-by-line reading I could not remove the ". I have attached the input and output in the Edit section. – GameBuilder May 14 '13 at 17:16
  • note: `[^\"]` could match a comma. if any of the entries are empty string, then this regexp will fail. – Patrick Parker Aug 27 '18 at 13:25

See this related answer for a decent Java-compatible regex for parsing CSV.

It recognizes:

  • Newlines (after values or inside quoted values)
  • Quoted values containing escaped double-quotes like ""this""

In short, you will use this pattern: (?:,|\n|^)("(?:(?:"")*[^"]*)*"|[^",\n]*|(?:\n|$))

Then collect each Matcher group(1) in a find() loop.
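
A minimal sketch of that find() loop follows. The class name CsvGroupOneDemo and the helper parse are illustrative only. Note that with this pattern group(1) still includes the surrounding quotes of a quoted value, so the sketch strips them and collapses "" to " afterwards; that post-processing step is an assumption added here and is not part of the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsvGroupOneDemo {
    // The pattern quoted above, escaped for a Java string literal.
    private static final Pattern VALUE = Pattern.compile(
            "(?:,|\\n|^)(\"(?:(?:\"\")*[^\"]*)*\"|[^\",\\n]*|(?:\\n|$))");

    // Hypothetical helper: collects group(1) of every match and unquotes it.
    static List<String> parse(String input) {
        List<String> values = new ArrayList<>();
        Matcher m = VALUE.matcher(input);
        while (m.find()) {
            String raw = m.group(1);
            // Quoted values arrive with their outer quotes: remove them and unescape "".
            if (raw.length() >= 2 && raw.startsWith("\"") && raw.endsWith("\"")) {
                raw = raw.substring(1, raw.length() - 1).replace("\"\"", "\"");
            }
            values.add(raw);
        }
        return values;
    }

    public static void main(String[] args) {
        for (String value : parse("a,\"b \"\"c\"\" d\",e")) {
            System.out.println("[" + value + "]");
        }
    }
}
```

The sketch treats the input as a single record; when the input contains several rows, the row boundaries are not preserved in the returned list.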


Note: Although I have posted this answer about a "decent" regex I discovered, just to save people searching for one, it is by no means robust. I still agree with this answer by user "fvu": a CSV parser is preferable.

Patrick Parker
  • This doesn't appear to work correctly with escaped double-double quotes, such as strings like `"""alpha""", "beta,""charlie"",delta"`. See also [this link](https://www.regexplanet.com/share/index.html?share=yyyyy3bxxyr) and click the Java button. – Ro Yo Mi Aug 29 '18 at 15:02
  • @RoYoMi for the example you posted, it fails because there is a space between `,` and `"beta`. I'm not sure if that should be considered a valid input or not. Sure, it would be nice to make the regex more robust, but if you are using a regex then you had better have well-formed input. – Patrick Parker Aug 29 '18 at 16:59