Regular expression to remove commas after the first

Question

I have a file that looks like:

16262|John, Doe|John|Doe|JD|etc...

I need to find lines such as:

16262|John, Doe, Dae|John|Doe Dae|JD|etc...

and replace them with:

16262|John, Doe Dae|John|Doe Dae|JD|etc...

In summary, in the second field I want to remove every comma after the first one (there may be more than one extra comma).

Any suggestions?

Similar examples: [Example of VB.NET - Question 12116586](http://stackoverflow.com/questions/12116586/replace-all-but-last-instance-of-specified-character), [JavaScript example](http://stackoverflow.com/questions/7959975/how-to-replace-all-but-the-first-occurence-of-a-pattern-in-string) — zedfoxus, May 15 '15 at 01:49
I tried something like this, but it only finds the second comma. I want to do the same for all possible occurrences: (\w, [A-Z]\w+,) – chan go, May 15 '15 at 01:53
This is on Linux, so suggestions using bash, awk, sed, or perl are welcome. – chan go, May 15 '15 at 01:54

Casimir et Hippolyte · Accepted Answer · 2015-05-15T02:56:14.067

2

With gnu sed:

BRE syntax:

sed 's/\(\(^\||\)[^|,]*,\) \?\|, \?/\1 /g;'

ERE syntax:

sed -r 's/((^|\|)[^|,]*,) ?|, ?/\1 /g;'

details:

(          # group 1: the beginning of an item, up to and including its first comma
    (      # group 2:
        ^  # start of the line
      |    # OR
        \| # delimiter
    )
    [^|,]* # start of the item until | or ,
    ,      # the first comma
)          # close the capture group 1
[ ]?       # optional space
|          # OR
,          # another comma
[ ]?       # optional space

When the first branch succeeds, group 1 captures the first comma together with the whole beginning of the item. Since the replacement string contains a reference to capture group 1 (\1), the first comma stays unchanged.

When the second branch succeeds, group 1 is not defined and the reference \1 in the replacement string is an empty string. This is why the other commas are removed: each one is replaced by just the single space that follows \1 in the replacement.
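To see both branches at work, here is a minimal sketch that wraps the ERE command in a shell function and runs it on the sample line from the question. The function name `strip_extra_commas` is only an illustrative assumption, and `-E` is used in place of `-r` on the assumption that your sed accepts it (recent GNU sed and BSD sed both do).

```shell
#!/bin/sh
# Hypothetical helper (the name is not from the answer): keeps the first comma
# of each |-delimited item and replaces every later comma, plus an optional
# following space, with a single space.
strip_extra_commas() {
    sed -E 's/((^|\|)[^|,]*,) ?|, ?/\1 /g'
}

# First branch matches "|John, " and is put back via \1;
# second branch matches the ", " before "Dae" and leaves only a space.
printf '%s\n' '16262|John, Doe, Dae|John|Doe Dae|JD|etc...' | strip_extra_commas
# prints: 16262|John, Doe Dae|John|Doe Dae|JD|etc...
```

Note that the pattern is applied to every item on the line, not only to the second field.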

edited May 15 '15 at 02:56

answered May 15 '15 at 02:07


Looks very good, thanks. How can I run this efficiently on a text file with more than 2.3 million lines? – chan go May 15 '15 at 02:14
@chango: sed is a stream editor, so it should be able to process a file in place whatever its size. However, I suggest you test it first on a large block to see what happens. To process a file in place: `sed -i.bak 's/...../g' file.txt` (`.bak` is the extension used to back up the original file; you can remove it once you are sure) – Casimir et Hippolyte May 15 '15 at 02:27
@chango: Works for me, paste this into your terminal: `echo '8202|John, Doe, Antonio|Doe Antonio|John|' | sed -r 's/((^|\|)[^|,]*,) ?|, ?/\1 /g;'` (if needed, replace `-r` with `-E` or test the BRE version) – Casimir et Hippolyte May 15 '15 at 02:46
Awesome, thanks. sed on OS X is the BSD version and doesn't have -r. I tried it on Linux and it works perfectly! – chan go May 15 '15 at 02:48

score 0 · Answer 2 · answered May 15 '15 at 01:48

This depends strongly on the language. If your regex engine supports variable-length lookbehind, you can do this with the regular expression (?<=,.*),. If it doesn't, for example in JavaScript engines without lookbehind support, you might still be able to use lookahead if you can reverse a string:

String.prototype.reverse = function () {
    return this.split("").reverse().join("");
};
"a, b, c, d".reverse().replace(/,(?=.*,)/g, '').reverse()
// yields "a, b c d"

I don't think regex has any other feature close enough to lookaround to simulate it easily. Alternatively, you can use a more powerful language to capture the index of the first comma, replace all commas, and then reinsert the first comma.
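Since the asker is on Linux, the index-based idea can be sketched with awk. This is a variation rather than the exact recipe above: instead of removing every comma and reinserting the first, it splits the field at the first comma and only cleans up the part after it. It also assumes that only the second |-delimited field should change, as the question describes; the function name `fix_second_field` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: in field 2 only, keep the first comma and turn each
# later comma (plus an optional following space) into a single space.
fix_second_field() {
    awk 'BEGIN { FS = OFS = "|" }
    {
        i = index($2, ",")               # position of the first comma, 0 if none
        if (i > 0) {
            head = substr($2, 1, i)      # text up to and including the first comma
            tail = substr($2, i + 1)     # everything after the first comma
            gsub(/, ?/, " ", tail)       # clean the remaining commas
            $2 = head tail
        }
        print
    }'
}

printf '%s\n' '16262|John, Doe, Dae|John|Doe Dae|JD|etc...' | fix_second_field
# prints: 16262|John, Doe Dae|John|Doe Dae|JD|etc...
```

awk reads its input one line at a time, so something like `fix_second_field < input.txt > output.txt` (file names are placeholders) should not need to hold a large file in memory.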

What do you recommend for doing the conversion on a very big text file? It has more than 2.3 million lines. – chan go, May 15 '15 at 02:15
