Can't seem to get correct regex for sed command

Question

I have a CSV file in which I need to replace every occurrence of a double quote followed by a line feed with a string, e.g. "XXXX".

I've tried the following:

LC_CTYPE=C && LANG=C && sed 's/\"\n/XXXX/g' < input_file.csv > output_file.csv

and

LC_CTYPE=C && LANG=C && sed 's/\"\n\r/XXXX/g' < input_file.csv > output_file.csv

also tried

sed 's/\"\n\r/XXXX/g' < input_file.csv > output_file.csv

In each case, the command does not seem to recognize the specific combination of "\n (a double quote followed by a line feed) in the file.

It works if I look for just the double quote:

sed 's/\"/XXXX/g' < input_file.csv > output_file.csv

and if I look for just the line feed:

sed 's/\n\r/XXXX/g' < input_file.csv > output_file.csv

But no luck with the find-replace for the combined regex string

Any guidance would be most appreciated.

Adding simplified sample data

Sample input data (header row and two example records):

column1,column2
data,data<cr>
data,data"<cr>

Sample output:

column1,column2
data,data<cr>
data,dataXXXX

Update: Having some luck using perl commands in bash (MacOS) to get this done:

perl -pe 's/\"/XXXX/' input.csv > output1.csv

then

perl -pe 's/\n/YYYY/' output1.csv > output2.csv

This results in XXXXYYYY at the end of each record.

I'm sure there is an easier way, but this seems to be doing the trick on a test file I've been using. Trying it out there before I use on the original 200K-line csv file.

Please add sample input and your desired output for that sample input to your question. — Cyrus, Nov 28 '15 at 23:18
Is that really the briefest sample input/output you could come up with that demonstrates your problem? If so I suspect a lot of people won't bother trying to understand it all. Btw, `\n` is non-POSIX (and so non-portable) in sed - use a backslash followed by a literal newline instead. — Ed Morton, Nov 29 '15 at 00:29
@cho_joe : why are you doing `LC_CTYPE=C && LANG=C && sed ...`? The typical usage is `LC_CTYPE=C LANG=C sed ...` so that these env. variables are set only for the `sed` process on the same line. Good luck. — shellter, Nov 29 '15 at 00:39
@EdMorton: The shortest example would be just a row of data with a " and a newline at the end. I'd like to find the rows that have a " and newline at the end and replace with XXXX — cho_joe, Nov 29 '15 at 00:42
@shellter: I was getting an "illegal byte sequence" error earlier and found this thread: http://stackoverflow.com/questions/11287564/getting-sed-error-illegal-byte-sequence-in-bash — cho_joe, Nov 29 '15 at 00:45
@cho_joe : ok, I don't see the `L` vars chained together with `&&`s in that answer. You don't need it, and you're just leaving a maintenance issue for someone else to unnecessarily copy, OR puzzle over and search for "why did 'cho_joe' do it that way!?" . Good job researching your problem. Good luck. — shellter, Nov 29 '15 at 01:13
@cho_joe don't tell us about the shortest example in a comment. Edit your question to make your sample input and expected output the most concise possible example that captures all your requirements. If you don't fix your question soon everyone who might have been able to help you will have looked at it, decided you didn't put enough effort into making it as clear and simple as possible for us to understand and moved on to try to help someone else. — Ed Morton, Nov 29 '15 at 01:16
Are the CR characters part of the data — and a part of the problem? Do you simply need to convert from DOS (Windows) line endings to Unix line endings? Otherwise, why isn't it just `sed 's/"$/"XXX"/'`? — Jonathan Leffler, Nov 29 '15 at 02:55
@JonathanLeffler: sorry, no---the characters are not part of the data. that text is in the above example to represent the new line at end of each row of data in the file — cho_joe, Nov 29 '15 at 03:03
OK; on the whole, it is clearest just to leave the line breaks in the output unmarked. Given what I see, then my `sed` script suggestion can be simplified to: `sed 's/"$/XXXX/'` — I believe that does what you say you need done. — Jonathan Leffler, Nov 29 '15 at 03:06
@JonathanLeffler: thanks---I think changing only the occurrences of the " will replace " I do not want replaced. I want to replace only the occurrences of " that are paired with a new line — cho_joe, Nov 29 '15 at 03:20
The `$` after the `"` says 'double quote at end of line' (meaning 'double quote followed by newline'). So, unless there's something going on I've not spotted, I have to disagree with your assessment. Have you tried it? Which line did it change that it should not have done? (This is where the presence or absence of CR carriage return matters; things would be different if there were CR characters at the end of the line.) — Jonathan Leffler, Nov 29 '15 at 03:22
would this edit to your script address that need? sed 's/"$\n/XXXX/' — cho_joe, Nov 29 '15 at 03:22
@JonathanLeffler: sed 's/"$/XXXX/' replaced only the " where paired with the new line, but it did not replace the new line at same time (which is needed). I've had success using the perl script i appended to my original question. Thank you for your input! — cho_joe, Nov 29 '15 at 03:29
Oh, so you want to combine the line with a quote at the end with the data from the next line, with XXXX replacing the quote and the newline? That's not what your sample output shows, hence my confusion. That's doable in `sed` too: `sed '/"$/ { N; s/"\n/XXXX/; }'` (find a line ending with double quote; read the next line in and append it to the current line after a newline; substitute the double quote and newline by XXXX). If the last line in the file ends with a double quote, that line gets dropped - you were warned. — Jonathan Leffler, Nov 29 '15 at 03:33
What processing do you want with the line(s) following the `"` ? When the second lines are `fieldfullofreturns"` they all seem to be part of column 2. — Walter A, Nov 29 '15 at 14:22
@JonathanLeffler: Thanks. I agree my sample data was not clear regarding treatment of next row after XXXX. Apologies! I figured this out using a perl script and a few steps to get at the "s in the data and then the " combos. I really appreciate your input. — cho_joe, Nov 29 '15 at 22:25
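To summarize the comment thread: sed strips the trailing newline from each line before running the script, so a pattern containing \n has nothing to match until N pulls the next line into the pattern space. The sketch below shows both behaviours on a hypothetical four-line sample (the variable names are illustrative, and the quoted line is deliberately not the last one):

```shell
# Hypothetical sample: the quoted line is followed by another record.
sample='column1,column2
data,data
data,data"
next,row'

# The newline is removed before the script runs, so this matches nothing
# and the text comes out unchanged.
unchanged=$(printf '%s\n' "$sample" | sed 's/"\n/XXXX/')

# N appends the next input line, which puts a newline inside the
# pattern space where s/"\n/XXXX/ can see it.
joined=$(printf '%s\n' "$sample" | sed '/"$/ { N; s/"\n/XXXX/; }')
printf '%s\n' "$joined"
```

For this sample the last two records come out joined as data,dataXXXXnext,row. What happens when the final line of the file ends with a double quote depends on the sed implementation, as noted in the comment above.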

score 3 · Answer 1 · answered Nov 29 '15 at 05:18

sed is for simple substitutions on individual lines, that is all, so this is not a job for sed.

It sounds like this is what you want (uses GNU awk for multi-char RS):

$ awk -v RS='"\n' -v ORS='XXXX' '1' file
column1,column2
data,data
data,dataXXXX$

That final $ above is my prompt, demonstrating that both the " and the subsequent newline have been replaced.
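If GNU awk is not available, a similar result can probably be had without a multi-character RS. This is a sketch of one alternative, not the approach from the answer: it processes the file line by line and leaves out the newline whenever the line ended with a double quote (the variable name is illustrative):

```shell
# sub() returns 1 when the line ended with a double quote; in that case
# print the modified line without a newline, otherwise print it normally.
result=$(printf 'column1,column2\ndata,data\ndata,data"\n' |
  awk '{ if (sub(/"$/, "XXXX")) printf "%s", $0; else print }')
printf '%s\n' "$result"
```

Unlike the RS/ORS version, this does not append XXXX after a final record that lacks the closing quote, so the two can differ on files that do not end in "\n.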

score 1 · Casimir et Hippolyte · Answer 2 · 2015-11-28T23:47:39.850

You can try something like this:

sed ':a;/"\r\?$/{N;s/"\r\?\n\|"\r\?$/XXXX/;ba;}'

details:

:a                  # define the label "a"
/"\r\?$/            # condition: if the line ends with " then:
{
    N               # add the next line to the pattern space
    s/              # replace:
         "\r\?\n    # the " and the LF (or CRLF) 
      \|
         "\r\?$     # or a " at the end of the added line
                    # (this second alternative is only tested at the end
                    #  of the file)
     /XXXX/         # with XXXX
    ba              # go to label a
}
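As far as I know, `\?`, `\|`, `\r` and a semicolon directly after a label are GNU extensions, so the one-liner may not run as written on other sed implementations such as the one shipped with macOS. Below is a sketch of the same loop split across several -e options, assuming plain LF line endings (the CR handling is dropped, and the sample data is hypothetical):

```shell
# Two consecutive lines end in a double quote; the loop joins both.
result=$(printf 'a,b"\nc,d"\ne,f\n' |
  sed -e ':a' -e '/"$/{' -e 'N' -e 's/"\n/XXXX/' -e 'ba' -e '}')
printf '%s\n' "$result"
```

For this input the three lines collapse into a,bXXXXc,dXXXXe,f. If the very last line of the file ends with a double quote, N has no further line to read and the outcome depends on the sed implementation, so that case is worth testing separately.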

edited Nov 28 '15 at 23:47

answered Nov 28 '15 at 23:22


thank you, but your first suggestion did not find the "\n string in my file – cho_joe Nov 28 '15 at 23:34
@user1638755: perhaps your file uses a windows format, in this case use `\r\?\n` instead of `\n`. – Casimir et Hippolyte Nov 28 '15 at 23:38
thx---the file was "macroman" -- changed it to utf-8 and tried both versions without success – cho_joe Nov 28 '15 at 23:44
@CasimiretHippolyte has a good suggestion. A sanity check would be to look at the CSV file in VI or something else that shows non-printable characters. – cowboydan Nov 28 '15 at 23:44
@cowboydan -- thank you -- i'm viewing the csv file in textmate which does show hidden characters -- definitely willing to try VI if that might help to see if anything else is in the file that needs to be added to the regex string – cho_joe Nov 29 '15 at 00:00
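The sanity check suggested above can also be done from the shell with od -c, which prints each byte and shows CR as \r and LF as \n. The sketch below uses hypothetical one-record samples for the three common line-ending styles. Since the file was reported as "macroman", it may be worth checking for CR-only endings (the classic Mac OS convention), in which case sed would see the whole file as a single line; that is an assumption about this file, not something the thread confirms.

```shell
# Hypothetical samples: Unix (LF), Windows (CRLF), classic Mac (CR only).
unix=$(printf 'data,data"\n'   | od -c)
dos=$(printf 'data,data"\r\n' | od -c)
mac=$(printf 'data,data"\r'   | od -c)
printf '%s\n' "$unix" "$dos" "$mac"

# If the real file turns out to use CR-only endings, translating them
# to LF first lets line-oriented tools such as sed see separate lines.
converted=$(printf 'a,b"\rc,d\r' | tr '\r' '\n')
printf '%s\n' "$converted"
```

On the real file, something like `od -c input_file.csv | head` should be enough to see which ending is in use.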
