Delimiter substitution excluding character inside string column

Question

I have got a 3Gb file in this format:

...
201211 001093223359 "PLANO ESPECIAL" "PLANO NOVO"
201211 001199175239 "PLANO ESPECIAL" "PLANO NOVO"
201211 001292676219 "PLANO ESPECIAL" "PLANO NOVO"
...

I need to change its format to the one presented below:

...
201211;001093223359;"PLANO ESPECIAL";"PLANO NOVO"
201211;001199175239;"PLANO ESPECIAL";"PLANO NOVO"
201211;001292676219;"PLANO ESPECIAL";"PLANO NOVO"
...

First 5 lines exactly as they are in the input file:

"Mes_Referencia" "Num_Telefone" "Dsc_Plano_Tarifario" "Grupo Plano"
201211 2183223350 "INFINITY PR?" "PLANO INFINITY"
201211 2169175232 "INFINITY PR?" "PLANO INFINITY"
201211 2182676211 "INFINITY PR?" "PLANO INFINITY"
201211 2281699337 "INFINITY PR?" "PLANO INFINITY"
201211 2179173096 "INFINITY PR?" "PLANO INFINITY"

Comment: The ? in "INFINITY PR?" is because it is "É" (it is in Portuguese - Brazil).

How can I change de delimiter " " (space) into ";" without changing the spaces within the strings in the last two columns?

Thanks in advance!

I don't know how to include break lines. The date 201211 is the first field and the string PLANO NOVO is the last field of each record. — Lucas Rezende, Feb 22 '13 at 18:49

Mikhail Vladimirov · Answer 1 · 2013-02-22T19:11:23.900

2

Filter your file through the following command:

sed -E -e 's/ ([^ "]*|"[^"]*")/;\1/g'

This command assumes that first column is not quoted. If it could be, regular expression will be slightly more complicated.

Sample input:

201211 2183223350 "INFINITY PRE" "PLANO INFINITY"
201211 2182067250 "ASS. PLANO NOSSO MODO-G11" "OUTROS"
201211 8199712912 "TIM LIBERTY CONTROLE" "PLANO LIBERTY"

Sample output:

201211;2183223350;"INFINITY PRE";"PLANO INFINITY"
201211;2182067250;"ASS. PLANO NOSSO MODO-G11";"OUTROS"
201211;8199712912;"TIM LIBERTY CONTROLE";"PLANO LIBERTY"

edited Feb 22 '13 at 19:11

answered Feb 22 '13 at 18:59

Mikhail Vladimirov

13,572
1
38
40

Almost... it was like: `201211 2183223350 "INFINITY PRE" "PLANO INFINITY" 201211 2182067250 "ASS. PLANO NOSSO MODO-G11" "OUTROS" 201211 8199712912 "TIM LIBERTY CONTROLE" "PLANO LIBERTY"` And became: `201211;2183223350;"INFINITY;PRE";"PLANO INFINITY" 201211;2182067250;"ASS. PLANO NOSSO MODO-G11";"OUTROS" 201211;8199712912;"TIM LIBERTY CONTROLE";"PLANO LIBERTY"` – Lucas Rezende Feb 22 '13 at 19:08
@LucasRezende Works fine for me. See example output. – Mikhail Vladimirov Feb 22 '13 at 19:14
@LucasRezende Mikhail's version produces the correct sample output on my end. – Matt LaFave Feb 22 '13 at 19:19
really? that was the full command I used to make the test: `head -30 Liberty_Infinity.txt | sed -E -e 's/ ([^ "]*|"[^"]*")/;\1/g' > TESTE_LIBERTY.txt` – Lucas Rezende Feb 22 '13 at 19:19
`head -5 Liberty_Infinity.txt` 201211 2183223350 "INFINITY PR?" "PLANO INFINITY" 201211 2169175232 "INFINITY PR?" "PLANO INFINITY" 201211 2182676211 "INFINITY PR?" "PLANO INFINITY" 201211 2281699337 "INFINITY PR?" "PLANO INFINITY" – Lucas Rezende Feb 22 '13 at 19:21
`head -5 Liberty_Infinity.txt | sed -E -e 's/ ([^ "]*|"[^"]*")/;\1/g'` 201211;2183223350;"INFINITY;PR?";"PLANO INFINITY" 201211;2169175232;"INFINITY;PR?";"PLANO INFINITY" 201211;2182676211;"INFINITY;PR?";"PLANO INFINITY" 201211;2281699337;"INFINITY;PR?";"PLANO INFINITY" – Lucas Rezende Feb 22 '13 at 19:22
There is a header in the file: "Mes_Referencia" "Num_Telefone" "Dsc_Plano_Tarifario" "Grupo Plano" May this influence on the output format? – Lucas Rezende Feb 22 '13 at 19:26
Could you please add first 5 lines of your input file into your original question exactly as they are including header? – Mikhail Vladimirov Feb 22 '13 at 19:31
@LucasRezende Oh! Looks like `sed` incorrectly understands encoding of your file cause it probably is not 8-bit. Do you know what is the encoding? Maybe this will help: http://stackoverflow.com/questions/67410/why-does-sed-fail-with-international-characters-and-how-to-fix – Mikhail Vladimirov Feb 22 '13 at 19:41
Hummm... perhaps! i don't know how to check it. My unix is Bash but I don't know how to check it with iconv. ENCA doesn't work in bash. – Lucas Rezende Feb 22 '13 at 19:53
Got it. File is ISO-8859. So I need to convert it to UTF-8, right? – Lucas Rezende Feb 22 '13 at 19:55
Well, seems that worked. First I had to convert using: `iconv -f ISO-8859-1 -t UTF-8 < Liberty_Infinity.txt > Liberty_Infinity_UTF.txt` and then `cat Liberty_Infinity_UTF.txt | sed -E -e 's/ ([^ "]*|"[^"]*")/;\1/g'`! Thanks a LOT for your help, patience and time! (Y) – Lucas Rezende Feb 22 '13 at 20:06

Ed Morton · Accepted Answer · 2013-02-22T22:00:14.447

2

This will work no matter how many lines are in your file as it just processes them one at a time:

awk 'BEGIN{FS=OFS="\""} {for (i=1;i<NF;i+=2) gsub(/ /,";",$i)} 1' file

e.g.:

$ cat file
"Mes_Referencia" "Num_Telefone" "Dsc_Plano_Tarifario" "Grupo Plano"
201211 2183223350 "INFINITY PR?" "PLANO INFINITY"
201211 2169175232 "INFINITY PR?" "PLANO INFINITY"
201211 2182676211 "INFINITY PR?" "PLANO INFINITY"
201211 2281699337 "INFINITY PR?" "PLANO INFINITY"
201211 2179173096 "INFINITY PR?" "PLANO INFINITY"
$
$ awk 'BEGIN{FS=OFS="\""} {for (i=1;i<NF;i+=2) gsub(/ /,";",$i)} 1' file
"Mes_Referencia";"Num_Telefone";"Dsc_Plano_Tarifario";"Grupo Plano"
201211;2183223350;"INFINITY PR?";"PLANO INFINITY"
201211;2169175232;"INFINITY PR?";"PLANO INFINITY"
201211;2182676211;"INFINITY PR?";"PLANO INFINITY"
201211;2281699337;"INFINITY PR?";"PLANO INFINITY"
201211;2179173096;"INFINITY PR?";"PLANO INFINITY"

edited Feb 22 '13 at 22:00

answered Feb 22 '13 at 19:39

Ed Morton

188,023
17
78
185

In this case I am afraid about the performance as this file has 70.051.095 Millions of lines! – Lucas Rezende Feb 22 '13 at 19:42
@LucasRezende - why do you think this will be slower than any other solution? – Ed Morton Feb 22 '13 at 21:58
@Fredrik - ah, thanks. I'll fix it to use `;`s. So used to seeing this as a CSV question! – Ed Morton Feb 22 '13 at 21:59
@LucasRezende - see measurements in my post. awk solution by Ed is actually the fastest solution here (even beats my sed). The accepted answer is 10 times slower! – Fredrik Pihl Feb 23 '13 at 12:45

Fredrik Pihl · Answer 3 · 2013-02-22T20:45:30.677

How about replacing the first 2 spaces with ; and every " " with ";"

$ sed 's/ /;/;s/ /;/;s/" "/";"/g' file
201211;001093223359;"PLANO ESPECIAL";"PLANO NOVO"
201211;001199175239;"PLANO ESPECIAL";"PLANO NOVO"
201211;001292676219;"PLANO ESPECIAL";"PLANO NOVO"

use the -i switch to make the changes inline.

some timings using a file with 30000003 lines:

$ time sed 's/ /;/;s/ /;/;s/" "/";"/g' f1 > /dev/null

real    1m58.305s
user    1m54.811s
sys 0m1.488s

$ time awk 'BEGIN{FS=OFS="\""} {for (i=1;i<NF;i+=2) gsub(/ /,",",$i)} 1' f1 > /dev/null

real    1m46.916s
user    1m45.831s
sys 0m0.852s


$ time sed -E -e 's/ ([^ "]*|"[^"]*")/;\1/g' f1 > /dev/null

real    20m52.172s
user    20m47.430s
sys 0m2.536s

i.e. BIG penalty for the gready operator and back referencing!

score 0 · Answer 4 · answered Feb 22 '13 at 18:50

0

awk should do the trick.

awk -v OFS=";" '{print $1,$2,$3" "$4,$5" "$6}'

answered Feb 22 '13 at 18:50

Matt LaFave

569
6
17

It should work, but the problem is that the string fields are not always like that. The number of spaces changes. For example, this could be "PLANO ESPECIAL NATAL" or "PLANO TARIFADO PROMOCAO FALE MAIS ILIMITADO". – Lucas Rezende Feb 22 '13 at 18:53
It must be something dynamic to ignore spaces between "". =/ – Lucas Rezende Feb 22 '13 at 18:56
I don't know exactly. It is the one month of calls in a telecom company! Too many options I guess :S – Lucas Rezende Feb 22 '13 at 18:57
I was thinking if it was possible to perform two awk commands in the same row. For example (it doesn't work!): head -30 file.txt | awk -F" " '{ print $1";"$2";"}' && awk -F"\"" '{ print $2 ..... }' – Lucas Rezende Feb 22 '13 at 18:58
Maybe mix sed with regex... sed "(space)" to ";" where "(space)" not between "". I just can't figure out if it is possible. – Lucas Rezende Feb 22 '13 at 19:02
This file has 70.051.095 Millions of lines! – Lucas Rezende Feb 22 '13 at 19:13

score 0 · Answer 5 · answered Feb 22 '13 at 20:46

0

Try:

awk 'NR%2{gsub(/[ \t]+/,";")}1' RS=\" ORS=\" file

answered Feb 22 '13 at 20:46

Scrutinizer

9,608
1
21
22

1

Nice approach but will add a newline and double quote to the end of the output. – Ed Morton Feb 22 '13 at 22:11
Good catch - Thanks.. It's only a double quote, no? – Scrutinizer Feb 22 '13 at 22:36
I think what it would add depends on whether or not the last line in the file ends with a " and whether it's an even or odd number lines might play into it too. The point is just that you'd need to tweak the script a bit to get it to never add something... In GNU awk IIRC you usually do it by testing RT. – Ed Morton Feb 22 '13 at 22:44
I could not find a difference between even and odd lines. What do you mean with RT an IIRC? A quick possible fix that comes to mind: `awk 'END{printf "\n"} NR%2{gsub(/[ \t]+/,";")}NF' RS=\" ORS=\" file` but then a regular for loop might be more straight forward.. – Scrutinizer Feb 22 '13 at 22:57
1

IIRC is "If I Remember Correctly". RT in gawk is the Record Terminator - that's the string that matched your RS regexp for the current record so it won't be set for the newline after the final " in the input file since there's no RS after it. Try `awk 'NR%2{gsub(/[ \t]+/,";")}RT; END{printf "\n"}' RS=\" ORS=\" file` but I think there's probably a more concise script possible using this same approach. – Ed Morton Feb 23 '13 at 02:01

Delimiter substitution excluding character inside string column

5 Answers5