
I have some TSV files that I need to convert to CSV files. Is there any solution in BASH, e.g. using awk, to convert these? I could use sed, like this, but am worried it will make some mistakes:

sed 's/\t/,/g' file.tsv > file.csv
  • Quotes needn't be added.

How can I convert a TSV to a CSV?

mklement0
Village
    If tabs/commas inside of quoted strings are to be ignored, this is a bit more complex. – Jonathon Reinhart Mar 15 '14 at 05:30
  • The original file doesn't use quoted strings and the output file does not need them added either. – Village Mar 15 '14 at 05:32
  • @Village `tr` would be the right tool for the job as stated but what kind of mistakes are you worried sed might make? Can you update your question to show some sample input that you think sed might not handle correctly? – Ed Morton Mar 15 '14 at 10:07

5 Answers


Update: The following solutions are not generally robust, although they do work in the OP's specific use case; see the bottom section for a robust, awk-based solution.


To summarize the options (interestingly, they all perform about the same):

tr:

devnull's solution (provided in a comment on the question) is the simplest:

tr '\t' ',' < file.tsv > file.csv

sed:

The OP's own sed solution is perfectly fine, given that the input contains no quoted strings (with potentially embedded \t chars.):

sed 's/\t/,/g' file.tsv > file.csv

The only caveat is that on some platforms (e.g., macOS) the escape sequence \t is not supported, so a literal tab char. must be spliced into the command string using ANSI quoting ($'\t'):

sed 's/'$'\t''/,/g' file.tsv > file.csv

awk:

The caveat with awk is that FS - the input field separator - must be set to \t explicitly; otherwise, the default behavior would strip leading and trailing tabs and collapse interior runs of multiple tabs into a single ,:

awk 'BEGIN { FS="\t"; OFS="," } {$1=$1; print}' file.tsv > file.csv

Note that simply assigning $1 to itself causes awk to rebuild the input line using OFS - the output field separator; this effectively replaces all \t chars. with , chars. print then simply prints the rebuilt line.
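To see the caveat in action, here's a quick check with a made-up line containing an empty middle field (the data is illustrative):

```shell
# Illustrative record: 'a', an empty middle field, 'b', tab-separated.
printf 'a\t\tb\n' |
  awk 'BEGIN { OFS="," } { $1=$1; print }'            # default FS: prints a,b  (empty field lost)

printf 'a\t\tb\n' |
  awk 'BEGIN { FS="\t"; OFS="," } { $1=$1; print }'   # prints a,,b  (empty field kept)
```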


Robust awk solution:

As A. Rabus points out, the above solutions do not handle unquoted input fields that themselves contain , characters correctly - you'll end up with extra CSV fields.

The following awk solution fixes this, by enclosing such fields in "..." on demand (see the non-robust awk solution above for a partial explanation of the approach).

If such fields also have embedded " chars., these are escaped as "", in line with RFC 4180. Thanks, Wyatt Israel.

awk 'BEGIN { FS="\t"; OFS="," } {
  rebuilt=0
  for(i=1; i<=NF; ++i) {
    if ($i ~ /[,"]/ && $i !~ /^".*"$/) {
      gsub("\"", "\"\"", $i)
      $i = "\"" $i "\""
      rebuilt=1 
    }
  }
  if (!rebuilt) { $1=$1 }
  print
}' file.tsv > file.csv
  • $i ~ /[,"]/ && $i !~ /^".*"$/ detects any field that contains , and/or " and isn't already enclosed in double quotes

  • gsub("\"", "\"\"", $i) escapes embedded " chars. by doubling them

  • $i = "\"" $i "\"" updates the result by enclosing it in double quotes

  • As stated before, updating any field causes awk to rebuild the line from the fields using the OFS value - , in this case - which amounts to the effective TSV -> CSV conversion; the rebuilt flag ensures that records whose fields needed no quoting are still rebuilt (via $1=$1), so every output line uses , as the separator.
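As a sanity check, here's the script applied to a single made-up record whose second field contains a comma and whose third contains an embedded double quote (the data is illustrative):

```shell
printf 'one\ttwo, three\tsay "hi"\n' |
awk 'BEGIN { FS="\t"; OFS="," } {
  rebuilt=0
  for (i=1; i<=NF; ++i) {
    if ($i ~ /[,"]/ && $i !~ /^".*"$/) {
      gsub("\"", "\"\"", $i)   # double embedded quotes, per RFC 4180
      $i = "\"" $i "\""        # enclose the field in double quotes
      rebuilt=1
    }
  }
  if (!rebuilt) { $1=$1 }
  print
}'
# -> one,"two, three","say ""hi"""
```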

mklement0
  • wrt `The caveat with awk is that FS - the input field separator - must be set to \t explicitly` - that's no more a caveat for awk than it is for `tr` or `sed`. In all 3 tools you need to specify the input field separator and what you want it converted to and if you don't do that then obviously you won't get the behavior you want. – Ed Morton Mar 15 '14 at 10:02
    @EdMorton: Since `\t` is _among_ the characters that `awk` treats as input field separators _by default_, one might be led to assume that setting the input field separator is not necessary here - which is a fallacy for the reasons I pointed out; hence the caveat. As for `tr` and `sed`: the concept of an input field separator is not applicable. – mklement0 Mar 15 '14 at 10:11
  • When I said `input field separator` I just meant `the character that separates your input into fields (values)` which applies to the stated input format (`tab-separated-values`), not any specific tool. I just don't see having to specify that for awk to tell it how to behave as desired as any more of a caveat than specifying it for the other tools - it's exactly the same concept in all 3 and if you don't do it then you won't get the desired behavior. Anyway, `tr` is the right tool for the job as stated so it's maybe a moot point. – Ed Morton Mar 15 '14 at 10:28

This can also be achieved with Perl:

In order to pipe the results to a new output file you can use the following:
perl -wnlp -e 's/\t/,/g;' input_file.tsv > output_file.csv

If you'd like to edit the file in place, you can invoke the -i option:
perl -wnlpi -e 's/\t/,/g;' input_file.txt

If by some chance you find that what you are dealing with is not actually tabs but runs of spaces, you can use the following to replace each run of one or more whitespace characters with a comma:
perl -wnlpi -e 's/\s+/,/g;' input_file

Keep in mind that \s matches any whitespace character, including spaces, tabs, and newlines; it is only meaningful in the pattern, not in the replacement string.

Toby
  • You could also use vim. Just use the following search and replace from command mode: `:%s/\t/,/g` This allows you to immediately view your results and undo them with a single button press (u) if needed. – Toby Oct 08 '14 at 00:02
    If you're already using Perl, you might as well also use https://metacpan.org/pod/Text::CSV – Robert Jul 18 '18 at 15:40

Using awk works for me

converting tsv to csv

awk 'BEGIN { FS="\t"; OFS="," } {$1=$1; print}' file.tsv > file.csv

or converting csv to tsv

awk 'BEGIN { FS=","; OFS="\t" } {$1=$1; print}' file.csv > file.tsv
wpmoradi

The tr command:

tr '\t' ',' < file.tsv > file.csv

is simple and gave correct and very quick results for me, even on a really large file (approx. 10 GB).

Pranav

You can simply use the power of sed in shell:

sed -r 's/\t/","/g' file.tsv|sed -r 's/(^|$)/"/g' > file.csv

In general, the above command turns your TSV file into CSV by quoting every field. However, the TSV file may contain numerical fields, which shouldn't be surrounded by quotes (e.g., "123456" should be 123456). So we need another stage that removes the double quotes around such fields. The final solution:

sed -r 's/\t/","/g' file.tsv|sed -r 's/(^|$)/"/g'|sed -r 's/"([0-9]+)"/\1/g' > file.csv
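For instance, with a made-up record mixing numeric and text fields (GNU sed assumed, given the -r option and the \t escape):

```shell
printf '123\tfoo\t456\n' |
  sed -r 's/\t/","/g' |           # turn each tab into ","
  sed -r 's/(^|$)/"/g' |          # add the outermost quotes
  sed -r 's/"([0-9]+)"/\1/g'      # strip quotes from purely numeric fields
# -> 123,"foo",456
```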
Mehdi Hamidi