
I've been trying to create an awk script that will take my .csv and convert it into a .tsv. I have managed to achieve this with the script below:

$ awk 'BEGIN { FS=","; OFS="\t" } {$1=$1; print}' file.csv > file.tsv

Example csv data

"visitor_date","country","browser","browser_version","id1","id2","id3","id4","true or false column"
"05/10/2019 00:55","United States","App","","692467","2163702","4117512086","format","0"
"05/10/2019 00:56","United States","SamsungBrowser","8","692467","2163702","4117512083","format","1"
"05/10/2019 01:08","United States","safari","11","692467","2163704","4117511299","format","1"
"05/10/2019 03:25","United States","SamsungBrowser","8","692467","2163704","4117511299","format",""
"05/10/2019 07:50","United States","chrome","58","692467","2163704","4117511299","format",""
"05/10/2019 09:18","United States","internet explorer","11","692467","2163702","4117512086","format","0"

However, I've found that my data file tends to have all of the columns quoted as strings with `""`, even for values that might be considered integers.

My initial test was to try to remove the `""` from everything and then define certain columns as strings, but I get an error when I try to quote it.

$ awk 'BEGIN { FS="",""; OFS="\t" } {$1=$1; print}' file.csv > file.tsv

Error

awk: cmd. line:1: BEGIN { FS="",""; OFS="\t" } {$1=$1; print}
awk: cmd. line:1:              ^ syntax error

From the research I've done online, all I seem to find is how to change a string value in a cell to a different integer value. Is there a way that I can transform certain columns (not cells) to be treated as "integers" and "dates" instead of all as strings? Is my initial test even the right direction, or is there a different way with AWK to decide what my columns are?

Desired tsv Output

"visitor_date"  "country    browser"    "browser_version"   "client_transaction_id" "id1"   "id2"   "id3"   "id4"   "true or false column"
05/10/2019 00:55    "United States" "App"       null    692467  "2163702"   4117512086  "format"    0
05/10/2019 00:56    "United States" "SamsungBrowser"    8   "null"  692467  2163702 4117512083  "format"    1
05/10/2019 01:08    "United States" "safari"    11  null    "692467"    2163704 4117511299  "format"    1
05/10/2019 03:25    "United States" "SamsungBrowser"    8   "null"  692467  2163704 4117511299  "format"    
05/10/2019 07:50    "United States" "chrome"    58  null    "692467"    2163704 4117511299  "format"    
05/10/2019 09:18    "United States" "internet explorer" 11  "null"  692467  2163702 4117512086  "format"    0

Note: The "visitor_date" column would need to be a "date" format.

Apologies if this is something super obvious or basic. I've currently only got 2 days' experience using an Ubuntu interface on Windows 10 and AWK, so there is a lot I'm trying to take in.

tripleee
El_Birdo
  • Please do post your sample expected output in your post and let us know then. – RavinderSingh13 Oct 30 '19 at 14:30
  • See here: https://unix.stackexchange.com/questions/359832/converting-csv-to-tsv – gregory Oct 30 '19 at 14:40
  • You would like to remove all quotes (see title) but you do not remove all quotes (see example output). Please explain. – kvantour Oct 30 '19 at 15:19
  • Don't use `awk` to process a CSV file; use a proper CSV parser that knows why the quotes are there and will handle quoted strings (which could contain commas) properly. – chepner Oct 30 '19 at 15:32
  • It's **extremely** unclear what it is you're trying to do. You say you want to `Remove quotes "…" from all columns` but then your posted expected output doesn't have all quotes removed. You have some `null` strings in your expected output that are within quotes and some that aren't. You talk about converting strings to integers and needing to display a "date format" for some case, etc. Please take a minute to figure out exactly what it is you're trying to do and then update your question to tell us what that is with an example that clearly demonstrates your requirements. – Ed Morton Oct 30 '19 at 15:52
  • In general, to work with CSVs in awk see [whats-the-most-robust-way-to-efficiently-parse-csv-using-awk](https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk) – Ed Morton Oct 30 '19 at 15:53
  • @EdMorton apologies - I agree with what you are saying. I think I was spending so much time focusing on getting to the final "desired" format, where what I should have focused on was getting the file "clean" so it could be picked up and defined in an SQL server. Based on that I found an answer that resolved the issue. I'll work to be clearer and more focused in future :) – El_Birdo Oct 31 '19 at 14:57

2 Answers


If there are no double quotes you want to preserve, simply

awk 'BEGIN { FS=","; OFS="\t" } { gsub("\"", "") } { $1=$1 } 1' file.csv >file.tsv

Incidentally, this also demonstrates how to include a literal double-quote character inside a double-quoted string: backslash-escape it.
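A minimal illustration of that escaping (the string here is just a made-up example):

```shell
# Inside a double-quoted awk string, \" stands for a single literal double quote.
awk 'BEGIN { print "\"quoted\"" }'
# prints: "quoted"
```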

If you only want to remove double quotes around some columns, you can limit the scope of the gsub. For example,

... { gsub("\"", "", $1); gsub("\"", "", $3) } 1

to only manipulate columns 1 and 3. Then you can drop the { $1 = $1 } because we are forcing Awk to recalculate its columns anyway (so the FS gets replaced with OFS).
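As a quick sanity check (the three-field input below is just a made-up sample), assigning to the fields is exactly what forces the record to be rebuilt with OFS:

```shell
# Unquote fields 1 and 3 only; field 2 keeps its quotes.
# The gsub() assignments rebuild the record, so commas become tabs.
printf '"a","b","c"\n' |
    awk 'BEGIN { FS=","; OFS="\t" } { gsub("\"", "", $1); gsub("\"", "", $3) } 1'
# prints: a	"b"	c
```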

To not manipulate the first line (which commonly contains the header, not data), add a condition:

... NR>1 { gsub(...) }1
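One subtlety: with `NR>1` alone the header record is never rebuilt, so its commas survive. A sketch (sample input made up) that converts the header's separators too, while only unquoting the data rows:

```shell
printf '"h1","h2"\n"a","b"\n' |
    awk 'BEGIN { FS=","; OFS="\t" }
         NR == 1 { $1 = $1 }            # rebuild the header so , becomes \t
         NR > 1  { gsub("\"", "", $1) } # unquote column 1 in data rows only
         1'
# prints:
# "h1"	"h2"
# a	"b"
```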

The "replace everywhere" can be expressed quite succinctly in sed, though it tends to be more of a write-only language.

sed -e 's/,/\t/g' -e 's/"//g' file.csv >file.tsv

Not all sed dialects interpret \t as a literal tab, and some are even allergic to multiple -e options; but this should work fine on Ubuntu and generally Linux (other than possibly Busybox and other similar attempts to relive the constraints of the jolly 1970s).
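If your sed turns out to be one of the dialects that doesn't understand `\t`, a common workaround is to splice a real tab character in via the shell (filenames follow the question's):

```shell
# Capture a literal tab in a variable, then interpolate it into the sed expression.
tab=$(printf '\t')
sed -e "s/,/$tab/g" -e 's/"//g' file.csv > file.tsv
```

The same caveat applies as above: this replaces every comma, including any that were inside quoted fields.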

tripleee
  • Then again, if the first case is to remove _all_ doublequotes, I'd suggest passing through `tr -d '"'` before going to awk at all. – kojiro Oct 30 '19 at 14:54
  • Thanks, I've tried the awk suggestion but the output tsv still contains the "," separators. I tried running `awk 'BEGIN { FS=","; OFS="\t" } { gsub("\"", "") } 1' file.csv > file.tsv | awk 'BEGIN { FS=","; OFS="\t" } {$1=$1; print}' file.tsv` but that seems to not remove the `,` and change it to a `\t`. – El_Birdo Oct 30 '19 at 14:55
  • @kojiro I tried your suggestion in a couple of ways: 1. `tr -d '"' file.csv | awk 'BEGIN { FS=","; OFS="\t" } {$1=$1; print}' file.csv > file.tsv` 2. `tr -d '"' file.csv > file.tsv` 3. `tr -d '"' file.csv` — each time I get `tr: extra operand 'file.csv' Only one string may be given when deleting without squeezing repeats.` Any idea on my error? – El_Birdo Oct 30 '19 at 15:01
  • `tr -d '"' < file.csv | tr ',' '\t' > file.tsv`? Notably this will only work if the input is a very clean csv that didn't need the quotes in the first place, because there were no commas in the field values anyway. – kojiro Oct 30 '19 at 15:24
  • @kojiro Unfortunately the OP updated their question so `tr` is out anyway. – tripleee Oct 30 '19 at 18:36
  • @El_Birdo See updated answer now. I was lazy and assumed the `$1=$1` would not be necessary, but it is if you don't `gsub(...)` on individual fields. As per your updated requirements it seems like you want to `gsub` on individual fields and then you can drop the `$1 = $1` after all. – tripleee Oct 30 '19 at 18:40
  • @tripleee that's understandable. The number of real-world csvs that are as clean as you'd need to use tr to convert them is pretty small. – kojiro Oct 30 '19 at 19:33
  • @tripleee thanks for the input. I managed to get the first suggestion to work. Alternatively I also found I could make the following work: `cat file.csv | tr ',' ' ' | tr -d '"'` (and then convert etc). – El_Birdo Oct 31 '19 at 14:54
  • You might want to throw a `gsub(OFS, " ")` in there as the first step just to be safe. – Ed Morton Oct 31 '19 at 15:27

It's not awk, but it's very easy using Miller (https://github.com/johnkerl/miller). Running

mlr --c2t cat ./input.csv > ./output.tsv

you will get

visitor_date    country browser browser_version id1 id2 id3 id4 true or false column
05/10/2019 00:55    United States   App     692467  2163702 4117512086  format  0
05/10/2019 00:56    United States   SamsungBrowser  8   692467  2163702 4117512083  format  1
05/10/2019 01:08    United States   safari  11  692467  2163704 4117511299  format  1
05/10/2019 03:25    United States   SamsungBrowser  8   692467  2163704 4117511299  format  
05/10/2019 07:50    United States   chrome  58  692467  2163704 4117511299  format  
05/10/2019 09:18    United States   internet explorer   11  692467  2163702 4117512086  format  0
aborruso