
My regex didn't work as the field separator (`-F`) on awk's command line with a CSV file.

My CSV is separated by commas (,), but some fields have commas inside them too.

The data.csv looks like this:

t1,t2,t3,t4
field without comma,f02,f03,f04
field, with comma,f12,f13,f14
field without comma,f22,f23,f24
field without comma,f22,f23,f34

If we look at `field, with comma,f12,f13,f14`, we have two kinds of commas:

  1. a comma that is part of the data (inside a field), as in `field, with comma`; and
  2. a comma that separates fields, as in `,f12,f13,f14`.

So I tried awk with `-F` and a regex:

awk -F'/\B\,/\B/' '!seen[$2]++' data.csv > resulted.csv

My strategy was: the field separator needs to be a comma (`\,`) at a no-word-boundary (`\B`).

But my command didn't output resulted.csv; instead it printed warnings:

gawk: warning: escape sequence `\B' treated as plain `B'
gawk: warning: escape sequence `\,' treated as plain `,'

The desired resulted.csv should have the repeated lines removed, like:

t1,t2,t3,t4
field without comma,f02,f03,f04
field, with comma,f12,f13,f14
field without comma,f22,f23,f24
paulolimac
  • That's not a valid CSV file. If a field contains a comma, it needs to be enclosed in quotes. Otherwise there's no way to tell that the comma is inside the field rather than a delimiter. – Barmar Aug 15 '19 at 21:37
  • @Barmar, I agree that it is not a valid CSV (it needs a delimiter, a separator, escaping of the chars in field contents, etc.). But that is the data, as is. Even so, look at my observation about the two kinds of commas. If the _comma is bounded by words on both its left and right_ (word = letters, digits, underscores), then it is a _separator_; if the comma is bounded by a word on only one side, then it is a piece of content. My problem is that the regex didn't accept the `\B` nor the `\,`. – paulolimac Aug 15 '19 at 22:28
  • `uniq data.csv` ? – Cyrus Aug 15 '19 at 22:31
  • @Cyrus, I can't use `uniq` here because all lines are unique. Awk here needs to: separate the columns with a regex (`-F`), take one column to be used as a key (`$2`), filter on it (`!seen`), and output all. – paulolimac Aug 15 '19 at 22:50
  • `awk` doesn't use PCRE; it doesn't support many of the escape sequences. – Barmar Aug 15 '19 at 23:40
  • @paulolimac [you say your lines are all unique](https://stackoverflow.com/questions/57516519/how-to-separate-csv-collumns-by-awk-with-a-comma-being-the-field-separator#comment101501980_57516519) but they aren't in your example so fix it to be more truly representative of your real data. You also seem to have decided on a specific approach that may not be the best, and you're talking about using $2 as a key when doing so with your sample input wouldn't produce your expected output, etc. Please fix your question to provide a more truly realistic example we can test a potential solution against. – Ed Morton Aug 16 '19 at 00:54
  • This is invalid CSV, as indicated by the comment of @Barmar. Imagine you have a line `field,with comma,f12,f13,f14`, and for some reason the space is gone due to a programming inconsistency. Then you cannot tell whether the field is `field,with comma` or `with comma,f12` or `f12,f13`. As indicated earlier, your line should read `"field, with comma",f12,f13,f14`. If that is the case, you can now easily use https://stackoverflow.com/questions/45420535 – kvantour Aug 16 '19 at 09:28
  • @EdMorton thanks for your comment. You're right about the unique entries, so I updated the lines to be unique now. Unfortunately I can't put my real data here nor mock them. – paulolimac Aug 20 '19 at 02:38
  • @kvantour, yes, my CSV is invalid because it doesn't have delimiters (aka quotes). But it is workable because of the word boundaries around the commas (e.g., the whitespace). – paulolimac Aug 20 '19 at 02:41
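
As Barmar points out above, awk regexps are POSIX ERE rather than PCRE, which is why `\B` and `\,` are rejected as escape sequences. GNU awk does, however, provide its own word-boundary operators (`\<`, `\>`, `\y`). A sketch of the original word-boundary idea using them, assuming GNU awk and doubling the backslash so it survives awk's string-escape processing (untested against other awk implementations):

# gawk only: ",\<" matches a comma immediately followed by the start of a word,
# so the in-field ", " (comma + space) is not treated as a separator
gawk -F ',\\<' '!seen[$2]++' data.csv > resulted.csv

Unlike `,[^ ]` in the first answer below, this separator is zero-width after the comma, so it does not consume the first character of the next field.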

3 Answers


With GNU awk:

awk -F ',[^ ]' '!seen[$2]++' data.csv

Output:

t1,t2,t3,t4
field without comma,f02,f03,f04
field, with comma,f12,f13,f14
field without comma,f22,f23,f24
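
Note that the separator `,[^ ]` also consumes the single non-space character after each comma, so `$2` is not the literal second column; it is still a consistent deduplication key, though. A quick check of what `$2` holds for the sample data.csv:

$ awk -F ',[^ ]' '{print $2}' data.csv
2
02
12
22
22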
Cyrus
  • Why GNU awk? Seems like valid generic awk. – kvantour Aug 16 '19 at 09:25
  • Tested with GNU awk. – Cyrus Aug 16 '19 at 13:16
  • @Cyrus, thanks for your answer, it worked on my dataset. Luckily, none of my data has a special boundary (e.g. special chars); with only white-spaces it worked. Thanks again. – paulolimac Aug 20 '19 at 02:01
  • @kvantour I'm using gawk in my question (see the warning messages). I don't know whether there are differences (incompatibilities) between [different implementations](https://en.wikipedia.org/wiki/AWK#Versions_and_implementations), especially in regex terms. – paulolimac Aug 20 '19 at 02:09

Without GNU awk, with your data, you can use gsub to replace the ", " string with some non-conflicting characters such as "__", separate the fields as normal on ",", and then restore the comma within each field (e.g. ", ") using gsub again. For example:

awk -F, -v OFS=, '
    { gsub(/, /,"__"); for (i = 1; i <= NF; i++) gsub(/__/,", ", $i) }
    !seen[$0]++
' file.csv

Above, `gsub(/, /,"__")` replaces all occurrences of ", " with two underscores in the input record. Then, looping over each field, any "__" is replaced with ", ", restoring the original comma in the field.

Example Use/Output

Given your data, the above results in:

$ awk -F, -v OFS=, '
>     { gsub(/, /,"__"); for (i = 1; i <= NF; i++) gsub(/__/,", ", $i) }
>     !seen[$0]++
> ' file.csv
t1,t2,t3,t4
field without comma,f02,f03,f04
field, with comma,f12,f13,f14
field without comma,f22,f23,f24
David C. Rankin
  • In the general case, we would need to use a substitution value (here it's `"__"`) that is guaranteed not to appear in the data. Consider the result given a csv line like this: `field__without__comma,2,3,4` – spencer7593 Aug 16 '19 at 00:35
  • Yes, thank you, I thought it was clear *"with your data"* and *"replace the `", "` string with some non-conflicting characters"*. This wasn't intended as a fix-all to be blindly applied to all cases. – David C. Rankin Aug 16 '19 at 00:37
  • My apologies. I glazed right over "*non-conflicting*" as meaning that we need to carefully select a special character sequence, special in the sense that the sequence is guaranteed not to appear in the data. It's worth highlighting this point for future readers that come across this answer. +10 – spencer7593 Aug 16 '19 at 00:41
  • FWIW I usually use `RS` as the "non-conflicting characters" since, when RS is a character as in this case (or a non-regexp string), it can't be present in the current record. – Ed Morton Aug 16 '19 at 00:51
  • You're welcome. For other situations, have you seen the trick of how to create a string that can't be present in the input? I use it with sed at https://stackoverflow.com/a/35708616/1745001. – Ed Morton Aug 16 '19 at 01:25
  • Now you have piqued my curiosity once again. If `RS` covers each record, what could cover the entire input? The *nul-character*? (grasshopper eagerly waits to snatch additional pebbles from the master's hand `:)`) Found the link -- thank you. – David C. Rankin Aug 16 '19 at 01:28
  • Unfortunately there can be NULs in files and GNU awk at least allows you to treat them just like any other character so there really is no character that can't appear in a file. Here's how you handle that - in general `gsub(/@/,"@A")` means that `@B` (or `@`) cannot exist in the input afterwards so then you can do `gsub(/, /,"@B")` and you will KNOW that `@B` was a string that does not exist in the input at that time. Then after you do whatever you need to do you just unwind the gsubs with `gsub(/@B/,", "); gsub(/@A/,"@")` to get the original chars/strings back. – Ed Morton Aug 16 '19 at 01:35 [a sketch of this technique appears after this comment thread]
  • See https://stackoverflow.com/a/44834014/1745001 for another example where I use that approach, this time in awk. – Ed Morton Aug 16 '19 at 01:38
  • Got you. I was digesting the `/a/aA/` and `/#/aC/` but was having trouble understanding the `/__/aB/` since that potentially was a 2-for-2 swap. The additional explanation makes it clear. You create the unique *following-character* and then can be assured the character and anything else but the *following character* cannot exist in the input. – David C. Rankin Aug 16 '19 at 01:38
  • Right. This also makes it possible to do non-greedy matching with string (ie not just char) delimiters because to get the string `abc` between the inner `foo` and `bar` in `foo abc bar def bar` you can do: `gsub(/@/,"@A"); gsub(/{/,"@B"); gsub(/}/,"@C"); gsub(/foo/,"{"); gsub(/bar/,"}"); match($0,/{[^{}]+})...` and unwind the gsubs after the match. It's lengthy but robust and simple once you understand the approach. – Ed Morton Aug 16 '19 at 01:41
  • Makes sense: you eliminate what you will use to bracket the non-greedy selection (e.g. `'{'` and `'}'`), then you set the beginnings and ends for the range of your search, ending up with `"{ abc } def }"`, allowing for a `match` of what is between the brackets `{ ... }`, resulting in a non-greedy match. Glad I was shown the multi-level thought process there; it would not have been apparent without much more experience. – David C. Rankin Aug 16 '19 at 01:52
  • @DavidC.Rankin thanks for your answer. The substitution is a nice workaround too, plus all the points of interest commented here. Thank you. – paulolimac Aug 20 '19 at 02:54
  • You're welcome, but the credit for the deep thinking goes to @EdMorton here. I wager I learned as much through the comments as I hopefully helped others with the answer. – David C. Rankin Aug 20 '19 at 03:22
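
For future readers, here is a minimal sketch of the escape/unwind technique Ed Morton describes in this thread, applied to the question's data; the `@` escape character and the `@A`/`@B` sequences are arbitrary choices for illustration, not part of the original answer:

awk -F, -v OFS=, '
    {
        gsub(/@/, "@A")               # after this, the string "@B" cannot exist in the record
        gsub(/, /, "@B")              # hide the in-field ", " behind that known-absent string
        for (i = 1; i <= NF; i++) {   # fields are now split on the remaining, real commas
            gsub(/@B/, ", ", $i)      # unwind in reverse: restore ", " first ...
            gsub(/@A/, "@", $i)       # ... then restore any original "@"
        }
    }
    !seen[$0]++
' data.csv

The unwind order matters: restoring `@A` before `@B` would let an original `@B` in the data be corrupted into ", ".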

If the intent is to use the t2 column as the key value, then this is how you'd do it:

$ awk -F, '!seen[$(NF-2)]++' data.csv
t1,t2,t3,t4
field without comma,f02,f03,f04
field, with comma,f12,f13,f14
field without comma,f22,f23,f24
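
Counting fields from the end works because only the first column can contain a stray comma, so the last three commas on every line are guaranteed to be real separators. Printing `$(NF-2)` against the sample data shows the keys being deduplicated on:

$ awk -F, '{print $(NF-2)}' data.csv
t2
f02
f12
f22
f22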

If it's to use the t1 column as the key instead, then this is how you'd do that:

$ awk '{key=$0; sub(/(,[^,]+){3}$/,"",key)} !seen[key]++' data.csv
t1,t2,t3,t4
field without comma,f02,f03,f04
field, with comma,f12,f13,f14
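
This output is shorter because the key is the whole record with its last three `,field` groups stripped off, and with the sample data the second, fourth, and fifth lines all reduce to the same key:

$ awk '{key=$0; sub(/(,[^,]+){3}$/,"",key); print key}' data.csv
t1
field without comma
field, with comma
field without comma
field without comma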

If it's something else then please clarify your question and update the example.

Ed Morton
  • I selected [this answer](https://stackoverflow.com/a/57516993/1467021) because it worked on both my example and my real datasets, and was simpler. But I liked your approach of counting NF from the end; it worked too. Thanks for your contribution. – paulolimac Aug 20 '19 at 02:24