
I'm trying to split multiple comma-separated values into rows.

I have managed it for a small number of columns with comma-separated values (with awk), but in my real table I have to do it for 80 columns. So I'm looking for a way to iterate.

An example of input that I need to split:

CHROM  POS  REF  ALT  GT_00  C_00  D_00  E_00  F_00  GT_11
chr1   10   T    A    2,2    1,1   0,1   1,2   1,0   2
chr1   10   T    G    3      2     1     2     0     0

The expected output:

chr1   10   T    A    2      1     0     1     1     2
chr1   10   T    A    2      1     1     2     0     2
chr1   10   T    G    3      2     1     2     0     0

I have done it with the following code:

awk 'BEGIN{FS=OFS="\t"}
{
  j=split($5,a,","); split($6,b,",");
  split($7,c,","); split($8,d,","); split($9,e,",");
  for(i=1;i<=j;++i) {
    $5=a[i]; $6=b[i]; $7=c[i]; $8=d[i]; $9=e[i]; print
  }
}'

But, as I have said before, there are 80 columns (or more) with comma-separated values in my real data.

Is there a way to do it using iteration?

Note: I need to do it in bash (not MySQL, SQL, python...)
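
For context, a rough sketch of how the hard-coded splits above could be replaced by a loop over the fields, under the same assumptions as the snippet (tab-separated input, and every comma-separated field on a row holding the same number of values); the answers below give complete solutions:

awk 'BEGIN{FS=OFS="\t"}
{
  # find how many values the comma-separated fields on this row hold
  n = 1
  for (i=5; i<=NF; i++) { k = split($i, tmp, ","); if (k > n) n = k }
  # emit one output row per value index
  for (j=1; j<=n; j++) {
    out = $1 OFS $2 OFS $3 OFS $4
    for (i=5; i<=NF; i++) {
      k = split($i, a, ",")
      out = out OFS (j <= k ? a[j] : a[1])   # reuse the first value if a field has fewer entries
    }
    print out
  }
}'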

  • Line 2, column 5 should be 2? – dibery Aug 29 '19 at 07:53
  • @dibery Yes, I was mistaken; I have just edited it. Thanks! – Marta_ma Aug 29 '19 at 07:56
  • Could there be more like this: `2,3,4` or `2,34`? – Jotne Aug 29 '19 at 07:58
  • @Jotne Both cases, but considering one row, the number of comma-separated values in a cell will be the same. It is not possible to have, in the same row, a column with `1,2,3` and another with `1,2` – Marta_ma Aug 29 '19 at 08:03
  • So you can have `1,2,3`? On your first line `2,2 1,1 0,1 1,2 1,0 2` you have a mix of values with and without commas. If there can be more than one comma, edit your post with an example. It will complicate things a lot, since we then have to test all fields for how many commas they may contain. – Jotne Aug 29 '19 at 08:23
  • @kvantour in OP's input only a range of fields needs to be split, and he/she says these fields all have the same number of commas in them – oguz ismail Aug 30 '19 at 07:48

2 Answers


This awk may do:

file:

chr1    10      T       A       2,2     1,1     0,1     1,2     1,0     2
chr1    10      T       G       3       2       1       2       0       0
chr1    10      T       C       5       1,2,3   4,2,1   7       1,8,3   3
chr1    10      T       D       1,2,3,5 4,2,1,8 1,8,3,2 3       5       7

Solution:

awk '{
    # find the largest number of comma-separated values in fields 5..NF
    n=0;
    for(i=5;i<=NF;i++) {
        t=split($i,a,","); if(t>n) n=t
    }
    # print one output line per value index, reusing the first value
    # when a field has fewer entries than the longest one
    for(j=1;j<=n;j++) {
        printf "%s\t%s\t%s\t%s",$1,$2,$3,$4;
        for(i=5;i<=NF;i++) {
            t=split($i,a,",");
            printf "\t%s",(j<=t ? a[j] : a[1])
        }
        print ""
    }
}' file
chr1    10      T       A       2       1       0       1       1       2
chr1    10      T       A       2       1       1       2       0       2
chr1    10      T       G       3       2       1       2       0       0
chr1    10      T       C       5       1       4       7       1       3
chr1    10      T       C       5       2       2       7       8       3
chr1    10      T       C       5       3       1       7       3       3
chr1    10      T       D       1       4       1       3       5       7
chr1    10      T       D       2       2       8       3       5       7
chr1    10      T       D       3       1       3       3       5       7
chr1    10      T       D       5       8       2       3       5       7

Your test input gives:

chr1    10      T       A       2       1       0       1       1       2
chr1    10      T       A       2       1       1       2       0       2
chr1    10      T       G       3       2       1       2       0       0

It does not matter whether the comma-separated values sit in consecutive fields, as long as you do not mix fields with different numbers of values (e.g. two and three) on the same line.
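
For illustration (a made-up line, not part of the question's data): piping the single line chr1 10 T E 1,2 3,4,5 6 through the script above would give

chr1    10      T       E       1       3       6
chr1    10      T       E       2       4       6
chr1    10      T       E       1       5       6

where the first comma-separated field falls back to its first value on the last row, which is usually not what you want.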

  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/198714/discussion-on-answer-by-jotne-how-can-i-split-comma-separated-values-into-multip). – Samuel Liew Aug 30 '19 at 09:17

Here is another awk. In contrast to the previous solution, where we split fields into arrays, this one attacks the problem differently using substitutions. There is no field iteration going on:

awk '
BEGIN { OFS="\t" }
     { $1=$1; t=$0 }                                 # rebuild the record with tab separators, keep a copy in t
     { while (index($0,",")) {
         gsub(/,[[:alnum:],]*/,""); print            # keep only the first value of every comma-separated field
         $0=t; gsub(OFS "[[:alnum:]]*,",OFS); t=$0   # drop the first value of every field, keep the remainder
       }
       print t
     }' file

How does it work? The idea is based on two types of substitutions:

  1. `gsub(/,[[:alnum:],]*/,"")`: this removes all substrings made of alphanumeric characters and commas that start with a comma: 1,2,3,4 -> 1. Fields without a comma are left unchanged.
  2. `gsub(OFS "[[:alnum:]]*,",OFS)`: this removes the alphanumeric characters followed by a single comma at the beginning of a field: 1,2,3,4 -> 2,3,4

So, using these two substitutions, we iterate until no comma is left. See How can you tell which characters are in which character classes? for details on `[[:alnum:]]`.
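
As a minimal demonstration (not part of the original answer, using a made-up record), the two substitutions behave like this:

printf 'chr1\t10\tT\tA\t1,2,3\t4,5,6\n' |
awk 'BEGIN{OFS="\t"} {
  $1=$1; t=$0
  gsub(/,[[:alnum:],]*/,""); print      # first pass keeps only the first values: chr1 10 T A 1 4
  $0=t; gsub(OFS "[[:alnum:]]*,",OFS)   # dropping them leaves the remainder:     chr1 10 T A 2,3 5,6
  print
}'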

input:

chr1    10      T       A       2,2     1,1     0,1     1,2     1,0     2
chr1    10      T       G       3       2       1       2       0       0
chr1    10      T       C       5       1,2,3   4,2,1   7       1,8,3   3
chr1    10      T       D       1,2,3,5 4,2,1,8 1,8,3,2 3       5       7

output:

chr1    10  T   A   2   1   0   1   1   2
chr1    10  T   A   2   1   1   2   0   2
chr1    10  T   G   3   2   1   2   0   0
chr1    10  T   C   5   1   4   7   1   3
chr1    10  T   C   5   2   2   7   8   3
chr1    10  T   C   5   3   1   7   3   3
chr1    10  T   D   1   4   1   3   5   7
chr1    10  T   D   2   2   8   3   5   7
chr1    10  T   D   3   1   3   3   5   7
chr1    10  T   D   5   8   2   3   5   7
  • Thanks for this different approach! @kvantour Which of these three different solutions is faster? In my real dataset I have hundreds of millions of rows and 130 columns – Marta_ma Aug 30 '19 at 09:09
  • @Marta_ma here is a combined answer to all your questions: **(A)** yes, the script handles numbers bigger than 9, but not floating-point numbers. **(B)** comma-separated strings with letters or numbers are handled equally, so yes, the third column can contain words. **(C)** this method will be faster than the split method, as a single `gsub` beats `NF-5` calls to `split`. – kvantour Aug 30 '19 at 09:38
  • It doesn't seem to work properly. I think it's because I don't always have a single letter in the third column; in some cases I have the word `None` in that field – Marta_ma Aug 30 '19 at 09:44
  • It happens when I have a row like this: `chr1 11 None,G None,T 2 1 0 3 2 2` – Marta_ma Aug 30 '19 at 10:18
  • Problem solved! I'm used to writing the FS like `awk -F"\t"`. I had changed it but forgot to add `OFS="\t"` at the end – Marta_ma Aug 30 '19 at 10:24
  • @Marta_ma Happy to hear that. If a tab is your initial field separator, you can use `BEGIN{FS=OFS="\t"}`; see the sketch after these comments – kvantour Aug 30 '19 at 10:27
  • I'll try both options with my real data and choose the faster one! :) – Marta_ma Aug 30 '19 at 10:30
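
For completeness, a minimal sketch (not part of the original answer) of the substitution approach with the field separators set explicitly, as suggested in the comments above; it assumes the real data is tab-separated:

awk '
BEGIN { FS=OFS="\t" }
     { $1=$1; t=$0 }
     { while (index($0,",")) {
         gsub(/,[[:alnum:],]*/,""); print
         $0=t; gsub(OFS "[[:alnum:]]*,",OFS); t=$0
       }
       print t
     }' file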