
I'm trying to split multiple comma-separated values into rows.

I have managed it for a small number of columns with comma-separated values (with awk), but in my real table I have to do it for 80 columns. So I'm looking for a way to iterate.

An example of input that I need to split:

CHROM  POS  REF  ALT  GT_00  C_00  D_00  E_00  F_00  GT_11
chr1   10   T    A    2,2    1,1   0,1   1,2   1,0   2
chr1   10   T    G    3      2     1     2     0     0

The expected output:

chr1   10   T    A    2      1     0     1     1     2
chr1   10   T    A    2      1     1     2     0     2
chr1   10   T    G    3      2     1     2     0     0

I have done it with the following code:

awk 'BEGIN{FS=OFS="\t"}
{
  j=split($5,a,","); split($6,b,",");
  split($7,c,","); split($8,d,","); split($9,e,",");
  for(i=1;i<=j;++i) {
    $5=a[i]; $6=b[i]; $7=c[i]; $8=d[i]; $9=e[i]; print
  }
}'

But, as I have said before, there are 80 columns (or more) with comma-separated values in my real data.

Is there a way to do it using iteration?

Note: I need to do it in bash (not MySQL, SQL, python...)
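
For context, a rough sketch of how the hard-coded splits above could be replaced by a loop over the fields, under the same assumptions as the snippet (tab-separated input, and every comma-separated field on a row holding the same number of values); the answers below give complete solutions:

awk 'BEGIN{FS=OFS="\t"}
{
  # find how many values the comma-separated fields on this row hold
  n = 1
  for (i=5; i<=NF; i++) { k = split($i, tmp, ","); if (k > n) n = k }
  # emit one output row per value index
  for (j=1; j<=n; j++) {
    out = $1 OFS $2 OFS $3 OFS $4
    for (i=5; i<=NF; i++) {
      k = split($i, a, ",")
      out = out OFS (j <= k ? a[j] : a[1])   # reuse the first value if a field has fewer entries
    }
    print out
  }
}'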

  • Line 2, column 5 should be 2? – dibery Aug 29 '19 at 07:53
  • @dibery Yes, I was mistaken; I have just edited it. Thanks! – Marta_ma Aug 29 '19 at 07:56
  • Could there be more like this: `2,3,4` or `2,34`? – Jotne Aug 29 '19 at 07:58
  • @Jotne Both cases, but considering one row, the number of comma-separated values in a cell will be the same. It is not possible to have, in the same row, a column with `1,2,3` and another with `1,2` – Marta_ma Aug 29 '19 at 08:03
  • So you can have `1,2,3`? On your first line `2,2 1,1 0,1 1,2 1,0 2` you have a mix of values with and without commas. If there can be more than one comma, edit your post with an example. It will complicate things a lot, since we then have to test all fields for how many commas they may contain. – Jotne Aug 29 '19 at 08:23
  • @kvantour in OP's input only a range of fields needs to be split, and he/she says these fields all have the same number of commas in them – oguz ismail Aug 30 '19 at 07:48

2 Answers


This awk may do:

file:

chr1    10      T       A       2,2     1,1     0,1     1,2     1,0     2
chr1    10      T       G       3       2       1       2       0       0
chr1    10      T       C       5       1,2,3   4,2,1   7       1,8,3   3
chr1    10      T       D       1,2,3,5 4,2,1,8 1,8,3,2 3       5       7

Solution:

awk '{
    # find the largest number of comma-separated values in fields 5..NF
    n=0;
    for(i=5;i<=NF;i++) {
        t=split($i,a,","); if(t>n) n=t
    }
    # print one output line per value index, reusing the first value
    # when a field has fewer entries than the longest one
    for(j=1;j<=n;j++) {
        printf "%s\t%s\t%s\t%s",$1,$2,$3,$4;
        for(i=5;i<=NF;i++) {
            t=split($i,a,",");
            printf "\t%s",(j<=t ? a[j] : a[1])
        }
        print ""
    }
}' file
chr1    10      T       A       2       1       0       1       1       2
chr1    10      T       A       2       1       1       2       0       2
chr1    10      T       G       3       2       1       2       0       0
chr1    10      T       C       5       1       4       7       1       3
chr1    10      T       C       5       2       2       7       8       3
chr1    10      T       C       5       3       1       7       3       3
chr1    10      T       D       1       4       1       3       5       7
chr1    10      T       D       2       2       8       3       5       7
chr1    10      T       D       3       1       3       3       5       7
chr1    10      T       D       5       8       2       3       5       7

Your test input gives:

chr1    10      T       A       2       1       0       1       1       2
chr1    10      T       A       2       1       1       2       0       2
chr1    10      T       G       3       2       1       2       0       0

It does not matter whether the comma-separated values sit in consecutive fields, as long as you do not mix fields with different numbers of values (e.g. two and three) on the same line.
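
For illustration (a made-up line, not part of the question's data): piping the single line chr1 10 T E 1,2 3,4,5 6 through the script above would give

chr1    10      T       E       1       3       6
chr1    10      T       E       2       4       6
chr1    10      T       E       1       5       6

where the first comma-separated field falls back to its first value on the last row, which is usually not what you want.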

  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/198714/discussion-on-answer-by-jotne-how-can-i-split-comma-separated-values-into-multip). – Samuel Liew Aug 30 '19 at 09:17

Here is another awk. In contrast to the previous solution, where we split fields into arrays, this one attacks the problem differently using substitutions. There is no field iteration going on:

awk '
BEGIN { OFS="\t" }
     { $1=$1; t=$0 }                                 # rebuild the record with tab separators, keep a copy in t
     { while (index($0,",")) {
         gsub(/,[[:alnum:],]*/,""); print            # keep only the first value of every comma-separated field
         $0=t; gsub(OFS "[[:alnum:]]*,",OFS); t=$0   # drop the first value of every field, keep the remainder
       }
       print t
     }' file

How does it work? The idea is based on two types of substitutions:

  1. `gsub(/,[[:alnum:],]*/,"")`: this removes all substrings made of alphanumeric characters and commas that start with a comma: 1,2,3,4 -> 1. Fields without a comma are left unchanged.
  2. `gsub(OFS "[[:alnum:]]*,",OFS)`: this removes the alphanumeric characters followed by a single comma at the beginning of a field: 1,2,3,4 -> 2,3,4

So, using these two substitutions, we iterate until no comma is left. See How can you tell which characters are in which character classes? for details on `[[:alnum:]]`.
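
As a minimal demonstration (not part of the original answer, using a made-up record), the two substitutions behave like this:

printf 'chr1\t10\tT\tA\t1,2,3\t4,5,6\n' |
awk 'BEGIN{OFS="\t"} {
  $1=$1; t=$0
  gsub(/,[[:alnum:],]*/,""); print      # first pass keeps only the first values: chr1 10 T A 1 4
  $0=t; gsub(OFS "[[:alnum:]]*,",OFS)   # dropping them leaves the remainder:     chr1 10 T A 2,3 5,6
  print
}'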

input:

chr1    10      T       A       2,2     1,1     0,1     1,2     1,0     2
chr1    10      T       G       3       2       1       2       0       0
chr1    10      T       C       5       1,2,3   4,2,1   7       1,8,3   3
chr1    10      T       D       1,2,3,5 4,2,1,8 1,8,3,2 3       5       7

output:

chr1    10  T   A   2   1   0   1   1   2
chr1    10  T   A   2   1   1   2   0   2
chr1    10  T   G   3   2   1   2   0   0
chr1    10  T   C   5   1   4   7   1   3
chr1    10  T   C   5   2   2   7   8   3
chr1    10  T   C   5   3   1   7   3   3
chr1    10  T   D   1   4   1   3   5   7
chr1    10  T   D   2   2   8   3   5   7
chr1    10  T   D   3   1   3   3   5   7
chr1    10  T   D   5   8   2   3   5   7
  • Thanks for this different approach! @kvantour Which of these three different solutions is faster? In my real dataset I have hundreds of millions of rows and 130 columns – Marta_ma Aug 30 '19 at 09:09
  • @Marta_ma here is a combined answer to all your questions: **(A)** yes, the script handles numbers bigger than 9, but not floating-point numbers. **(B)** comma-separated strings with letters or numbers are handled equally, so yes, the third column can contain words. **(C)** this method will be faster than the split method, as a single `gsub` beats `NF-5` calls to `split`. – kvantour Aug 30 '19 at 09:38
  • It doesn't seem to work properly. I think it's because I don't always have a single letter in the third column; in some cases I have the word `None` in that field – Marta_ma Aug 30 '19 at 09:44
  • It happens when I have a row like this: `chr1 11 None,G None,T 2 1 0 3 2 2` – Marta_ma Aug 30 '19 at 10:18
  • Problem solved! I'm used to writing the FS like `awk -F"\t"`. I had changed it but forgot to add `OFS="\t"` at the end – Marta_ma Aug 30 '19 at 10:24
  • @Marta_ma Happy to hear that. If a tab is your initial field separator, you can use `BEGIN{FS=OFS="\t"}`; see the sketch after these comments – kvantour Aug 30 '19 at 10:27
  • I'll try both options with my real data and choose the faster one! :) – Marta_ma Aug 30 '19 at 10:30
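
For completeness, a minimal sketch (not part of the original answer) of the substitution approach with the field separators set explicitly, as suggested in the comments above; it assumes the real data is tab-separated:

awk '
BEGIN { FS=OFS="\t" }
     { $1=$1; t=$0 }
     { while (index($0,",")) {
         gsub(/,[[:alnum:],]*/,""); print
         $0=t; gsub(OFS "[[:alnum:]]*,",OFS); t=$0
       }
       print t
     }' file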