
We want to subset a text file on rows and columns, where the row and column numbers are read from files, excluding the header (row 1) and row names (col 1).

inputFile.txt: tab-delimited text file

header  62  9   3   54  6   1
25  1   2   3   4   5   6
96  1   1   1   1   0   1
72  3   3   3   3   3   3
18  0   1   0   1   1   0
82  1   0   0   0   0   1
77  1   0   1   0   1   1
15  7   7   7   7   7   7
82  0   0   1   1   1   0
37  0   1   0   0   1   0
18  0   1   0   0   1   0
53  0   0   1   0   0   0
57  1   1   1   1   1   1

subsetCols.txt: comma-separated with no spaces, one row, numbers ordered. In the real data we have 500K columns and need to subset ~10K.

1,4,6

subsetRows.txt: comma-separated with no spaces, one row, numbers ordered. In the real data we have 20K rows and need to subset ~300.

1,3,7

Current solution using cut and awk loop (Related post: Select rows using awk):

# define vars
fileInput=inputFile.txt
fileRows=subsetRows.txt
fileCols=subsetCols.txt
fileOutput=result.txt

# cut columns and awk rows
cut -f2- $fileInput | cut -f`cat $fileCols` | sed '1d' | awk -v s=`cat $fileRows` 'BEGIN{split(s, a, ","); for (i in a) b[a[i]]} NR in b' > $fileOutput

Output file: result.txt

1   4   6
3   3   3
7   7   7

Question:
This solution works fine for small files, but for bigger files (50K rows and 200K columns) it takes too long: 15 minutes plus and still running. I think cutting the columns works fine; selecting the rows is the slow bit.

Any better way?

Real input files info:

# $fileInput:
#        Rows = 20127
#        Cols = 533633
#        Size = 31 GB
# $fileCols: 12000 comma separated col numbers
# $fileRows: 300 comma separated row numbers

More information about the file: the file contains GWAS genotype data. Every row represents a sample (individual) and every column represents a SNP. For further region-based analysis we need to subset samples (rows) and SNPs (columns), to make the data more manageable (small) as input for other statistical software.

System:

$ uname -a
Linux nYYY-XXXX ZZZ Tue Dec 18 17:22:54 CST 2012 x86_64 x86_64 x86_64 GNU/Linux

Update: The solution provided below by @JamesBrown was mixing the order of the columns on my system, as I am using a different version of awk; my version is GNU Awk 3.1.7.

zx8754

5 Answers


Even though in "If programming languages were countries, which country would each language represent?" they say that...

Awk: North Korea. Stubbornly resists change, and its users appear to be unnaturally fond of it for reasons we can only speculate on.

... whenever you see yourself piping sed, cut, grep, awk, etc., stop and say to yourself: awk can do it alone!

So in this case it is a matter of extracting the rows and columns (tweaking them to exclude the header and first column) and then just buffering the output to finally print it.

awk -v cols="1 4 6" -v rows="1 3 7" '
    BEGIN{
       split(cols,c); for (i in c) col[c[i]]  # extract cols to print
       split(rows,r); for (i in r) row[r[i]]  # extract rows to print
    }
    (NR-1 in row){
       for (i=2;i<=NF;i++) 
          (i-1) in col && line=(line ? line OFS $i : $i); # pick columns
       print line; line=""                                # print them
    }' file

With your sample file:

$ awk -v cols="1 4 6" -v rows="1 3 7" 'BEGIN{split(cols,c); for (i in c) col[c[i]]; split(rows,r); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) (i-1) in col && line=(line ? line OFS $i : $i); print line; line=""}' file
1 4 6
3 3 3
7 7 7

With your sample file, and inputs as variables, split on comma:

awk -v cols="$(<$fileCols)" -v rows="$(<$fileRows)" 'BEGIN{split(cols,c, /,/); for (i in c) col[c[i]]; split(rows,r, /,/); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) (i-1) in col && line=(line ? line OFS $i : $i); print line; line=""}' $fileInput

I am quite sure this will be way faster. You can, for example, check Remove duplicates from text file based on second text file for some benchmarks comparing the performance of awk against grep and others.

fedorqui

Here is one for GNU awk version 4.0 or later, as the column ordering relies on for and PROCINFO["sorted_in"]. The row and col numbers are read from files:

$ awk '
BEGIN {
    PROCINFO["sorted_in"]="@ind_num_asc";
}
FILENAME==ARGV[1] {                       # process rows file
    n=split($0,t,","); 
    for(i=1;i<=n;i++) r[t[i]]
} 
FILENAME==ARGV[2] {                       # process cols file
    m=split($0,t,","); 
    for(i=1;i<=m;i++) c[t[i]]
} 
FILENAME==ARGV[3] && ((FNR-1) in r) {     # process data file
    for(i in c) 
        printf "%s%s", $(i+1), (++j%m?OFS:ORS)
}' subsetRows.txt subsetCols.txt inputFile.txt   
1 4 6
3 3 3
7 7 7

Some performance gain could probably come from moving the ARGV[3] processing block to the top, before blocks 1 and 2, and adding a next to its end, as sketched below.
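
If it helps, here is an untested sketch of that reordering, keeping the same logic, file arguments and GNU awk requirement as above:

awk '
BEGIN {
    PROCINFO["sorted_in"]="@ind_num_asc";
}
FILENAME==ARGV[3] && ((FNR-1) in r) {     # data file first: most input lines land here
    for(i in c)
        printf "%s%s", $(i+1), (++j%m?OFS:ORS)
    next                                  # skip the two FILENAME tests below
}
FILENAME==ARGV[1] {                       # process rows file
    n=split($0,t,",");
    for(i=1;i<=n;i++) r[t[i]]
}
FILENAME==ARGV[2] {                       # process cols file
    m=split($0,t,",");
    for(i=1;i<=m;i++) c[t[i]]
}' subsetRows.txt subsetCols.txt inputFile.txt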

James Brown
  • Not really. Kind of interested to hear the performance if you're going to test it, though. Also, I wrote it in a couple of pieces in a couple of meetings so I'm just glad to hear if it works... QC and stuff. :D – James Brown Nov 28 '16 at 12:25
  • Note, however, that using `for (i in c)` may not provide the correct order: [Why does awk seem to randomize the array?](http://stackoverflow.com/a/22504503/1983854) – fedorqui Nov 28 '16 at 12:55
  • You were right, the rows and cols were mixed up, sorry about that. Also, `OFS` is space now. – James Brown Nov 28 '16 at 12:55
  • @JamesBrown good one! Then +1, well done! This has to be faster than my solution, since it does not go through all columns checking if they have to be printed or not; instead, it just picks them up. – fedorqui Nov 28 '16 at 12:57
  • I skip first row and col in `((FNR-1) in r)` and `$(i+1)`. – James Brown Nov 28 '16 at 13:12
  • By the way, James: shouldn't `PROCINFO["sorted_in"]="@ind_num_asc";` be set in the `BEGIN` block? I am curious if setting it just when reading the first file may affect the others. – fedorqui Nov 28 '16 at 13:17
  • Can you please post another sample of data where row count != column count. I'm kind of curious why it doesn't work. – James Brown Nov 28 '16 at 14:52
  • Funny, I'm getting the same result on my solution over here as is @fedorqui's in the question. You are using Gnu awk, right? The solution depends on it as I'm using `PROCINFO["sorted_in"]="@ind_num_asc"` to order the columns output. – James Brown Nov 28 '16 at 15:38
  • Apparently that functionality is only available in later versions, 4.0 was mentioned somewhere. – James Brown Nov 28 '16 at 15:49

Not to take anything away from the two excellent answers above, but because this problem involves a large set of data, I am posting a combination of the two answers to speed up the processing.

awk -v cols="$(<subsetCols.txt)" -v rows="$(<subsetRows.txt)" '
BEGIN {
   n = split(cols, c, /,/)
   split(rows, r, /,/)
   for (i in r)
      row[r[i]]
}
(NR-1) in row {
   for (i=1; i<=n; i++)
      printf "%s%s", $(c[i]+1), (i<n?OFS:ORS)
}' inputFile.txt

PS: This should work with older awk versions or non-GNU awk as well.

anubhava
  • Oh, cool, you use the return value of split to know the number of columns to check and hence use it when extracting the proper column numbers. This is very smart, well done. – fedorqui Nov 28 '16 at 20:07

To refine @anubhava's solution, we can avoid searching over 10K values for each row to see whether we are on the right row, by taking advantage of the fact that the input is already sorted:

awk -v cols="$(<subsetCols.txt)" -v rows="$(<subsetRows.txt)" '
BEGIN {
   n = split(cols, c, /,/)
   split(rows, r, /,/)
   j=1;
}
(NR-1) == r[j] { 
   j++
   for (i=1; i<=n; i++)
      printf "%s%s", $(c[i]+1), (i<n?OFS:ORS)
}' inputFile.txt
tomc
  • I don't know enough about awk to say if this post deserves to be a separate answer; to me it seems the changes are too minor and could have been a comment on @anubhava's post? – zx8754 Nov 28 '16 at 20:39
  • perhaps you should time it at scale – tomc Nov 28 '16 at 20:40

Python has a csv module. You read a row into a list, print the desired columns to stdout, rinse, wash, repeat.

This should slice columns 20,000 to 30,000.

import csv
with open('foo.txt') as f:
    # for the question's tab-delimited input, delimiter='\t' would be needed
    gwas = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
    for row in gwas:
        print(row[20001:30001])
RonJohn