
I have a large text file like this:

#RefName    Pos Coverage
lcl|LGDX01000053.1_cds_KOV95322.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS]   0   0
lcl|LGDX01000053.1_cds_KOV95322.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS]   1   0
lcl|LGDX01000053.1_cds_KOV95322.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS]   2   1
lcl|LGDX01000053.1_cds_KOV95323.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS]   3   0
lcl|LGDX01000053.1_cds_KOV95323.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS]   4   0
lcl|LGDX01000053.1_cds_KOV95324.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS]   5   0
lcl|LGDX01000053.1_cds_KOV95324.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS]   6   101
lcl|LGDX01000053.1_cds_KOV95325.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS]   7   10
lcl|LGDX01000053.1_cds_KOV95325.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS]   8   0

The first row is the header, which can be ignored or deleted. I have two separate goals:

1) I want to extract all rows where the value in the last column is not 0.

2) I want to group by the first column, and in the grouped file: delete the 2nd column and sum the last column.

I know how to do these in pandas, but the file is >10 GB and just loading it into pandas is painful.
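For reference, this is roughly what I would do in pandas if the file were small (sep='\t' is an assumption based on how the columns look, and the file name is a placeholder):

import pandas as pd

# fine for a small file; reading 10+ GB at once is exactly the problem
df = pd.read_csv('large_text_file.txt', sep='\t', skiprows=1,
                 names=['RefName', 'Pos', 'Coverage'])

nonzero = df[df['Coverage'] != 0]               # goal 1
sums = df.groupby('RefName')['Coverage'].sum()  # goal 2 (Pos is dropped)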

Is there a clean way to do these, e.g. using bash or awk?

Thank you!

stevex
  • Depending on your delimiter (whitespace?), pulling records != 0 is easy, and deleting c2 is easy. What do you mean by "group by the first column"? Are we ignoring all the other columns but keeping them in the output? Are they always the same? – Paul Hodges Nov 30 '18 at 21:54
  • I would use awk. This is not an appropriate question for SO. Please research the different ways to accomplish the task, attempt a solution, and post here if you run into trouble. –  Nov 30 '18 at 22:34

2 Answers


A simple approach in vanilla Python would be to read the file and process it line by line according to a hardcoded format (assuming tab-separated columns, since the first column itself contains spaces). The filtered rows for goal 1 are written as we go, and a running per-group sum for goal 2 is written out at the end:

from collections import defaultdict

group_sums = defaultdict(int)  # running sum of the last column per first-column key (goal 2)
with open('filtered_file.txt', 'w') as filtered_file, \
     open('large_text_file.txt', 'r') as infile:
    next(infile)  # skip the header row
    for line in infile:
        line_items = line.rstrip('\n').split('\t')  # assuming tab-delimited columns
        if int(line_items[-1]) == 0:  # ignore the line if the last value is 0
            continue
        group_sums[line_items[0]] += int(line_items[-1])  # sum the last column per group
        # write the line out with the second (Pos) column removed
        filtered_file.write('\t'.join(line_items[0:1] + line_items[2:]) + '\n')

with open('groups_file.txt', 'w') as groups_file:
    for key, total in group_sums.items():  # one line per group: key and summed coverage
        groups_file.write(f'{key}\t{total}\n')

Python's file iteration doesn't read the whole file at once (we read only one line at a time, and the previous line gets garbage-collected), so this shouldn't take up much memory. The only thing held across lines is the dictionary of per-group sums, which stays small as long as the number of distinct groups is manageable. Writing works the same way: each filtered line goes straight from the input file to the output file, so nothing has to be buffered in full.

This would, of course, be slower than batch-processing the entire file in memory, but trading speed for space has always been one of the main tradeoffs in computing.

Green Cloak Guy
$ awk 'NR>1 && $NF {a[$1]+=$NF} 
       END         {for(k in a) print k, a[k]}' file

lcl|LGDX01000053.1_cds_KOV95325.1_1 10
lcl|LGDX01000053.1_cds_KOV95324.1_1 101
lcl|LGDX01000053.1_cds_KOV95322.1_1 1

Since we're not matching the other columns, we can't be sure they are all identical within a group, so summarizing the data this way keeps only the key and the aggregated value.

Explanation

Look up awk syntax for the basics; for this script:

NR>1 && $NF skips the header (NR==1) and lines whose last field is zero

{a[$1]+=$NF} sums the last field into array a, keyed by the first field

END runs after all input has been read

{for(k in a) print k, a[k]} prints all key/value pairs
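For the first goal on its own (extracting every row whose last column is not 0, with the header dropped), the same idea collapses to a one-liner; nonzero_rows.txt is just a placeholder name:

$ awk 'NR>1 && $NF != 0' file > nonzero_rows.txt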

karakfa