I have a large text file like this:
#RefName Pos Coverage
lcl|LGDX01000053.1_cds_KOV95322.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 0 0
lcl|LGDX01000053.1_cds_KOV95322.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 1 0
lcl|LGDX01000053.1_cds_KOV95322.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 2 1
lcl|LGDX01000053.1_cds_KOV95323.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 3 0
lcl|LGDX01000053.1_cds_KOV95323.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 4 0
lcl|LGDX01000053.1_cds_KOV95324.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 5 0
lcl|LGDX01000053.1_cds_KOV95324.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 6 101
lcl|LGDX01000053.1_cds_KOV95325.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 7 10
lcl|LGDX01000053.1_cds_KOV95325.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 8 0
The first row is the header, which can be ignored or deleted. I have two separate goals:
1) I want to extract all rows with the value in the last column not 0. 2) I want to group by the first column, and in the grouped file: delete the 2nd column, and sum the last column.
I know how to do these in pandas, but the file is >10G, loading into pandas itself is painful.
Is there a clean way to do these? Like using bash or awk something?
Thank you!