I am trying to import a tab-delimited sam file as a pandas dataframe.
NB501670:42:HJL7WAFXX:1:11209:17120:18358 83 chr1 13182 0 86M = 13178 -90 CAGCTGTAACTCAAAGCCTTAGCCTCTGTTCCCACGAAGGCAGGGCCATCGGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTG EEEAEEE6EEEAE//EEAEEEAAA/EEEEEAEAEEEEEAEEEEEEE//EEAAEAEEEEEEEEEAEEEEEE/EEEEEEEEEEAEEEE MC:Z:7S90M20S MD:Z:50A35 RG:Z:Sample NM:i:1 AS:i:81 XS:i:81 RX:Z:TCCAAGAA
NB501670:42:HJL7WAFXX:3:11411:9444:15777 83 chr1 19434 0 20M = 19335 -119 GGTGGAGGGGCTGCAGACTC AEAAE/EEEEE/AEEAEE/E MC:Z:20S39M MD:Z:20 RG:Z:Sample NM:i:0 AS:i:20 XS:i:20 RX:Z:TACTCTTC
NB501670:42:HJL7WAFXX:1:11212:2247:4550 99 chr1 22984 0 115M8S = 22984 115 TCTTCCCTAGGTGTCCCTCGGGCACATTTAGCACAAAGATAAGCACAAAAGGTGCATCCAGCACTTTGTTACTATTGGTGGCAGGTTTATGAATGGCAACCAAAGGCAGTGTACGTCCTCACT EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEAEEAEEEAEEEAEEAEEEEAEEEEEEEEEEEAEEEEAEEEAAAEEE XA:Z:chr9,+23097,115M8S,0;chr19,+64592,115M8S,0;chr15,-102508066,8S115M,1;chr2,-114347920,8S115M,3;chr12,-80579,8S115M,3; MC:Z:18S115M8S MD:Z:115 RG:Z:Sample NM:i:0 AS:i:115 XS:i:115 RX:Z:TCTCATCT
NB501670:42:HJL7WAFXX:3:11508:18628:11422 99 chr1 22984 0 115M8S = 22984 115 TCTTCCCTAGGTGTCCCTCGGGCACATTTAGCACAAAGATAAGCACAAAAGGTGCATCCAGCACTTTGTTACTATTGGTGGCAGGTTTATGAATGGCAACCAAAGGCAGTGTACGTCCTCACT EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEE/EEEEEEAAEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEA/EAAAEAAA XA:Z:chr19,+64592,115M8S,0;chr9,+23097,115M8S,0;chr15,-102508066,8S115M,1;chr2,-114347920,8S115M,3;chr12,-80579,8S115M,3; MC:Z:18S115M8S MD:Z:115 RG:Z:Sample NM:i:0 AS:i:115 XS:i:115 RX:Z:TCTCATCT
NB501670:42:HJL7WAFXX:2:21203:5598:10862 83 chr1 25804 0 130M = 25783 -151 AGTGGGGCCCTTGGTTGCAACACAAGTAGGTGGGGATGGATGAGTGTGGCATGAAGGGCCTAGGAGATTTCACTTGGGTTTAAAATGCTGTGACCTTGAGTAAGTTGCCGTCTCTGAATCTGATCCTTTC EEEAAEEEEEEAEAEE<EEE/EEAEEEEEEEE/AAEEEEEEEEEEEEEEEE<EEEEEAEEAEEEEEEEEEEEEEEE<EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE XA:Z:chr19,-67412,130M,1;chr12,+77744,130M,2;chr15,+102505230,130M,2;chr9,-25917,130M,2;chr2,+114345085,130M,2; MC:Z:9S142M MD:Z:31A98 RG:Z:Sample NM:i:1 AS:i:125 XS:i:125 RX:Z:GTTCGATA
NB501670:42:HJL7WAFXX:1:21308:24556:17558 83 chr1 25843 0 5M1I111M = 25848 -111 ATGAGATGTGGCATGAAGGCCCTAGGAGATTTCACTTGGGTTTAAAATGCTGTGACCTTGAGTAAGTTTCCGTCTCTGAATCTGATCCTTTCGATTTCCCATTCTCCAAACTGAGAA AA<E<EEAA<A<<EE6A/AEEAEAAAEAEEAAEEEAAA6EEEEAEEEEEEEA<EEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEE XA:Z:chr9,-25956,5M1I111M,3;chr19,-67451,5M1I111M,3;chr12,+77719,111M1I5M,4;chr15,+102505205,111M1I5M,4; MC:Z:8S111M20S MD:Z:18G48G48 RG:Z:Sample NM:i:3 AS:i:101 XS:i:101 RX:Z:TGTGGTAT
Below is my code to read in the file as a dataframe.
InSamFile = r'truq_chr1_10M.R1.sorted.txt'
max_n=22
df = pd.read_csv(InSamFile, sep='\t',comment='#', dtype=str, names=range(max_n))
df.head()
The code above imports the file as shown here:
In the dataframe, the initial 11 columns are available for all the rows, so they are imported correctly. However, in some rows, when the MC:Z:xxxxx tag is present, it mixes in with the MD:Z:xxxx tag in that column. Therefore, some columns get shifted during the import.
Can you suggest how to read_csv while checking for the start of the column table, e.g. if it starts with MD values then put all the values in MD column, and the same when it starts with RG, NM, etc. and when no value with the specific tag is found, then put NA or keep it empty? First 12 columns can be skipped for such check since they are always present for all the rows in the correct order. This way the column with the MC tag will be empty for most rows.
Any suggestion to be implemented while reading the file or later while processing the dataframe will be appreciated. I can do this using awk by reading all the columns one by one and match the start of the column if MD/MC/etc. and assign accordingly. But I am new to python and looking for some help.
Amit