Import tab delimited file as pandas dataframe when column orders in not fixed in each row

Question

I am trying to import a tab-delimited sam file as a pandas dataframe.

NB501670:42:HJL7WAFXX:1:11209:17120:18358   83  chr1    13182   0   86M =   13178   -90 CAGCTGTAACTCAAAGCCTTAGCCTCTGTTCCCACGAAGGCAGGGCCATCGGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTG  EEEAEEE6EEEAE//EEAEEEAAA/EEEEEAEAEEEEEAEEEEEEE//EEAAEAEEEEEEEEEAEEEEEE/EEEEEEEEEEAEEEE  MC:Z:7S90M20S   MD:Z:50A35  RG:Z:Sample NM:i:1  AS:i:81 XS:i:81 RX:Z:TCCAAGAA
NB501670:42:HJL7WAFXX:3:11411:9444:15777    83  chr1    19434   0   20M =   19335   -119    GGTGGAGGGGCTGCAGACTC    AEAAE/EEEEE/AEEAEE/E    MC:Z:20S39M MD:Z:20 RG:Z:Sample NM:i:0  AS:i:20 XS:i:20 RX:Z:TACTCTTC
NB501670:42:HJL7WAFXX:1:11212:2247:4550 99  chr1    22984   0   115M8S  =   22984   115 TCTTCCCTAGGTGTCCCTCGGGCACATTTAGCACAAAGATAAGCACAAAAGGTGCATCCAGCACTTTGTTACTATTGGTGGCAGGTTTATGAATGGCAACCAAAGGCAGTGTACGTCCTCACT EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEAEEAEEEAEEEAEEAEEEEAEEEEEEEEEEEAEEEEAEEEAAAEEE XA:Z:chr9,+23097,115M8S,0;chr19,+64592,115M8S,0;chr15,-102508066,8S115M,1;chr2,-114347920,8S115M,3;chr12,-80579,8S115M,3;   MC:Z:18S115M8S  MD:Z:115    RG:Z:Sample NM:i:0  AS:i:115    XS:i:115    RX:Z:TCTCATCT
NB501670:42:HJL7WAFXX:3:11508:18628:11422   99  chr1    22984   0   115M8S  =   22984   115 TCTTCCCTAGGTGTCCCTCGGGCACATTTAGCACAAAGATAAGCACAAAAGGTGCATCCAGCACTTTGTTACTATTGGTGGCAGGTTTATGAATGGCAACCAAAGGCAGTGTACGTCCTCACT EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEE/EEEEEEAAEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEA/EAAAEAAA XA:Z:chr19,+64592,115M8S,0;chr9,+23097,115M8S,0;chr15,-102508066,8S115M,1;chr2,-114347920,8S115M,3;chr12,-80579,8S115M,3;   MC:Z:18S115M8S  MD:Z:115    RG:Z:Sample NM:i:0  AS:i:115    XS:i:115    RX:Z:TCTCATCT
NB501670:42:HJL7WAFXX:2:21203:5598:10862    83  chr1    25804   0   130M    =   25783   -151    AGTGGGGCCCTTGGTTGCAACACAAGTAGGTGGGGATGGATGAGTGTGGCATGAAGGGCCTAGGAGATTTCACTTGGGTTTAAAATGCTGTGACCTTGAGTAAGTTGCCGTCTCTGAATCTGATCCTTTC  EEEAAEEEEEEAEAEE<EEE/EEAEEEEEEEE/AAEEEEEEEEEEEEEEEE<EEEEEAEEAEEEEEEEEEEEEEEE<EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE  XA:Z:chr19,-67412,130M,1;chr12,+77744,130M,2;chr15,+102505230,130M,2;chr9,-25917,130M,2;chr2,+114345085,130M,2; MC:Z:9S142M MD:Z:31A98  RG:Z:Sample NM:i:1  AS:i:125    XS:i:125    RX:Z:GTTCGATA
NB501670:42:HJL7WAFXX:1:21308:24556:17558   83  chr1    25843   0   5M1I111M    =   25848   -111    ATGAGATGTGGCATGAAGGCCCTAGGAGATTTCACTTGGGTTTAAAATGCTGTGACCTTGAGTAAGTTTCCGTCTCTGAATCTGATCCTTTCGATTTCCCATTCTCCAAACTGAGAA   AA<E<EEAA<A<<EE6A/AEEAEAAAEAEEAAEEEAAA6EEEEAEEEEEEEA<EEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEE   XA:Z:chr9,-25956,5M1I111M,3;chr19,-67451,5M1I111M,3;chr12,+77719,111M1I5M,4;chr15,+102505205,111M1I5M,4;    MC:Z:8S111M20S  MD:Z:18G48G48   RG:Z:Sample NM:i:3  AS:i:101    XS:i:101    RX:Z:TGTGGTAT

Below is my code to read in the file as a dataframe.

InSamFile = r'truq_chr1_10M.R1.sorted.txt'
max_n=22
df = pd.read_csv(InSamFile, sep='\t',comment='#', dtype=str, names=range(max_n))
df.head()

The code above imports the file as shown here:

In the dataframe, the initial 11 columns are available for all the rows, so they are imported correctly. However, in some rows, when the MC:Z:xxxxx tag is present, it mixes in with the MD:Z:xxxx tag in that column. Therefore, some columns get shifted during the import.

Can you suggest how to read_csv while checking for the start of the column table, e.g. if it starts with MD values then put all the values in MD column, and the same when it starts with RG, NM, etc. and when no value with the specific tag is found, then put NA or keep it empty? First 12 columns can be skipped for such check since they are always present for all the rows in the correct order. This way the column with the MC tag will be empty for most rows.

Any suggestion to be implemented while reading the file or later while processing the dataframe will be appreciated. I can do this using awk by reading all the columns one by one and match the start of the column if MD/MC/etc. and assign accordingly. But I am new to python and looking for some help.

Amit

With awk I can read each column in line and match for a string in below way. `awk -F'\t' -v OFS='\t' '{for (i=1; i<=NF; i++) if ($i ~ "^NM") NM=$i; for (i=1; i<=NF; i++) if ($i ~ "^RX") UMI=$i; for (i=1; i<=NF; i++) if ($i ~ "^MD") MD=$i; print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,NM,RX,MD}' $SampleID.R1.sorted.sam` — Amit Goyal, Dec 29 '17 at 07:52
Below is the link for file. [link](https://pusanackr-my.sharepoint.com/personal/amitgsir_pusan_ac_kr/_layouts/15/guestaccess.aspx?docid=0795ea2d25fb2480f80244f7cf929f835&authkey=AUUeNs9y52URdtIDKzey0K8&e=10cb73ecdcb14e138da7eca097dc09c8) — Amit Goyal, Dec 29 '17 at 08:12
Sorry, that link doesn't work. In any case, it would be better to edit your question to provide a few lines of the file that illustrate the problem. — Imran, Dec 29 '17 at 08:15
Sorry, I made the correct way for link. Please find in above comment. Thanks! [link](https://pusanackr-my.sharepoint.com/personal/amitgsir_pusan_ac_kr/_layouts/15/guestaccess.aspx?docid=0795ea2d25fb2480f80244f7cf929f835&authkey=AUUeNs9y52URdtIDKzey0K8&e=10cb73ecdcb14e138da7eca097dc09c8) — Amit Goyal, Dec 29 '17 at 08:16

Import tab delimited file as pandas dataframe when column orders in not fixed in each row

0 Answers0