I am working on a series of tab-delimited files which have a slightly odd structure. They are created with the bam-readcount package and contain sequence data and variant calls for each position in a short read of DNA sequence.
At some positions there are no variant calls; at others there can be many. The number of tabs/columns in each row depends on the number of variant calls made (each variant occupies a new column). For example:
234 A 3bp_del 4bp_ins
235 G
236 G. 15bp_ins 3bp_del 5bp_del
The difficulty arises when parsing the file with pandas using:
import pandas as pd
df = pd.read_csv(FILE, sep='\t')
This returns an error message:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 5
The error occurs because pandas determines the number of columns it expects from the number of columns in the first row. I have a clumsy workaround, which adds a header line with multiple columns to the top of the file before parsing, but it always adds the same number of headers, so it fails whenever a row has more columns than the header. Example:
Pos Ref Call1 Call2 Call3
234 A 3bp_del 4bp_ins
235 G
236 G. 15bp_ins 3bp_del 5bp_del
I'm looking for a way to count the number of tabs in the row with the greatest number of columns, so that I can write a script that adds that many column headers to the first line of each file before parsing.
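To make the goal concrete, here is a rough sketch of the kind of thing I have in mind. It assumes the files have no header line of their own and are small enough to scan twice. The helper names `max_field_count` and `read_ragged` and the `Pos`/`Ref`/`CallN` column names are just placeholders I made up. Instead of rewriting the file, it passes the generated names straight to `read_csv` through its `names` argument:

```python
import pandas as pd


def max_field_count(path, sep="\t"):
    """Return the largest number of sep-delimited fields on any non-blank line."""
    with open(path) as handle:
        return max(
            (line.rstrip("\n").count(sep) + 1 for line in handle if line.strip()),
            default=0,
        )


def read_ragged(path, sep="\t"):
    """Read a file whose rows have differing numbers of fields."""
    n_cols = max_field_count(path, sep)
    # Hypothetical column names: position, reference base, then one per call.
    names = ["Pos", "Ref"] + [f"Call{i}" for i in range(1, n_cols - 1)]
    # header=None: the file is assumed to contain data rows only.
    # Rows with fewer fields than names are padded with NaN.
    return pd.read_csv(path, sep=sep, header=None, names=names[:n_cols])
```

With the three example rows above, `max_field_count` should report 5, and the shorter rows would get NaN in the unused call columns.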