
I am working on a series of tab-delimited files which have a slightly odd structure. They are created with the bam-readcount package and contain sequence data and variant calls for each position in a short read of DNA sequence.

At some positions, there are no variant calls, at others there can be many. The number of tabs/columns in each row depends on the number of variant calls made (each variant will occupy a new column). For example:

234    A    3bp_del    4bp_ins
235    G
236    G.   15bp_ins   3bp_del    5bp_del

The difficulty arises when parsing the file with pandas using:

import pandas as pd
df = pd.read_csv(FILE, sep='\t')

This returns an error message:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 5

The error occurs because pandas determines the number of columns it expects to see from the number of fields in the first row. I have a clumsy workaround, which appends a header row with multiple columns to the file before parsing, but it always appends the same fixed number of headers, which breaks as soon as a row contains more variant calls than the header allows for. Example:

Pos    Ref  Call1      Call2       Call3
234    A    3bp_del    4bp_ins
235    G
236    G.   15bp_ins   3bp_del    5bp_del

I'm looking for a way to count the number of tabs in the row with the greatest number of columns so that I can write a script to append that many column headers to the first line of each CSV file before parsing.

Ian Tully
  • Possible duplicate of [Python Pandas Error tokenizing data](https://stackoverflow.com/questions/18039057/python-pandas-error-tokenizing-data) - please see if any of the answers there can help you out. You can easily find similar post by pasting **pandas.errors.ParserError: Error tokenizing data** into the search on SO. – Patrick Artner Apr 25 '18 at 11:44

1 Answer


To count the number of fields in a line you could use a regex to count the non-whitespace blocks of text in each line and, at the end, take the maximum (this assumes that no field contains internal whitespace):

import re

# one field = one run of non-whitespace characters
column_counter = re.compile(r'\S+')

columns = []

with open(yourfile, 'r') as dna_file:
    for line in dna_file:
        columns.append(len(column_counter.findall(line)))

max_col_nr = max(columns)
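
Since the file is tab-delimited, a slightly more literal alternative is to count the tab characters themselves; this is a minimal sketch, assuming no field ever contains a literal tab:

with open(yourfile, 'r') as dna_file:
    # each tab separates two fields, so fields = tabs + 1
    max_col_nr = max(line.count('\t') + 1 for line in dna_file)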

There is also no need to add a header to the csv file itself. You can overcome this by naming the columns when loading the file:

col_names = ['col_' + str(i) for i in range(max_col_nr)]

your_dataframe = pd.read_csv(yourfile, sep='\t', names=col_names)
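
If you prefer the headers from your example, a small variation (assuming the first two columns are always the position and the reference base) would be:

col_names = ['Pos', 'Ref'] + ['Call' + str(i) for i in range(1, max_col_nr - 1)]
your_dataframe = pd.read_csv(yourfile, sep='\t', names=col_names)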

And if memory is not an issue you could also store each row in a list and convert that list to a dataframe, so the file does not need to be read twice:

import re
import pandas as pd

rows = []

with open(yourfile, 'r') as dna_file:
    for line in dna_file:
        # split each line into its non-whitespace fields
        rows.append(re.findall(r'\S+', line))

# shorter rows are padded with None, so the frame is rectangular
dna_data = pd.DataFrame(rows)
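
If you then want named columns on this DataFrame, one way (the col_ prefix is just an illustrative naming scheme) is:

dna_data.columns = ['col_' + str(i) for i in range(dna_data.shape[1])]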
J.vdS