
I have data in CSV files. I am separating the data into columns using a single tab character. Most of the rows just contain one tab character, like this:

A\tB

Some rows contain extra tabs at the end of the row, like this:

A\tB\t\t

Hence, if I do pd.read_csv(filePath, sep='\t'), then I get an error: ParserError: Error tokenizing data. C error: Expected 2 fields in line XXX, saw 4. That's because the rows with trailing tabs parse into 4 fields instead of 2.

So how can I ignore the tabs at the end of a row, if it contains extra tabs?

  • Mostly duplicate of [python - Pandas, read CSV ignoring extra commas - Stack Overflow](https://stackoverflow.com/questions/48668125/pandas-read-csv-ignoring-extra-commas) except the separator. – user202729 Dec 08 '21 at 12:10
  • Specify two extra columns (or just all four) when reading the data (with the `names` argument), then drop the last two columns after having read the dataframe. I *think* (not sure) that lines with just 2 columns will fill up the remaining columns with NaNs/Nones. – 9769953 Dec 08 '21 at 12:11
  • Does this answer your question? [Pandas, read CSV ignoring extra commas](https://stackoverflow.com/questions/48668125/pandas-read-csv-ignoring-extra-commas) – 9769953 Dec 08 '21 at 12:12
  • @user202729 Thanks, that seems to be a good duplicate. `usecols` had escaped my attention, until now. – 9769953 Dec 08 '21 at 12:13
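The approach suggested in the comments can be sketched as follows: pass `names` with extra placeholder columns so the ragged rows parse cleanly, and use `usecols` to keep only the real ones. The column names (`a`, `b`, `extra1`, `extra2`) are placeholders, not from the original question.

```python
import io
import pandas as pd

# Sample data mimicking the question: one clean row, one with trailing tabs.
data = "A\tB\nA\tB\t\t\n"

df = pd.read_csv(
    io.StringIO(data),
    sep="\t",
    header=None,
    names=["a", "b", "extra1", "extra2"],  # extra columns absorb trailing tabs
    usecols=["a", "b"],                    # keep only the real columns
)
print(df)
```

Rows without trailing tabs simply leave the placeholder columns as NaN, which `usecols` then discards, so both row shapes parse without a ParserError.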

1 Answer


Use io.StringIO to clean the file before parsing:

import pandas as pd
import io

with open('data.txt') as table:
    # Strip trailing tabs (and any other surrounding whitespace) from each
    # line, then hand the cleaned text to pandas via an in-memory buffer.
    buffer = io.StringIO('\n'.join(line.strip() for line in table))
    df = pd.read_table(buffer, header=None)

Output:

>>> df
   0  1
0  A  B
1  A  B
Corralien