
I am trying to read data from a TSV file, but the file is of the form:

nm0007219   Donald Cook tt0042819
nm0457839   John Kitzmiller tt0045018   tt0042692
nm0777743   Karl Schwetter  tt0043483   tt0049422   tt0044322   tt0047989

I get the error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 2, saw 5

My current code looks like this:

TSV_file = pd.read_csv(filename, sep='\t', header=None)

Ultimately, my goal is to find the number of edges in the dataset.

  • What is your expected dataframe after `read_csv`? – Jab Oct 19 '21 at 14:03
  • What defines an edge in your dataset? My initial thought here is that a tabular data structure like a dataframe is not what you should be using. You should probably read the file line by line using the csv module to build some kind of adjacency list (see the sketch after these comments). – el_oso Oct 19 '21 at 14:16
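As el_oso's comment suggests, one way to sidestep the parser error entirely is to read the ragged rows with the standard-library csv module. A minimal sketch, assuming `filename` points at the TSV file and that each tt... value in a row is one edge from the person to a title:

import csv

adjacency = {}                       # person id -> list of title ids
num_edges = 0
with open(filename, newline='') as f:
    for row in csv.reader(f, delimiter='\t'):
        person, name, *titles = row  # ragged rows: any number of titles
        adjacency[person] = titles
        num_edges += len(titles)

print(num_edges)  # 7 for the three sample rows above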

1 Answer


You could define column headings as follows:

import pandas as pd

# Explicit column names make pandas allocate enough columns for the
# widest row; shorter rows are simply padded with NaN.
names = ['nm', 'Name', *(f'tt{i:02}' for i in range(1, 6))]
df_tsv = pd.read_csv('input.tsv', sep='\t', header=None, names=names)
print(df_tsv)

This would give you the following dataframe:

          nm             Name       tt01       tt02       tt03       tt04  tt05
0  nm0007219      Donald Cook  tt0042819        NaN        NaN        NaN   NaN
1  nm0457839  John Kitzmiller  tt0045018  tt0042692        NaN        NaN   NaN
2  nm0777743   Karl Schwetter  tt0043483  tt0049422  tt0044322  tt0047989   NaN

You can either set the range to the largest number of tt... entries you expect in any row, or set it to a generous overestimate and then drop the columns that are entirely empty:

df_tsv = df_tsv.dropna(axis=1, how='all')   # remove empty columns
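Since the stated goal is to count edges, a minimal sketch, assuming each non-null tt... value represents one person-to-title edge:

tt_cols = [c for c in df_tsv.columns if c.startswith('tt')]
num_edges = int(df_tsv[tt_cols].notna().sum().sum())
print(num_edges)  # 7 for the sample data above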
– Martin Evans
  • This looks like it would work but requires the user to know the maximum number of columns to expect. The comments on [this answer about handling a variable number of columns](https://stackoverflow.com/a/15252012/5906389) give a technique that reads each line of the file to find the maximum column count: `num_cols = max(len(line.split(',')) for line in f)` where `f` is the file (sketched after these comments). It does require reading the file twice. – jslatane Oct 19 '21 at 14:16
  • 1
    I am guessing it would be much quicker to just overestimate the number of columns and discard them later using pandas rather than reading a file twice. Probably ok for small files though. – Martin Evans Oct 19 '21 at 14:22
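For completeness, a rough sketch of the two-pass approach from jslatane's comment, adapted to split on tabs since the file is tab-separated:

import pandas as pd

with open('input.tsv') as f:
    num_cols = max(len(line.split('\t')) for line in f)  # widest row

# Two of the columns are 'nm' and 'Name', leaving num_cols - 2 tt columns.
names = ['nm', 'Name', *(f'tt{i:02}' for i in range(1, num_cols - 1))]
df_tsv = pd.read_csv('input.tsv', sep='\t', header=None, names=names)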