
I am trying to read data from a TSV file, but the file is of the form:

nm0007219   Donald Cook tt0042819
nm0457839   John Kitzmiller tt0045018   tt0042692
nm0777743   Karl Schwetter  tt0043483   tt0049422   tt0044322   tt0047989

I get the error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 2, saw 5

My current code looks like this:

TSV_file = pd.read_csv(filename, sep='\t', header=None)

Ultimately, my goal is to find the number of edges in the dataset.

  • What is your expected dataframe after `read_csv`? – Jab Oct 19 '21 at 14:03
  • What defines an edge in your dataset? My initial thought here is that a tabular data structure like a dataframe is not what you should be using. You should probably read the file line by line using the csv module to build some kind of adjacency list (see the sketch after these comments). – el_oso Oct 19 '21 at 14:16
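As el_oso's comment suggests, one way to sidestep the parser error entirely is to read the ragged rows with the standard-library csv module. A minimal sketch, assuming `filename` points at the TSV file and that each tt... value in a row is one edge from the person to a title:

import csv

adjacency = {}                       # person id -> list of title ids
num_edges = 0
with open(filename, newline='') as f:
    for row in csv.reader(f, delimiter='\t'):
        person, name, *titles = row  # ragged rows: any number of titles
        adjacency[person] = titles
        num_edges += len(titles)

print(num_edges)  # 7 for the three sample rows above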

1 Answer


You could define column headings as follows:

import pandas as pd

# Explicit column names make pandas allocate enough columns for the
# widest row; shorter rows are simply padded with NaN.
names = ['nm', 'Name', *(f'tt{i:02}' for i in range(1, 6))]
df_tsv = pd.read_csv('input.tsv', sep='\t', header=None, names=names)
print(df_tsv)

This would give you the following dataframe:

          nm             Name       tt01       tt02       tt03       tt04  tt05
0  nm0007219      Donald Cook  tt0042819        NaN        NaN        NaN   NaN
1  nm0457839  John Kitzmiller  tt0045018  tt0042692        NaN        NaN   NaN
2  nm0777743   Karl Schwetter  tt0043483  tt0049422  tt0044322  tt0047989   NaN

You can either set the range to the largest number of tt... entries you expect in any row, or set it to a generous overestimate and then drop the columns that are entirely empty:

df_tsv = df_tsv.dropna(axis=1, how='all')   # remove empty columns
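Since the stated goal is to count edges, a minimal sketch, assuming each non-null tt... value represents one person-to-title edge:

tt_cols = [c for c in df_tsv.columns if c.startswith('tt')]
num_edges = int(df_tsv[tt_cols].notna().sum().sum())
print(num_edges)  # 7 for the sample data above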
– Martin Evans
  • This looks like it would work but requires the user to know the maximum number of columns to expect. The comments on [this answer about handling a variable number of columns](https://stackoverflow.com/a/15252012/5906389) give a technique that reads each line of the file to find the maximum column count: `num_cols = max(len(line.split(',')) for line in f)` where `f` is the file (sketched after these comments). It does require reading the file twice. – jslatane Oct 19 '21 at 14:16
  • 1
    I am guessing it would be much quicker to just overestimate the number of columns and discard them later using pandas rather than reading a file twice. Probably ok for small files though. – Martin Evans Oct 19 '21 at 14:22
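For completeness, a rough sketch of the two-pass approach from jslatane's comment, adapted to split on tabs since the file is tab-separated:

import pandas as pd

with open('input.tsv') as f:
    num_cols = max(len(line.split('\t')) for line in f)  # widest row

# Two of the columns are 'nm' and 'Name', leaving num_cols - 2 tt columns.
names = ['nm', 'Name', *(f'tt{i:02}' for i in range(1, num_cols - 1))]
df_tsv = pd.read_csv('input.tsv', sep='\t', header=None, names=names)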