35

What is the best approach for importing a CSV file that has a different number of columns in each row into a Pandas DataFrame, using Pandas or the csv module?

"H","BBB","D","Ajxxx Dxxxs"
"R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"

Using this code:

import pandas as pd
data = pd.read_csv("smallsample.txt",header = None)

the following error is generated

Error tokenizing data. C error: Expected 4 fields in line 2, saw 8
Erich
    There may be a pandas way to get around this, but you could also write a quick script to append some `,""` for each column that a line is lacking. By lacking, I mean compared to the row with the most columns. – Nicholas Flees Nov 19 '14 at 15:14
  • @NicholasFlees: using `range(n)` in `pd.read_csv()` function, where **n** is the number of columns required does the job, without needing to append `,""` empty strings equivalent to number of columns required – aspiring1 Apr 13 '22 at 12:22

7 Answers

37

Supplying a list of column names to `read_csv()` should do the trick.

ex: names=['a', 'b', 'c', 'd', 'e']

https://github.com/pydata/pandas/issues/2981

Edit: if you don't want to supply column names, then do what Nicholas suggested in the comments.
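A minimal sketch of that approach, writing the two sample rows from the question to a file first (the column letters are just placeholders):

```python
import pandas as pd

# Write the two sample rows from the question so the sketch is runnable
sample = '"H","BBB","D","Ajxxx Dxxxs"\n"R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"\n'
with open("smallsample.txt", "w") as f:
    f.write(sample)

# Supply one name per column of the widest row (8 fields here)
df = pd.read_csv("smallsample.txt", header=None,
                 names=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.shape)  # (2, 8); the short first row is padded with NaN
```

Rows shorter than the names list are filled with NaN, and the quoted comma in `"spxxt rixxls, raxxxd"` is parsed correctly because read_csv honors quoting.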

Bob Haffner
26

You can dynamically generate column names as simple counters (0, 1, 2, etc.).

Dynamically generate column names

import pandas as pd

# Input
data_file = "smallsample.txt"

# Delimiter
data_file_delimiter = ','

# The max column count a line in the file could have
largest_column_count = 0

# Loop the data lines
with open(data_file, 'r') as temp_f:
    # Read the lines
    lines = temp_f.readlines()

    for l in lines:
        # Count the columns in the current line
        column_count = len(l.split(data_file_delimiter))
        
        # Set the new most column count
        largest_column_count = column_count if largest_column_count < column_count else largest_column_count

# Generate column names (will be 0, 1, 2, ..., largest_column_count - 1)
column_names = [i for i in range(0, largest_column_count)]

# Read csv
df = pd.read_csv(data_file, header=None, delimiter=data_file_delimiter, names=column_names)
# print(df)

Missing values will be assigned to the columns which your CSV lines don't have a value for.

P-S
  • Just minor issue. Do you really have to do `temp_f.close()`? The with statement should automatically close it for you. – demongolem Sep 22 '20 at 15:00
16

A polished version of P-S's answer follows. It works, but remember that it inserts a lot of missing values into the DataFrame.

import pandas as pd

### Loop the data lines
with open("smallsample.txt", 'r') as temp_f:
    # get No of columns in each line
    col_count = [ len(l.split(",")) for l in temp_f.readlines() ]

### Generate column names  (names will be 0, 1, 2, ..., maximum columns - 1)
column_names = [i for i in range(0, max(col_count))]

### Read csv
df = pd.read_csv("smallsample.txt", header=None, delimiter=",", names=column_names)
shantanu pathak
15

If you want something really concise without explicitly giving column names, you could do this:

  • Make a one column DataFrame with each row being a line in the .csv file
  • Split each row on commas and expand the DataFrame
import pandas as pd

df = pd.read_fwf('<filename>.csv', header=None)
df = df[0].str.split(',', expand=True)
heothesennoc
  • Nailed it, thanks. Here's the doc link for anyone else who wasn't aware of this function. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html – Ejaz Feb 26 '20 at 16:44
  • 2
    Be warned that this would only work for very basic CSV files without commas within quoted strings. So `a,b,"c,d,e",f` would be interpreted as `a b "c d e" f` rather than `a b c,d,e f` – Rae Jun 29 '20 at 22:34
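To illustrate that caveat, a quick comparison of a naive split against Python's csv module, which honors quoting:

```python
import csv
import io

line = 'a,b,"c,d,e",f'

naive = line.split(',')                       # every comma is a separator
proper = next(csv.reader(io.StringIO(line)))  # quoted commas are preserved

print(naive)   # ['a', 'b', '"c', 'd', 'e"', 'f'] -- 6 fields
print(proper)  # ['a', 'b', 'c,d,e', 'f'] -- 4 fields
```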
12

Error tokenizing data. C error: Expected 4 fields in line 2, saw 8

The error message gives the clue: "Expected 4 fields in line 2, saw 8" means the second row has 8 fields while the first row has only 4.

import pandas as pd
# inside range, set the maximum value you see in the error message
# "Expected 4 fields in line 2, saw 8" -- here it is 8
data = pd.read_csv("smallsample.txt", header=None, names=range(8))

Use range instead of setting names manually, which becomes cumbersome when you have many columns.

You can use shantanu pathak's method to find longest row length in your data.

Additionally, you can fill the NaN values with 0 if you need rows of equal length, e.g. for clustering (k-means):

new_data = data.fillna(0)
amran hossen
3

We could even use the pd.read_table() method to read the CSV file, which produces a single-column DataFrame whose rows can then be split on ','.
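A possible sketch of that idea, again writing the sample rows from the question out first. csv.QUOTE_NONE plus a separator that never occurs in the data keeps each line intact as a single string; note the same quoted-comma caveat as the read_fwf answer above:

```python
import csv
import pandas as pd

# Write the two sample rows from the question so the sketch is runnable
with open("smallsample.txt", "w") as f:
    f.write('"H","BBB","D","Ajxxx Dxxxs"\n')
    f.write('"R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"\n')

# With a tab separator (absent from the data) and quoting disabled,
# each line lands in a single column as a raw string
df = pd.read_table("smallsample.txt", header=None, sep="\t",
                   quoting=csv.QUOTE_NONE)

# Split each raw line on ','; the quote characters stay attached to the
# values, and the quoted comma in row 2 produces a spurious extra column
df = df[0].str.split(",", expand=True)
print(df.shape)  # (2, 9)
```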

kavin
1

Manipulate your CSV so that the first row is the one with the most elements; all following rows then have that many elements or fewer. Pandas will create as many columns as the first row has.

idmoreno