Handling ragged CSV columns in pandas

Question

I have a CSV file containing the following data (only the first few rows are shown):

0,11,31,65,67
1,31,33,67
2,33,43,67
3,31,33,67
4,24,31,33,65,67,68,71,75,76,93,97
5,31,33,67
6,65,93
7,2,33,34,51,66,67,84
8,44,55,66
9,2,33,51,54,67,84
10,33,51,66,67,84

The first column is the row number (e.g., the first value in the first row is 0). When I try to read the file with

import pandas as pd
df0 = pd.read_csv('df0.txt', header=None, sep=',')

the following error occurs:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 5, saw 12

I guess pandas infers the number of columns from the first row (5 columns). How can I declare the number of columns myself? There are 120 class labels in total, so I guess 121 columns should be enough.

Also, how can I transform the data into a one-hot encoded format? I want to feed it to a neural network model.

score 2 · Accepted Answer · answered Sep 09 '17 at 04:31

For your first problem, you can pass a names=... parameter to read_csv:

df = pd.read_csv('df0.txt', header=None, names=range(121), sep=',')
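
As a minimal sketch of the difference, the first five rows from the question can be read from an in-memory string (an assumption made here so the example runs without df0.txt). Without names= the parser fixes the width at 5 fields and fails on the 12-field line; with names=range(121) the short rows are padded with NaN.

```python
import io
import pandas as pd

# First five rows from the question, held in memory instead of df0.txt
sample = """0,11,31,65,67
1,31,33,67
2,33,43,67
3,31,33,67
4,24,31,33,65,67,68,71,75,76,93,97
"""

# Without names=, the width is inferred from the first row and line 5 is rejected
try:
    pd.read_csv(io.StringIO(sample), header=None)
    failed = False
except pd.errors.ParserError:
    failed = True

# With names=, the width is declared up front; missing cells become NaN
df = pd.read_csv(io.StringIO(sample), header=None, names=range(121))
print(failed, df.shape)
```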

As for your second problem, there's an existing solution here that uses sklearn.preprocessing.OneHotEncoder. If you want to convert each column to a one-hot encoding, you can use it.

cs95


It works! And I am trying to understand how to implement a one-hot encoder. :D – Ho Nam Cheung Sep 09 '17 at 13:16

Simon · Answer 2 · 2017-09-09T19:48:45.977

I gave this my best shot, though I don't think it's very good. Based on my own ML knowledge and your question, I took you to be asking the following:

1.) You have a CSV of numbers.
2.) This is for a problem with 120 classes.
3.) You want a matrix with 1s and 0s for each class.
4.) For example, a CSV such as:

1, 3
2, 3, 6

would be the feature matrix

Column:
1, 2, 3, 6

1, 0, 1, 0
0, 1, 1, 1
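
To make that interpretation concrete, here is a small sketch that builds the matrix above directly from the two example lines. The variable names (lines, classes, col_of, matrix) are illustrative only, and it assumes each line lists the classes present for that row.

```python
import numpy as np

# The two example CSV lines from above
lines = ["1, 3", "2, 3, 6"]

# Parse each line into a list of integer class labels
rows = [[int(tok) for tok in line.split(",")] for line in lines]

# One matrix column per distinct label, in sorted order
classes = sorted({label for row in rows for label in row})
col_of = {label: j for j, label in enumerate(classes)}

# Set a 1 wherever a row contains the label
matrix = np.zeros((len(rows), len(classes)), dtype=int)
for i, row in enumerate(rows):
    for label in row:
        matrix[i, col_of[label]] = 1

print(classes)          # [1, 2, 3, 6]
print(matrix.tolist())  # [[1, 0, 1, 0], [0, 1, 1, 1]]
```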

This code achieves that, though it is surely not optimized:

import pandas as pd

# func must be defined before the loop at the bottom calls it
def func(df1, df2):
    # We can't join if columns overlap. Use set operations to identify
    non_overlapping_columns = list(set(df2.columns) - set(df1.columns))
    overlapping_columns = list(set(df2.columns) - set(non_overlapping_columns))

    # Join where possible
    df3 = df1.join(df2[non_overlapping_columns])

    # Manually add columns for overlaps
    for k in overlapping_columns:
        df3[k] = df3[k] + df2[k]

    return df3

file = 'df0.txt'  # path taken from the question
df = pd.read_csv(file, header=None, names=range(121), sep=',')

# One indicator frame per CSV column; dtype=int keeps 0/1 values that can be added.
# Note: every column is encoded, including the row-number column of the question's file.
one_hot = [pd.get_dummies(df[k], dtype=int) for k in df.columns]

# Merge the per-column frames one at a time
df = one_hot[0]
for dummies in one_hot[1:]:
    df = func(df1=df, df2=dummies)
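
A more compact pandas-only variant of the same idea might look like the sketch below. It is not the code above, just one possible alternative, and it makes two assumptions: the first column is the row number (so index_col=0 keeps it out of the encoding), and the first four rows of the question are read from an in-memory string rather than from df0.txt.

```python
import io
import pandas as pd

# First four rows from the question, held in memory instead of df0.txt
sample = "0,11,31,65,67\n1,31,33,67\n2,33,43,67\n3,31,33,67\n"

# index_col=0 uses the row-number column as the index instead of as data
df = pd.read_csv(io.StringIO(sample), header=None, names=range(121), index_col=0)

# Long format: one entry per filled cell; dropna() discards the NaN padding
labels = df.stack().dropna().astype(int)

# One indicator column per distinct label, collapsed back to one row per CSV line
indicators = pd.get_dummies(labels, dtype=int).groupby(level=0).max()

print(indicators.columns.tolist())  # [11, 31, 33, 43, 65, 67]
print(indicators.values.tolist())
```

Only labels that actually occur get a column here; a fixed set of 120 class columns would need an extra reindexing step, which depends on how the labels are numbered.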

From here you could feed it into sklearn's OneHotEncoder, as @cᴏʟᴅsᴘᴇᴇᴅ noted.

That would look like this:

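from sklearn.preprocessing import OneHotEncoder

# The constructor takes options, not data; pass the data to fit_transform,
# which returns a SciPy sparse matrix by default.
onehot = OneHotEncoder()
encoded = onehot.fit_transform(df)

# Compare memory use: sparse matrix buffers vs. the pandas DataFrame
encoded.data.nbytes + encoded.indices.nbytes + encoded.indptr.nbytes
df.memory_usage(deep=True).sum()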

I'm unsure whether the assumptions I noted above match what you want done with your data; it seems perhaps they don't.

I assumed that each line in your CSV lists the classes that are present for that row. I'm still a little unclear on that.

Thanks for your help too. I am trying to understand how to implement one hot encoder. :D — Ho Nam Cheung, Sep 09 '17 at 13:17
From there you'd just do something like: from sklearn.preprocessing import OneHotEncoder onehot = OneHotEncoder(df) import sys sys.getsizeof(onehot) sys.getsizeof(df) — Simon, Sep 09 '17 at 19:46
Actually I am trying to perform extreme classification with the help of neural network. here is the data set: http://manikvarma.org/downloads/XC/XMLRepository.html (the Mediamill one) — Ho Nam Cheung, Sep 10 '17 at 08:13
And I change the class label column into one hot encoder format first. — Ho Nam Cheung, Sep 10 '17 at 08:16
