2

I have large csv file with 1000 columns, column 0 is an id, the other columns are categorical. I would like to convert them to integer values in order to use them for data analysis. First "dummy" way would work if I had enough memory:

filename_cat_train = "../input/train_categorical.csv"
df = pd.read_csv(filename_cat_train, dtype=str)

for column in df.columns[1:]:
    df[column] = df[column].astype('category')

columns = df.select_dtypes(['category']).columns
df[columns] = df[columns].apply(lambda x: x.cat.codes)

df.to_csv("../input/train_categorical_rawconversion.csv", index=False)

but it lasts very long, and definitely not a smart way to solve the task.

I could just load the data file in chunks and then combine after converting to int values using the approach above. However when loading in chunks (even 100k large), not all categories are present in my data. This means, having values T10, T11, T13 in the first chunk, and T10, T11, T12 in the second, different values appear for categories in chunks.

The optimal way for me would be: 0. create the list of categorical and corresponding int values (there are only like 100, and it is easy to retrieve them all from the data) 1. Load data in chunks 2. substitute the values from the list 3. save each chunk and them combine them.

How could I perform such steps efficiently? Maybe better approach exists? Thanks!

Update1: the categorical data in of the same 'type. They are keys like T12, T45689, A3333 etc. the csv file is like that: 4,,,,,T12,,,,,,A44,,,,,,B3333,

Vitaliy
  • 137
  • 1
  • 10

1 Answers1

1

In this case, it indeed seems that a two-pass scheme might be effective.

Starting with

import pandas as pd
data=pd.read_csv(my_file_name, chunksize=my_chunk_size)

You could do:

import collections
uniques = collections.defaultdict(list)
for chunk in data: 
    for col in chunk:
        uniques[col].update(chunk[col].unique())

At this point, uniques should map each column name to the unique items appearing in it. To translate to a map, you can now use

for col in uniques:
   uniques[col] = dict((e[1], e[0]) for e in enumerate(uniques[col]))

Now read the file again, and translate each column using the map corresponding to it (see here.)


If your columns all contain keys from "the same dictionary", you can do the following:

Starting with the following

import pandas as pd
data=pd.read_csv(my_file_name, chunksize=my_chunk_size)

You could do:

uniques = set([])
for chunk in data: 
    for col in cols:
        uniques.update(chunk[col].unique())

At this point, uniques should contain the unique items appearing in the DataFrame. To translate to a map, you can now use

uniques = dict((e[1], e[0]) for e in enumerate(uniques))

Now, load the DataFrame again, and use pd.DataFrame.replace.

Community
  • 1
  • 1
Ami Tavory
  • 74,578
  • 11
  • 141
  • 185
  • Thanks! How about if the categorical data in all columns are the same. They are keys like T12, T45689, A3333 etc. the csv file is like that: 4,,,,,T12,,,,,,A44,,,,,,B3333, (Update1 in post) – Vitaliy Oct 02 '16 at 09:57
  • No, I mean that your solution and like works great, but all the column have the same data. I mean I would like to replace categorical keys to integers like in your link, but to the whole data, not column-wise – Vitaliy Oct 02 '16 at 10:02
  • I mean all the columns contain data from the same 'dictionary', and I would like to replace them in the whole dataframe – Vitaliy Oct 02 '16 at 10:03
  • @Vitaliy Let me update the answer in a few minutes to what I managed to get. If it's not what you meant, feel free to comment on that, OK? – Ami Tavory Oct 02 '16 at 10:03
  • @Vitaliy See update. LMK if this doesn't address your question. – Ami Tavory Oct 02 '16 at 10:13
  • @Tavory thank you very much, it looks like exactly what I need! I will give it a try today's evening. – Vitaliy Oct 02 '16 at 10:16
  • @Vitaliy Great. All the best. – Ami Tavory Oct 02 '16 at 10:24
  • @Tavory. Thanks, it works brilliant! One follow-up question: how can I count the occurrences? – Vitaliy Oct 02 '16 at 12:00
  • Instead of a set, use collections.Counter, and pass in the column values themselves, without .unique(). – Ami Tavory Oct 02 '16 at 12:07
  • Thank you so much! – Vitaliy Oct 02 '16 at 12:16