I have a large CSV file with 1000 columns; column 0 is an id and the other columns are categorical. I would like to convert them to integer values in order to use them for data analysis. A first "dummy" way would work if I had enough memory:
import pandas as pd

filename_cat_train = "../input/train_categorical.csv"
df = pd.read_csv(filename_cat_train, dtype=str)

# convert every column except the id to a categorical dtype, then replace the values by their codes
for column in df.columns[1:]:
    df[column] = df[column].astype('category')
columns = df.select_dtypes(['category']).columns
df[columns] = df[columns].apply(lambda x: x.cat.codes)

df.to_csv("../input/train_categorical_rawconversion.csv", index=False)
but it takes very long and is definitely not a smart way to solve the task.
I could load the data file in chunks, convert each chunk to int values with the approach above, and then combine them. However, when loading in chunks (even 100k rows large), not all categories are present in each chunk. This means that with the values T10, T11, T13 in the first chunk and T10, T11, T12 in the second, the same integer code ends up assigned to different categories in different chunks.
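A toy illustration of what I mean (just a sketch with made-up values; passing an explicit categories list to pd.Categorical is, as far as I know, one way to keep the codes stable):

import pandas as pd

# without a fixed category list, the codes depend on which values a chunk happens to contain
chunk1 = pd.Categorical(["T10", "T11", "T13"])
chunk2 = pd.Categorical(["T10", "T11", "T12"])
print(chunk1.codes)  # [0 1 2] -> T13 becomes 2
print(chunk2.codes)  # [0 1 2] -> T12 also becomes 2, clashing with T13 above

# passing the full category list up front keeps the codes consistent
all_values = ["T10", "T11", "T12", "T13"]
print(pd.Categorical(["T10", "T11", "T13"], categories=all_values).codes)  # [0 1 3]
print(pd.Categorical(["T10", "T11", "T12"], categories=all_values).codes)  # [0 1 2]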
The optimal way for me would be (a rough sketch is below):
0. create the list of category values and their corresponding int values (there are only about 100, and it is easy to retrieve them all from the data)
1. load the data in chunks
2. substitute the values from the list
3. save each chunk and then combine them.
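Something along these lines is what I have in mind (a rough, untested sketch; the two-pass structure and the mapping dict are my own idea, only read_csv's chunksize and to_csv's append mode are standard pandas):

import pandas as pd

filename_cat_train = "../input/train_categorical.csv"
filename_out = "../input/train_categorical_rawconversion.csv"
chunksize = 100000

# pass 1: collect every category value that occurs anywhere in the file
categories = set()
for chunk in pd.read_csv(filename_cat_train, dtype=str, chunksize=chunksize):
    for column in chunk.columns[1:]:
        categories.update(chunk[column].dropna().unique())
mapping = {cat: code for code, cat in enumerate(sorted(categories))}

# pass 2: substitute the values chunk by chunk and append to the output file
first = True
for chunk in pd.read_csv(filename_cat_train, dtype=str, chunksize=chunksize):
    for column in chunk.columns[1:]:
        chunk[column] = chunk[column].map(mapping)
    chunk.to_csv(filename_out, mode="w" if first else "a", header=first, index=False)
    first = False

Since the mapping is built once from the whole file, the same key gets the same integer in every chunk.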
How could I perform these steps efficiently? Or maybe a better approach exists? Thanks!
Update 1: the categorical data is all of the same 'type': keys like T12, T45689, A3333, etc. The csv file looks like this: 4,,,,,T12,,,,,,A44,,,,,,B3333,