
I want to convert my categorical values to one-hot encoding with `get_dummies`, but I have 4340 unique categorical values and I am getting a memory error because of this. How can I handle this many categorical values?

Update:

Since the data is sensitive, I cannot post it as-is, so I have given the column a different name; the values, however, look like the real ones. It is also not possible to post all 4340 unique categorical values, so here is a small sample.

| index | ID    |
|-------|-------|
| 46    | R05   |
| 61    | M9901 |
| 72    | J301  |
| 103   | F411  |
| 135   | R070  |
| 139   | J069  |
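
To reproduce the problem at scale, the setup presumably looks something like this (the DataFrame, the row count, and the generated code values are hypothetical stand-ins for the sensitive data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: 4340 unique codes over a large (made-up) row count.
n_rows, n_codes = 1_000_000, 4340
codes = [f"C{i}" for i in range(n_codes)]
df = pd.DataFrame({"ID": np.random.choice(codes, size=n_rows)})

# Dense one-hot encoding materialises n_rows x n_codes cells at once
# (~4.3 billion here, i.e. several GB even at one byte per cell),
# which is what raises the MemoryError:
# dummies = pd.get_dummies(df["ID"])
```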
  • Can you add a data sample? – jezrael Jul 25 '18 at 07:47
  • @jezrael please check. – Ironman Jul 25 '18 at 07:54
  • It is more complicated than it seems; maybe check [this](https://stackoverflow.com/a/44232474), but if the native function is too slow, it is possible to use a server with more `RAM` (which in practice is very often not possible :() – jezrael Jul 25 '18 at 08:04
  • How many rows (data samples) do you have? 4340 categories will be converted into `number of rows x 4340` columns, so I suspect you also have many rows. It all depends on how the data is used after one-hot encoding. Have you tried [LabelEncoder + OneHotEncoder](https://stackoverflow.com/a/48079345/3374996)? OneHotEncoder returns sparse matrices, which will save you memory (see the first sketch after these comments). – Vivek Kumar Jul 25 '18 at 10:26
  • But is it a good approach to have 4340 columns? – Ironman Jul 25 '18 at 15:01
  • That is an entirely different question. You can replace very rarely occurring categories with a new fixed category ("Other", or "Unknown", etc.), which will bring the column count down (see the second sketch after these comments). But then you need to make sure that it does not affect model performance much. – Vivek Kumar Jul 27 '18 at 06:55
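
A minimal sketch of the sparse route suggested above, assuming a recent scikit-learn (≥ 0.20), where `OneHotEncoder` accepts string categories directly so the separate `LabelEncoder` step is no longer needed; the sample DataFrame is a hypothetical stand-in for the sensitive column:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample standing in for the sensitive column.
df = pd.DataFrame({"ID": ["R05", "M9901", "J301", "F411", "R070", "J069"]})

# OneHotEncoder returns a SciPy sparse matrix by default, so only the
# non-zero entries are stored; handle_unknown="ignore" keeps transform
# from failing on codes not seen during fit.
enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(df[["ID"]])

print(X.shape)  # (6, 6) here; (n_rows, ~4340) on the real data
print(X.nnz)    # exactly one stored non-zero per row
```

Pandas can do the same with `pd.get_dummies(df["ID"], sparse=True)`, which backs the dummy columns with sparse arrays instead of a dense block.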
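
And a minimal sketch of the rare-category bucketing from the last comment; the data, the "Other" label, and the `min_count` threshold are all hypothetical knobs to tune against validation performance:

```python
import pandas as pd

# Hypothetical column with a long tail of rare codes.
s = pd.Series(["R05", "R05", "R05", "M9901", "J301", "F411", "R070", "J069"])

# Keep codes that occur at least `min_count` times; fold everything
# else into a single "Other" bucket before encoding.
min_count = 2
counts = s.value_counts()
keep = counts[counts >= min_count].index
s_reduced = s.where(s.isin(keep), "Other")

print(s_reduced.value_counts())  # R05: 3, Other: 5
dummies = pd.get_dummies(s_reduced, sparse=True)  # far fewer columns now
```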

0 Answers