
I have seen many questions about filtering a pandas DataFrame with millions of rows. However, my specific requirement is to filter a DataFrame with roughly 60,000 columns.

[image: screenshot of the DataFrame]

As seen in the image above, the shape is 25,000 rows by 57,827 columns. I want to filter the DataFrame based on the column "Label", which contains only the values 1 and 0.

Normally, the following should work:

negative_review= Train_clean[Train_clean['label'] == "0"]
positive_review = Train_clean[Train_clean['label'] == "1"]

However, I run out of memory and the kernel dies. The only conclusion I can draw is that this is due to the huge number of columns.
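
For reference, here is a minimal sketch of the split on a small stand-in frame (the real Train_clean has ~57,827 word-count columns, so the shapes here are only illustrative). It assumes the column is really named `label`, as in the code above, and is integer-typed, so the comparison is against 0/1 rather than the strings "0"/"1"; the mask is built once and the original frame is dropped after the split so that three huge frames are not held in memory at the same time:

```python
import numpy as np
import pandas as pd

# Toy stand-in for Train_clean: 25,000 rows x 100 word-count columns plus a label.
# (Illustrative only; the real frame comes from pd.concat([Train, word_counts], axis=1).)
rng = np.random.default_rng(0)
Train_clean = pd.DataFrame(rng.integers(0, 3, size=(25_000, 100)),
                           columns=[f"w{i}" for i in range(100)])
Train_clean["label"] = rng.integers(0, 2, size=25_000)

# Build the boolean mask once; compare against int 0/1 if the column is numeric
# (comparing an integer column to the string "0" silently matches nothing).
mask = Train_clean["label"] == 0

negative_review = Train_clean[mask]
positive_review = Train_clean[~mask]

# Free the original frame once the two splits exist, so we are not holding
# three very wide frames at the same time.
del Train_clean
```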

I found only one similar question on SO, but it does not solve my problem: link to question

Please provide me with some guidance.

Amal Sailendran
  • Probably not the answer you're looking for, but I would rethink a data model that has 60,000 columns in a single table/DataFrame. Good luck. – JD2775 Apr 13 '21 at 03:10
  • Could you explain a bit more? I did not quite catch the idea – Amal Sailendran Apr 13 '21 at 03:17
  • 50,000 columns is quite excessive; you should consider eliminating irrelevant columns, or possibly LDA, or scaling down – Ade_1 Apr 13 '21 at 03:21
  • Two options you can explore: (1) can you put a filter before the merge step? (2) does the merge need a key, so that it isn't creating all the permutations when the merge is based on df.columns? – simpleApp Apr 13 '21 at 03:21
  • 60,000 columns tells me you have data that is probably not all related to each other, and should be split into multiple data frames, or you are portraying values as actual column names. Either way, it could use a remodel – JD2775 Apr 13 '21 at 03:22
  • So the merge is required as it is part of creating a spam filter for use in the Naive Bayes classifier. As such, I cannot eliminate columns or scale them down. I don't think filtering before the merge is a good idea for my requirement. The 50,000 columns are basically a list of words and the number of times each occurs in each of the sentences. This is how the merge happens: `Train_clean = pd.concat([Train, word_counts], axis=1)`. It is after this step that almost 50,000 columns get added. Hope it makes sense – Amal Sailendran Apr 13 '21 at 03:26
  • The df seems pretty sparse, so maybe you could [convert to sparse or shrink the int size](https://stackoverflow.com/a/66672335/13138364), e.g. `.astype('Sparse[int]')` or `.astype('int8')` depending on your max word count (see the sketch after this comment thread) – tdy Apr 13 '21 at 03:39
  • The kernel still gets killed – Amal Sailendran Apr 13 '21 at 03:51
  • @AmalSailendran, I agree with the rest of them. I strongly recommend that you revisit how you are trying to do the analysis. You may be able to group some of the words into one unit. For example, `laughed` can be grouped with other words that have a similar meaning. Then you have a subset of columns to work with. Even if you transpose the dataframe, you will have only 25000 columns compared to 57827 columns. Think of an alternate data model before you start working on this data – Joe Ferndz Apr 13 '21 at 05:32
  • Hi Joe, if you are talking about stemming, I have already done that, which reduced the vocabulary by almost 20,000 words, down to the 57,827 you see. As for transposing, is it a good approach? – Amal Sailendran Apr 13 '21 at 05:40
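
A minimal sketch of the conversion tdy suggests in the comments, using toy stand-ins for the Train and word_counts frames mentioned by the asker (the real ones come from their pipeline, so shapes and values here are made up). Either a smaller integer dtype or a sparse integer dtype is applied to the word-count block before the concat:

```python
import numpy as np
import pandas as pd

# Toy stand-ins; the real Train and word_counts come from the asker's pipeline.
rng = np.random.default_rng(0)
Train = pd.DataFrame({"label": rng.integers(0, 2, size=25_000)})
word_counts = pd.DataFrame(rng.integers(0, 3, size=(25_000, 100)),
                           columns=[f"word_{i}" for i in range(100)])

# Option 1: shrink the integer size before concatenating.
# int8 covers counts up to 127, usually enough for per-sentence word counts.
word_counts_small = word_counts.astype("int8")

# Option 2: store the mostly-zero counts as a sparse integer dtype (fill value 0).
word_counts_sparse = word_counts.astype(pd.SparseDtype("int8", 0))

Train_clean = pd.concat([Train, word_counts_sparse], axis=1)

# Compare memory footprints of the dense and sparse versions.
print(word_counts.memory_usage(deep=True).sum())
print(word_counts_sparse.memory_usage(deep=True).sum())
```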

0 Answers