3

This question is conceptually similar to the question here: Python Pandas: How to create a binary matrix from column of lists?, but due to the size of my data, I do not want to convert into a pandas data frame.

I have a list of lists like the following,

list_ = [[5, 3, 5, 2], [6, 3, 2, 1, 3], [5, 3, 2, 5, 2]]

And I would like a binary matrix with each unique value as a column, and each sublist as a row.

How could this be done efficiently on over 100000 sublists with around 1000 items each?

Edit:

Example output is similar to the output in the question linked above, where the list could essentially be considered as:

list_ = [["a", "b"], ["c"], ["d"], ["e"]]

   a  b  c  d  e
0  1  1  0  0  0
1  0  0  1  0  0
2  0  0  0  1  0
3  0  0  0  0  1
Jack Arnestad
  • You have a ragged list here. Can you explain what your output should look like? – cs95 Jun 05 '18 at 14:50
  • How many unique values are there in total? In the worst case, there will be `10**8` unique values, leading to `10**13` entries in the matrix, so you better have a few terabytes of memory to fit the matrix in. More to the point, why are you transforming your data to a less memory-efficient representation in the first place? Please provide more context about the problem you are solving. – Sven Marnach Jun 05 '18 at 14:56
  • @SvenMarnach I want to do a Fisher's exact test on each feature (number) and use it as a feature selection method. I have another list with a categorical assignment for each sublist. Perhaps it would be better to iterate through. If you could provide some insight on this that would be appreciated. – Jack Arnestad Jun 05 '18 at 14:57

2 Answers

2

Using sklearn's CountVectorizer:

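from sklearn.feature_extraction.text import CountVectorizer

# Each sublist is already a list of tokens, so the tokenizer passes it through unchanged
cv = CountVectorizer(tokenizer=lambda x: x, lowercase=False)
m = cv.fit_transform(list_)  # sparse matrix of counts, one row per sublist

# To transform to a dense matrix
m.todense()

# To get the value corresponding to each column
# (get_feature_names() was removed in scikit-learn 1.2)
cv.get_feature_names_out()

# If you need dummy (0/1) columns, not counts
m = (m > 0)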

You may want to keep it as a sparse matrix for memory reasons.
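As a rough sketch of the same idea without scikit-learn, the sparse matrix could also be assembled directly with SciPy. The helper name `binary_sparse_matrix` is hypothetical, and the sketch assumes every sublist is non-empty and that all values are of one sortable type:

```python
import numpy as np
from scipy.sparse import csr_matrix

def binary_sparse_matrix(rows):
    """Hypothetical helper: sparse 0/1 matrix with one row per sublist."""
    # Number of items in each sublist, used to repeat the row index per item
    lengths = np.fromiter((len(r) for r in rows), dtype=np.int64, count=len(rows))
    flat = np.concatenate([np.asarray(r) for r in rows])
    # Sorted distinct values become column labels; col_idx maps each item to its column
    labels, col_idx = np.unique(flat, return_inverse=True)
    row_idx = np.repeat(np.arange(len(rows)), lengths)
    data = np.ones(len(flat), dtype=np.int8)
    m = csr_matrix((data, (row_idx, col_idx)), shape=(len(rows), len(labels)))
    # Repeated values within a sublist are summed on construction; reset them to 1
    m.data[:] = 1
    return m, labels

list_ = [[5, 3, 5, 2], [6, 3, 2, 1, 3], [5, 3, 2, 5, 2]]
m, labels = binary_sparse_matrix(list_)
print(labels)       # column labels
print(m.toarray())  # dense view, only sensible for small inputs
```

Only the nonzero entries are stored, so memory grows with the total number of items rather than with rows times columns.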

phi
0

Each value in a sublist (row) is used as a column index that is set to 1 (True); all other columns in that row stay 0 (False):

import numpy as np

list_ = [[5, 3, 5, 2], [6, 3, 2, 1, 3], [5, 3, 2, 5, 2]]

##################################
# convert to binary matrix
##################################
# number of columns = largest value in list_ plus one
# (this assumes the values are non-negative integers)
nbr_of_columns = max(map(max, list_)) + 1

Mat = np.zeros((len(list_), nbr_of_columns), dtype=bool)
for i, row in enumerate(list_):
    # index with the whole sublist to set all of its columns at once
    Mat[i, row] = True
        
print(Mat)
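The approach above allocates a column for every integer up to the maximum, including values that never occur (0 and 4 in this example). A possible variant, sketched here under the assumption that one column per distinct value is wanted, maps the values to compact column positions first. The names `labels` and `mat` are illustrative:

```python
import numpy as np

list_ = [[5, 3, 5, 2], [6, 3, 2, 1, 3], [5, 3, 2, 5, 2]]

# Sorted distinct values become the column labels
labels = np.unique(np.concatenate(list_))

mat = np.zeros((len(list_), len(labels)), dtype=bool)
for i, row in enumerate(list_):
    # searchsorted gives the position of each value within the sorted labels
    mat[i, np.searchsorted(labels, row)] = True

print(labels)
print(mat.astype(int))
```

This also works for values that are not small non-negative integers, since the values are never used as indices directly.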
