Would it be more space efficient to convert columns with binary values to 'category' or 'int8' data type? I'm working with half a million rows and a couple thousand columns of binary values.

UPDATE: Just for clarification, the individual cells will be just a 0 or a 1, not a combination of them.

jma
  • Even `0` will use up 1 byte. You'd think 1 bit *should* be possible, but this is not true. Your *best* option is to aggregate 8 binary values to a byte and store as array of `int`: look at [Converting Binary Numpy Array into Unsigned Integer](https://stackoverflow.com/questions/46184684/converting-binary-numpy-array-into-unsigned-integer). – jpp Mar 29 '18 at 17:15
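The packing idea from the comment above can be sketched with NumPy's `np.packbits`. This is only an illustration under assumed data: the array name `bits` and its 1,000 by 16 shape are hypothetical, and packed bytes are no longer directly usable as individual DataFrame columns.

```python
import numpy as np

# Hypothetical data: 1,000 rows by 16 columns holding only 0 or 1.
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(1000, 16), dtype=np.uint8)

# Pack every 8 binary values of a row into a single byte.
packed = np.packbits(bits, axis=1)

# Unpack to recover the original matrix; `count` drops any padding bits
# that would be added when the column count is not a multiple of 8.
restored = np.unpackbits(packed, axis=1, count=bits.shape[1])

# Compare bytes used before and after packing.
print(bits.nbytes, packed.nbytes)
```

The trade-off is that every read of a single column requires unpacking, so this mainly suits storage or transfer rather than day-to-day analysis.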

1 Answer

You can use `sys.getsizeof()` to compare the sizes. Of course, it's not as simple as I make it seem below, but this could help.

import pandas as pd
import sys

# Build one single-cell DataFrame per dtype so their reported sizes can be compared.
string = pd.DataFrame({'str': ['010101']}, dtype='str')
cat = pd.DataFrame({'cat': ['010101']}, dtype='category')
int8 = pd.DataFrame({'int': ['010101']}, dtype='int8')
int32 = pd.DataFrame({'int': ['010101']}, dtype='int32')

print(sys.getsizeof(string),string.dtypes)
print()
print(sys.getsizeof(cat), cat.dtypes)
print()
print(sys.getsizeof(int8), int8.dtypes)
print()
print(sys.getsizeof(int32), int32.dtypes)

Output:

181 str    object
dtype: object

262 cat    category
dtype: object

105 int    int8
dtype: object

108 int    int32
dtype: object
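A single-cell frame is dominated by fixed overhead, so a comparison closer to the question's case (many rows of 0/1 values) may be more informative. The following is a minimal sketch, assuming a hypothetical 10,000 by 20 frame of random bits and using `DataFrame.memory_usage(deep=True)` in place of `sys.getsizeof()`; the helper `total_bytes` is an illustrative name, not a pandas function.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10,000 rows by 20 columns holding only 0 or 1.
rng = np.random.default_rng(0)
values = rng.integers(0, 2, size=(10_000, 20), dtype=np.int64)
df = pd.DataFrame(values, columns=[f"c{i}" for i in range(20)])

# Convert every column to each candidate dtype.
as_int8 = df.astype("int8")
as_bool = df.astype("bool")
as_category = df.astype("category")

def total_bytes(frame):
    """Sum per-column memory in bytes, excluding the index."""
    return int(frame.memory_usage(index=False, deep=True).sum())

for name, frame in [("int64", df), ("int8", as_int8),
                    ("bool", as_bool), ("category", as_category)]:
    print(name, total_bytes(frame))
```

For two-valued columns, `category` stores one small integer code per row plus the list of categories, so it should come out slightly larger than plain `int8`, while `int8` and `bool` both use one byte per cell.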
It_is_Chris