
When I run this code and look at the output of info(), the DataFrame that uses category types seems to take more space (932 bytes) than the DataFrame that uses object types (624 bytes).

import pandas as pd

def initData():
    myPets = {"animal":         ["cat",    "alligator", "snake",     "dog",    "gerbil",  "lion",      "gecko",  "hippopotamus", "parrot",   "crocodile", "falcon",   "hamster", "guinea pig"],
              "feel"  :         ["furry",  "rough",     "scaly",     "furry",  "furry",   "furry",     "rough",  "rough",        "feathery", "rough",     "feathery", "furry",   "furry"     ],
              "where lives":    ["indoor", "outdoor",   "indoor",    "indoor", "indoor",  "outdoor",   "indoor", "outdoor",      "indoor",   "outdoor",   "outdoor",  "indoor",  "indoor"    ],
              "risk":           ["safe",   "dangerous", "dangerous", "safe",   "safe",    "dangerous", "safe",   "dangerous",    "safe",     "dangerous", "safe",     "safe",    "safe"      ],
              "favorite food":  ["treats", "fish",      "bugs",      "treats", "grain",   "antelope",  "bugs",   "antelope",     "grain",    "fish",      "rabbit",   "grain",   "grain"     ],
              "want to own":    [1,        0,           0,           1,        1,         0,           1,        0,              1,          0,           1,          1,         1           ] }
    petDF = pd.DataFrame(myPets)
    petDF = petDF.set_index("animal")
    #print(petDF.info())
    #petDF.head(100)
    return petDF

def addCategoryColumns(myDF):
    myDF["cat_feel"]          = myDF["feel"].astype("category")
    myDF["cat_where_lives"]   = myDF["where lives"].astype("category")
    myDF["cat_risk"]          = myDF["risk"].astype("category")
    myDF["cat_favorite_food"] = myDF["favorite food"].astype("category")
    return myDF

objectsDF = initData()
categoriesDF = initData()
categoriesDF = addCategoryColumns(categoriesDF)
categoriesDF = categoriesDF.drop(["feel", "where lives", "risk", "favorite food"], axis = 1)
print(objectsDF.info())
print(categoriesDF.info())
categoriesDF.head()


<class 'pandas.core.frame.DataFrame'>
Index: 13 entries, cat to guinea pig
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   feel           13 non-null     object
 1   where lives    13 non-null     object
 2   risk           13 non-null     object
 3   favorite food  13 non-null     object
 4   want to own    13 non-null     int64 
dtypes: int64(1), object(4)
memory usage: 624.0+ bytes
None
<class 'pandas.core.frame.DataFrame'>
Index: 13 entries, cat to guinea pig
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   want to own        13 non-null     int64   
 1   cat_feel           13 non-null     category
 2   cat_where_lives    13 non-null     category
 3   cat_risk           13 non-null     category
 4   cat_favorite_food  13 non-null     category
dtypes: category(4), int64(1)
memory usage: 932.0+ bytes
None
nicomp

1 Answer


Numeric data, like int / float / category, is held in a numpy array. Put a million or two rows in it, so the per-column bookkeeping overhead becomes insignificant, and you'll see that memory use is exactly 8 × num_elements, or a smaller multiple for dtypes narrower than 64 bits.
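
A minimal sketch of that scaling (my illustration, not part of the original answer); the column names and the 2-million-row size are arbitrary:

import numpy as np
import pandas as pd

n = 2_000_000
big = pd.DataFrame({
    # 8 bytes per element
    "as_int64": np.arange(n, dtype="int64"),
    # 3 categories -> int8 codes, roughly 1 byte per element plus a tiny lookup table
    "as_category": pd.Series(np.random.choice(["furry", "scaly", "rough"], n)).astype("category"),
})
print(big.memory_usage(deep=True))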

In contrast, each element of an "object" column is a pointer to some externally allocated region of memory, typically a Python str. So numpy / pandas report the array size, 8 × num_elements when using 64-bit addresses, but leave it to you to sum up all those external allocations. At only 13 rows, the per-column category tables dominate the category figure, while the 624.0+ bytes for the object version excludes the strings themselves (that is what the "+" means).
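
For the DataFrames in the question, pandas' own deep accounting makes those hidden allocations visible:

# Also count the Python str objects behind the object columns
print(objectsDF.info(memory_usage="deep"))
print(objectsDF.memory_usage(deep=True))
print(categoriesDF.memory_usage(deep=True))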


Use getsizeof recursively, or use pympler, to better understand total memory consumption. Or use psutil to ask the OS about memory resources before / after you make a big allocation.
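
A rough sketch of those approaches, assuming the third-party packages pympler and psutil are installed (pip install pympler psutil); initData() is the function from the question:

import sys
import psutil
from pympler import asizeof

df = initData()

# getsizeof reports whatever the object's __sizeof__ returns; for nested
# plain-Python containers you would have to recurse into the children yourself
print(sys.getsizeof(df))

# pympler walks the reference graph for you
print(asizeof.asizeof(df))

# psutil reports process-level memory; compare before / after a big allocation
print(psutil.Process().memory_info().rss)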

J_H