
I have a big pandas DataFrame loaded into memory and I am trying to use memory more efficiently.

For this purpose, I won't use the full data frame; I only need the rows I am interested in, which I subset like this:

import pandas as pd

DF = pd.read_csv("Test.csv")
DF = DF[DF['A'] == 'Y']

I have already tried this approach, but I am not sure it is the most effective. Is the solution above the most memory-efficient? Please advise.

MaxU - stand with Ukraine
Felix

1 Answer


You can try the following trick (if you can read the whole CSV file into memory):

DF = pd.read_csv("Test.csv").query("A == 'Y'")

Alternatively, you can read your data in chunks by passing the chunksize parameter to read_csv().
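A minimal sketch of the chunked approach (the CSV contents here are hypothetical stand-ins for your "Test.csv"; only one chunk plus the already-filtered pieces are held in memory at a time):

```python
import io
import pandas as pd

# Hypothetical sample data standing in for "Test.csv".
csv_data = io.StringIO(
    "A,B\n"
    "Y,1\n"
    "N,2\n"
    "Y,3\n"
)

# Read the file in chunks of 2 rows, keep only rows where A == 'Y',
# then concatenate the filtered pieces into the final DataFrame.
chunks = pd.read_csv(csv_data, chunksize=2)
DF = pd.concat(chunk[chunk['A'] == 'Y'] for chunk in chunks)

print(DF)
```

In real use you would pass the file path instead of the StringIO buffer, and pick a chunksize large enough to keep I/O efficient.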

But I would strongly recommend saving your data in HDF5 table format (you may also want to compress it); then you could read your data conditionally, using the where parameter of the read_hdf() function.

For example:

df = pd.read_hdf('/path/to/my_storage.h5', 'my_data', where="A == 'Y'")
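To create such a store in the first place, here is a sketch (it assumes the PyTables package is installed, since pandas uses it for HDF5; the file path, key, and sample data are hypothetical):

```python
import os
import tempfile
import pandas as pd

# Hypothetical sample data; in practice this would come from your CSV.
df = pd.DataFrame({'A': ['Y', 'N', 'Y'], 'B': [1, 2, 3]})

path = os.path.join(tempfile.mkdtemp(), 'my_storage.h5')

# Save in HDF5 *table* format; data_columns=['A'] makes column A
# queryable with `where`, and blosc compression shrinks the file.
df.to_hdf(path, key='my_data', format='table',
          data_columns=['A'], complib='blosc', complevel=9)

# Later, read back only the matching rows:
subset = pd.read_hdf(path, 'my_data', where="A == 'Y'")
print(subset)
```

Note that only columns listed in data_columns (plus the index) can be used in a where condition.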

Here you can find some examples and a comparison of usage for different storage options

MaxU - stand with Ukraine