
I am working on a project in which I have to analyse a large data set covering all tornadoes that have occurred in the USA to date. As Python is good for data analysis, I opted for it. But I have some questions for which I need some clarification:

  1. I am using pandas for the analysis. What I did till now is load all the .csv files into one big DataFrame (about 1 GB of .csv data). Now let's suppose I want to calculate the total deaths that happened in the year 2000. I wrote a query for that (see the sketch below). The query fetches results, but it takes some time. Is it good to store all the data in a DataFrame and fetch from it, or is there a faster approach?
  2. Another approach would be to create a JSON string of the entire file and query that JSON string. I haven't done anything with this approach yet. Is it a good one?
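Roughly, what I am doing at the moment looks like this (the folder and the column names yr and fat are just placeholders; my real files may use different names):

    import glob

    import pandas as pd

    # Read every .csv file and concatenate them into one big DataFrame.
    frames = [pd.read_csv(path) for path in glob.glob("tornado_data/*.csv")]
    tornadoes = pd.concat(frames, ignore_index=True)

    # Total deaths in the year 2000.
    total_deaths_2000 = tornadoes.loc[tornadoes["yr"] == 2000, "fat"].sum()
    print(total_deaths_2000)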

Thank You.

2 Answers


Pandas has some limitations regarding memory usage. That is also a general Python issue, because memory allocation is lazy; as soon as your memory is not enough, things get messy.

So I see two areas here: 1. saving memory, 2. optimizing for time.

What you could consider:

  1. For memory efficiency: read this link. 1.1 If you need all your data at once (e.g. for highly aggregated statistics such as sum([all columns])), you could carefully drop columns from your DataFrame that are not needed (see the first sketch after this list). Or switch to something other than pandas (e.g. HDF5, pyrocksdb, LevelDB ...), which would mean less comfortable analysis for you.
  2. For some operations, the correct setup of the pandas DataFrame is a significant time factor. Try checking your indexing scheme and, e.g., avoid looping over rows. 2.2 Using NumPy vectorized methods for some tasks will be significantly faster than pandas + Python scripting (see the second sketch after this list).

  3. I have personally also had very good experience with mixed approaches like pandas + SQLite, and then mini-batching between them (see point 1 and the third sketch after this list).
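To make point 1.1 concrete, here is a minimal sketch of loading only the columns you need with narrow dtypes, then dropping what you no longer use; the file and column names are made up for illustration:

    import pandas as pd

    # Only load the columns you actually need, with narrow dtypes,
    # instead of letting pandas default everything to int64/object.
    cols = ["yr", "st", "fat", "inj"]
    dtypes = {"yr": "int16", "st": "category", "fat": "int32", "inj": "int32"}

    df = pd.read_csv("tornadoes.csv", usecols=cols, dtype=dtypes)

    # See how much memory each column really takes.
    print(df.memory_usage(deep=True))

    # Dropping columns you no longer need frees memory as well.
    df = df.drop("inj", axis=1)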
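For point 2, a small sketch (same made-up column names) contrasting a Python-level loop with a vectorized mask; on a DataFrame of this size the vectorized version is typically orders of magnitude faster:

    import pandas as pd

    # Slow: iterating over rows in Python.
    def deaths_in_year_loop(df, year):
        total = 0
        for _, row in df.iterrows():
            if row["yr"] == year:
                total += row["fat"]
        return total

    # Fast: boolean mask plus a vectorized sum, no Python-level loop.
    def deaths_in_year_vectorized(df, year):
        return df.loc[df["yr"] == year, "fat"].sum()

    # For many repeated lookups by year, a sorted index helps too:
    # df = df.set_index("yr").sort_index()
    # df.loc[2000, "fat"].sum()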
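And a sketch of the pandas + SQLite mini-batching idea from point 3 (file, table and column names are again hypothetical): load the csv into SQLite once in chunks, then pull back only the small result sets you need.

    import sqlite3

    import pandas as pd

    conn = sqlite3.connect("tornadoes.db")

    # Load the big csv into SQLite in chunks, so it never has to fit
    # into RAM all at once.
    for chunk in pd.read_csv("tornadoes.csv", chunksize=100000):
        chunk.to_sql("tornadoes", conn, if_exists="append", index=False)

    # Later, pull only the mini batch you need back into pandas.
    deaths_2000 = pd.read_sql_query(
        "SELECT SUM(fat) AS total_deaths FROM tornadoes WHERE yr = 2000", conn
    )
    print(deaths_2000)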

PlagTag

Instead of pandas you can use the sframe library: https://dato.com/products/create/docs/generated/graphlab.SFrame.html

The sframe library allows you to save to a binary format that loads fast and is easily indexable. SFrame lets you work with data sets that are much larger than your available RAM because it works in batches and pages data to disk. The library can also make effective use of multiple cores to speed up joins and other operations; based on my experience it should be much faster for your use case.

The syntax is a bit less convenient than pandas, but it is similar in functionality and has a to_dataframe() operator to convert SFrames to pandas DataFrames.

To install it:

pip install sframe

You can use the read_csv API to read the csv file, then the save API to save it in the binary format, and then the load API to load the binary format later. This is all covered in the link above.
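A rough sketch of that round trip, assuming the standalone sframe package exposes the SFrame.read_csv / save / load_sframe / to_dataframe calls documented in the link above:

    import sframe

    # Read the csv once and save it in the binary .sframe format.
    sf = sframe.SFrame.read_csv("tornadoes.csv")
    sf.save("tornadoes.sframe")

    # Later sessions can load the binary version directly, which is much
    # faster than re-parsing the csv.
    sf = sframe.load_sframe("tornadoes.sframe")

    # Convert to a pandas DataFrame if you need the pandas API.
    df = sf.to_dataframe()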

Dan Taylor