6

How can I load two large CSV files (about 5 GB each) into a Jupyter Notebook on a local system using Python pandas? Please suggest any configuration that can handle big CSV files for data analysis.

Local System Configuration:
OS: Windows 10
RAM: 16 GB
Processor: Intel-Core-i7

Code:

import pandas as pd

dpath = 'p_flg_tmp1.csv'
pdf = pd.read_csv(dpath, sep="|")

Error:
MemoryError: Unable to allocate array

or

pd.read_csv(po_cust_data, sep="|", low_memory=False)

Error:
ParserError: Error tokenizing data. C error: out of memory

How can I handle two large CSV files on a local system for data analysis? Please suggest a better configuration, if possible, for a local system using Python pandas.

Srinivas K
    check [`dask`](https://docs.dask.org/en/latest/dataframe.html) – anky Jan 24 '20 at 06:12
  • The dask API performs well, but many of the functions and filters the pandas API offers for data analysis are not available. – Srinivas K Jan 24 '20 at 07:19
  • then check the other answers, related: https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas – anky Jan 24 '20 at 07:20
  • What does this CSV file look like (how many columns and dtypes)? If you wrote this CSV file yourself, think of using more efficient alternatives like the HDF5 format to handle bigger datasets. – max9111 Jan 24 '20 at 15:58
  • @max9111 We have a maximum of 10 columns in each file. All are primitive data types (string, varchar, double, timestamp, number, etc.). – Srinivas K Jan 27 '20 at 03:21

3 Answers

7

If you do not need to process everything at once, you can read the file in chunks:

import pandas as pd

reader = pd.read_csv('tmp.sv', sep='|', chunksize=4000)
for chunk in reader:
    print(chunk)

See the pandas documentation on reading files in chunks for further information.
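
If the analysis boils down to per-chunk aggregation, you can combine the partial results after the loop. A minimal sketch, assuming the asker's pipe-separated file and a hypothetical numeric column named amount:

import pandas as pd

# Hypothetical example: accumulate a global row count and a column sum
# without ever holding the full file in memory.
total_rows = 0
total_amount = 0.0

for chunk in pd.read_csv('p_flg_tmp1.csv', sep='|', chunksize=100000):
    total_rows += len(chunk)
    total_amount += chunk['amount'].sum()  # 'amount' is an assumed column name

print(total_rows, total_amount)

The same pattern works for min/max, value counts, or group-wise sums that you merge at the end.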

If you need to process everything at once and chunking really isn't an option, you only have two options left:

  1. Increase the RAM of your system
  2. Switch to another data storage type (see the sketch at the end of this answer)

A CSV file takes an enormous amount of RAM once loaded; see this article for more information. Even though it is about a different piece of software, it gives a good idea of the problem:

Memory Usage

You can estimate the memory usage of your CSV file with this simple formula:

memory = 25 * R * C + F 

where R is the number of rows, C the number of columns and F the file size in bytes.

One of my test files is 524 MB and contains 10 columns in 4.4 million rows. Using the formula from above, the RAM usage will be about 1.6 GB:

memory = 25 * 4,400,000 * 10 + 524,000,000 = 1,624,000,000 bytes

While this file is open in Tablecruncher, the Activity Monitor reports 1.4 GB of RAM used, so the formula gives a rather accurate estimate.
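
For option 2, the HDF5 suggestion from the comments can be combined with chunked reading: convert the CSV once, then query the HDF5 store for only the columns or rows you need. A minimal sketch, assuming the asker's file name and hypothetical column names (requires PyTables; string columns may need a min_itemsize hint when appending):

import pandas as pd

csv_path = 'p_flg_tmp1.csv'   # assumed input path
hdf_path = 'p_flg_tmp1.h5'    # assumed output path

# One-time conversion: stream the CSV into an appendable HDF5 table.
for chunk in pd.read_csv(csv_path, sep='|', chunksize=100000):
    chunk.to_hdf(hdf_path, key='data', mode='a', format='table', append=True)

# Later, load only what is needed, e.g. two columns (names are assumptions):
subset = pd.read_hdf(hdf_path, key='data', columns=['col1', 'col2'])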

Mailerdaimon
  • I can read with chunks, but how do I analyze the data together with the other big data frame? – Srinivas K Jan 24 '20 at 07:12
  • Well, that totally depends on what exactly you need to do. If possible, gather data per chunk and aggregate it once all your chunks are processed. But this only works if you want to analyse certain key values (e.g. statistical values of your data). – Mailerdaimon Jan 24 '20 at 07:36
3

Use chunks to read the data partially.

import pandas as pd

dpath = 'p_flg_tmp1.csv'

for pdf in pd.read_csv(dpath, sep="|", chunksize=1000):
    # process each chunk here (filter, aggregate, write out, ...)
    ...
ikibir
0

What's your overall goal here though? People are providing help with how to read it, but then what? You want to do a join/merge? You are gonna need more tricks to get through that.

But then what? Is the rest of your algorithm also chunkable? Are you going to have enough RAM left to process anything? And what about CPU performance? Is one little i7 enough? Do you plan on waiting hours or days for results? This might all be acceptable for your use case, sure, but we don't know that.

At a certain point, if you want to use big data, you need big computer(s). Do you really have to do this locally? Even if you aren't ready for distributed computing over clusters, you could just get an adequately sized VM instance. Your company will pay for it; it pays for itself. It's much cheaper to give you a better computer than to pay you to wait around for a small one to finish. In India, the ratio between labor and AWS costs is lower than in the US, sure, but it's still well worth it. Be like hey boss, do you want this to take 3 days or 3 weeks?

Realistically, your small computer problems are only going to get worse after reading in the CSV. I mean I don't know your use case, but that seems likely. You could spend a long time trying to engineer your way out of these problems, but much cheaper to just spin up an EC2 instance.

Chad Bernier
  • You are correct. We can't handle bigger files & big data on a local system. My use case is to use the local system until the big data cluster is ready for data analysis. – Srinivas K Jan 28 '20 at 06:25
  • Ok, then maybe your solution is to just down sample the data. Use like 20% of the data now so you can get started on your algorithm, and scale it up later. – Chad Bernier Jan 28 '20 at 20:37
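
A minimal sketch of that down-sampling idea, combined with chunked reading (the 20% fraction, random seed, and file names are assumptions):

import pandas as pd

src = 'p_flg_tmp1.csv'          # assumed input path
dst = 'p_flg_tmp1_sample.csv'   # assumed output path

# Stream the big CSV and keep a ~20% random sample in a smaller file
# that fits comfortably in memory for prototyping.
first = True
for chunk in pd.read_csv(src, sep='|', chunksize=100000):
    sample = chunk.sample(frac=0.2, random_state=42)
    sample.to_csv(dst, sep='|', index=False, mode='w' if first else 'a', header=first)
    first = False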