I'm trying to import a large (approximately 4 GB) CSV dataset into Python using the pandas library. The dataset cannot fit into memory all at once, so I read the CSV in chunks of size 10000. After that I want to concatenate the chunks into a single DataFrame in order to perform some calculations, but I ran out of memory (I use a desktop with 16 GB of RAM).

My code so far:

import pandas as pd

# Reading the csv in chunks
chunks = pd.read_csv("path_to_csv", iterator=True, chunksize=10000)

# Attempt 1: concat the chunks
df = pd.concat([chunk for chunk in chunks])

# Attempt 2 (alternative): concat with a fresh index
df = pd.concat(chunks, ignore_index=True)

I have searched many threads on Stack Overflow and all of them suggest one of these solutions. Is there a way to overcome this? I can't believe I can't handle a 4 GB dataset with 16 GB of RAM!

UPDATE: I still haven't found a way to import the CSV file directly. I bypassed the problem by importing the data into a PostgreSQL database and then querying it.
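For reference, a minimal sketch of that workaround, assuming SQLAlchemy and a local PostgreSQL instance; the connection string, table name, and column names below are hypothetical placeholders, not details from the question:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string -- adjust to your own database
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# Load the CSV into PostgreSQL chunk by chunk so nothing large sits in memory
for chunk in pd.read_csv("path_to_csv", chunksize=10000):
    chunk.to_sql("my_table", engine, if_exists="append", index=False)

# Later, query back only what a calculation actually needs
result = pd.read_sql("SELECT col_a, AVG(col_b) AS avg_b FROM my_table GROUP BY col_a", engine)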

  • Why can't you fill the DataFrame in one go? – Elmex80s May 10 '17 at 11:54
  • Assuming memory problems, have you tried Dask or Apache Spark? – OneCricketeer May 10 '17 at 11:54
  • Not sure how to do this... Can you provide an example? – Mewtwo May 10 '17 at 11:54
  • @cricket_007 No I haven't. I thought 4 GB wasn't even close to "Big Data" – Mewtwo May 10 '17 at 11:56
  • It isn't, but you could at least distribute the processing rather than needing to feed the entire file into memory before doing anything – OneCricketeer May 10 '17 at 12:00
  • @cricket_007 Well, if there is no other solution I guess that's indeed an option. But for the time being I would prefer to exhaust all other solutions available – Mewtwo May 10 '17 at 13:00
  • Can you show the traceback? Is that all the code? – OneCricketeer May 10 '17 at 13:06
  • Possible duplicate of [Read large dataset Pandas](https://stackoverflow.com/questions/46833277/read-large-dataset-pandas/46834540#46834540) – Linford Bacon Oct 19 '17 at 16:25
  • `pd.read_csv` should be able to handle this natively; you should focus on reducing memory usage via the parameters `usecols`, `dtype`, and converters for dates/times and other formatted fields. (@LinfordBacon: that's not a good dupe target, it's closed and it has one answer recommending a workaround using sqlite3, which should be a last resort.) – smci Mar 10 '19 at 06:27
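Following up on the last comment, a minimal sketch of trimming memory usage at read time; the column names and dtypes are made-up placeholders, not taken from the question:

import pandas as pd

# Load only the columns that are actually needed, with compact dtypes
df = pd.read_csv(
    "path_to_csv",
    usecols=["id", "value", "timestamp"],       # hypothetical column names
    dtype={"id": "int32", "value": "float32"},  # smaller than the default 64-bit types
    parse_dates=["timestamp"],
)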

1 Answer

I once dealt with this kind of situation using a generator in Python. I hope this will be helpful:

def read_big_file_in_chunks(file_object, chunk_size=1024):
    """Yield a big file lazily, one chunk at a time."""
    while True:
        data = file_object.read(chunk_size)
        if not data:  # end of file
            break
        yield data


# Process the file piece by piece instead of loading it all at once
with open('very_very_big_file.log') as f:
    for chunk in read_big_file_in_chunks(f):
        process_data(chunk)
Sijan Bhandari
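In the same spirit, a minimal sketch of processing the pandas chunks one at a time instead of concatenating them; the aggregated column name "value" is a hypothetical placeholder, not from the question:

import pandas as pd

# Hypothetical running aggregation over a column named "value"
total = 0.0
count = 0
for chunk in pd.read_csv("path_to_csv", chunksize=10000):
    total += chunk["value"].sum()
    count += len(chunk)

print("mean of 'value':", total / count)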