
I am trying to use multiprocessing to read a CSV file faster than a plain read_csv call.

tp = pd.read_csv('review-1m.csv', chunksize=10000)

But what I get back is not a DataFrame; tp is of type pandas.io.parsers.TextFileReader. So I try

df = pd.concat(tp, ignore_index=True)

to convert it into a DataFrame. But the concatenation takes so long that the total time is not much better than calling read_csv directly. Does anyone know how to make this conversion faster?
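
For context, a rough sketch of what I mean by "using multiprocessing" might look like the following (the chunk size, row count, and the `read_slice` helper are illustrative assumptions, not my exact code):

```python
import pandas as pd
from multiprocessing import Pool

PATH = 'review-1m.csv'
CHUNK_ROWS = 250_000    # rows per worker (assumed value)
TOTAL_ROWS = 1_000_000  # approximate row count of the file (assumed value)

def read_slice(start):
    # Keep the header row, skip the data rows before `start`,
    # then read one slice of CHUNK_ROWS rows.
    return pd.read_csv(PATH, skiprows=range(1, start + 1), nrows=CHUNK_ROWS)

if __name__ == '__main__':
    with Pool() as pool:
        parts = pool.map(read_slice, range(0, TOTAL_ROWS, CHUNK_ROWS))
    df = pd.concat(parts, ignore_index=True)
    print(df.shape)
```

Even split this way, each worker still has to scan past the rows it skips, so the overall gain is limited.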

huier
  • Just to be thorough, returning the TextFileReader is the [expected behavior](http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking) when chunking. Using `tp.read()` seems to be a way to read all the data at once as per [this SO question](https://stackoverflow.com/questions/33642951/python-using-pandas-structures-with-large-csviterate-and-chunksize). – Dan Dec 12 '17 at 18:06
  • ... if you want to materialize the entire CSV into a DataFrame, why do you use `chunksize`...? I don't get what you are trying to do here. What do you mean you "use pool"? – juanpa.arrivillaga Dec 12 '17 at 18:42
  • @juanpa.arrivillaga Sorry, when I say "use pool" I mean using multiprocessing. We use multiprocessing to make reading the CSV file faster than plain read_csv. – huier Dec 12 '17 at 21:51
  • @Dan But I already tried `df = pd.concat(tp, ignore_index=True)`, and that takes over 10 seconds for a CSV file with over 1 million rows. – huier Dec 12 '17 at 21:59
  • Then you want to rethink your algorithm as @juanpa.arrivillaga says. Why read the csv in chunks just to concat again into a DataFrame? That negates the benefits of chunking. Think about your algorithm on the DataFrame and rewrite it to run on each chunk instead. This is similar to what [Blaze](http://blaze.readthedocs.io/en/latest/ooc.html) tries to do for out-of-core processing. Process the chunks in memory, store them if necessary, and in the final step you can pull the results together into one DataFrame or file (see the sketch after these comments). – Dan Dec 12 '17 at 22:12
  • @Dan Thank you! I would reconsider the algorithm in the code. – huier Dec 12 '17 at 22:16
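
A minimal sketch of the per-chunk approach Dan describes above, assuming a hypothetical aggregation (the `stars` column and the `value_counts` step are placeholders for whatever the real algorithm needs to do per chunk):

```python
import pandas as pd

# Process each chunk as it is read instead of concatenating raw chunks.
reader = pd.read_csv('review-1m.csv', chunksize=10000)

partial = []
for chunk in reader:
    # Per-chunk work goes here; this aggregation is only an example.
    partial.append(chunk['stars'].value_counts())

# Combine the small per-chunk results at the end, not the raw data.
result = pd.concat(partial).groupby(level=0).sum()
print(result)
```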

1 Answer


pd.read_csv() is likely going to give you about the same read time as any other way of parsing the CSV. If you want a real performance increase, you should change the format in which you store your file.

http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations
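
A minimal sketch of that idea using HDF5, one of the formats compared in the linked section (this assumes the PyTables package is installed; the key name 'reviews' is arbitrary):

```python
import pandas as pd

# One-time conversion: pay the CSV parsing cost once and keep a binary copy.
df = pd.read_csv('review-1m.csv')
df.to_hdf('review-1m.h5', key='reviews', mode='w')

# Later runs reload from HDF5 and skip text parsing entirely.
df = pd.read_hdf('review-1m.h5', key='reviews')
```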

Gabriel A
  • Thanks! But this is a school project and using CSV as the file format is one of the requirements, so I don't think I can use other file formats. – huier Dec 12 '17 at 22:14
  • What sort of operation are you trying to do on the data? – Gabriel A Dec 12 '17 at 23:03
  • We apply SQL commands to those CSV files, so we do operations like SELECT, JOIN, and WHERE on them (one way to set that up is sketched below). – huier Dec 13 '17 at 01:28
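
One way to set that up, sketched here with hypothetical file, table, and column names (users.csv, user_id, stars, and name are placeholders, since the real schema isn't shown), is to load each CSV once into an in-memory SQLite database and run the SELECT/JOIN/WHERE queries there:

```python
import sqlite3
import pandas as pd

# Load each CSV once into an in-memory SQLite database, then query with SQL.
conn = sqlite3.connect(':memory:')
pd.read_csv('review-1m.csv').to_sql('reviews', conn, index=False)
pd.read_csv('users.csv').to_sql('users', conn, index=False)

query = """
    SELECT u.name, r.stars
    FROM reviews AS r
    JOIN users AS u ON u.user_id = r.user_id
    WHERE r.stars >= 4
"""
print(pd.read_sql_query(query, conn).head())
```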