
I am trying to use multiprocessing to read a CSV file faster than a plain read_csv call.

tp = pd.read_csv('review-1m.csv', chunksize=10000)

But what I get back is not a DataFrame; tp is of type pandas.io.parsers.TextFileReader. So I try

df = pd.concat(tp, ignore_index=True)

to convert it into a DataFrame. But the concatenation takes so long that the total time is not much better than calling read_csv directly. Does anyone know how to make this conversion faster?
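
For context, a rough sketch of what I mean by "using multiprocessing" might look like the following (the chunk size, row count, and the `read_slice` helper are illustrative assumptions, not my exact code):

```python
import pandas as pd
from multiprocessing import Pool

PATH = 'review-1m.csv'
CHUNK_ROWS = 250_000    # rows per worker (assumed value)
TOTAL_ROWS = 1_000_000  # approximate row count of the file (assumed value)

def read_slice(start):
    # Keep the header row, skip the data rows before `start`,
    # then read one slice of CHUNK_ROWS rows.
    return pd.read_csv(PATH, skiprows=range(1, start + 1), nrows=CHUNK_ROWS)

if __name__ == '__main__':
    with Pool() as pool:
        parts = pool.map(read_slice, range(0, TOTAL_ROWS, CHUNK_ROWS))
    df = pd.concat(parts, ignore_index=True)
    print(df.shape)
```

Even split this way, each worker still has to scan past the rows it skips, so the overall gain is limited.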

huier
  • Just to be thorough, returning the TextFileReader is the [expected behavior](http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking) when chunking. Using `tp.read()` seems to be a way to read all the data at once as per [this SO question](https://stackoverflow.com/questions/33642951/python-using-pandas-structures-with-large-csviterate-and-chunksize). – Dan Dec 12 '17 at 18:06
  • ... if you want to materialize the entire CSV into a DataFrame, why do you use `chunksize`...? I don't get what you are trying to do here. What do you mean you "use pool"? – juanpa.arrivillaga Dec 12 '17 at 18:42
  • @juanpa.arrivillaga Sorry, when I say "use pool" I mean using multiprocessing. We use multiprocessing to make reading the CSV file faster than plain read_csv. – huier Dec 12 '17 at 21:51
  • @Dan But I already tried `df = pd.concat(tp, ignore_index=True)`, and that takes over 10 seconds for a CSV file with over 1 million rows. – huier Dec 12 '17 at 21:59
  • Then you want to rethink your algorithm as @juanpa.arrivillaga says. Why read the csv in chunks just to concat again into a DataFrame? That negates the benefits of chunking. Think about your algorithm on the DataFrame and rewrite it to run on each chunk instead. This is similar to what [Blaze](http://blaze.readthedocs.io/en/latest/ooc.html) tries to do for out-of-core processing. Process the chunks in memory, store them if necessary, and in the final step you can pull the results together into one DataFrame or file (see the sketch after these comments). – Dan Dec 12 '17 at 22:12
  • @Dan Thank you! I would reconsider the algorithm in the code. – huier Dec 12 '17 at 22:16
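
A minimal sketch of the per-chunk approach Dan describes above, assuming a hypothetical aggregation (the `stars` column and the `value_counts` step are placeholders for whatever the real algorithm needs to do per chunk):

```python
import pandas as pd

# Process each chunk as it is read instead of concatenating raw chunks.
reader = pd.read_csv('review-1m.csv', chunksize=10000)

partial = []
for chunk in reader:
    # Per-chunk work goes here; this aggregation is only an example.
    partial.append(chunk['stars'].value_counts())

# Combine the small per-chunk results at the end, not the raw data.
result = pd.concat(partial).groupby(level=0).sum()
print(result)
```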

1 Answer


pd.read_csv() is likely going to give you about the same read time as any other way of parsing the CSV. If you want a real performance increase, you should change the format in which you store your file.

http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations
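
A minimal sketch of that idea using HDF5, one of the formats compared in the linked section (this assumes the PyTables package is installed; the key name 'reviews' is arbitrary):

```python
import pandas as pd

# One-time conversion: pay the CSV parsing cost once and keep a binary copy.
df = pd.read_csv('review-1m.csv')
df.to_hdf('review-1m.h5', key='reviews', mode='w')

# Later runs reload from HDF5 and skip text parsing entirely.
df = pd.read_hdf('review-1m.h5', key='reviews')
```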

Gabriel A
  • Thanks! But this is a school project and using CSV as the file format is one of the requirements, so I don't think I can use other file formats. – huier Dec 12 '17 at 22:14
  • What sort of operation are you trying to do on the data? – Gabriel A Dec 12 '17 at 23:03
  • We apply SQL commands to those CSV files, so we do operations like SELECT, JOIN, and WHERE on them (one way to set that up is sketched below). – huier Dec 13 '17 at 01:28
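
One way to set that up, sketched here with hypothetical file, table, and column names (users.csv, user_id, stars, and name are placeholders, since the real schema isn't shown), is to load each CSV once into an in-memory SQLite database and run the SELECT/JOIN/WHERE queries there:

```python
import sqlite3
import pandas as pd

# Load each CSV once into an in-memory SQLite database, then query with SQL.
conn = sqlite3.connect(':memory:')
pd.read_csv('review-1m.csv').to_sql('reviews', conn, index=False)
pd.read_csv('users.csv').to_sql('users', conn, index=False)

query = """
    SELECT u.name, r.stars
    FROM reviews AS r
    JOIN users AS u ON u.user_id = r.user_id
    WHERE r.stars >= 4
"""
print(pd.read_sql_query(query, conn).head())
```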