
I'm trying to create a progress bar for reading Excel data into pandas using tqdm. I can do this easily with a CSV using the chunksize argument, like so:

import pandas as pd
from tqdm import tqdm

data_reader = pd.read_csv(path,
                          chunksize=1000)

df_list = []
for chunk in tqdm(data_reader,
                  total=200):
    df_list.append(chunk)

This updates the progress bar for every chunk of 1000 rows, out of 200 total chunks. pd.read_excel, however, no longer has a chunksize argument. Is there an alternative?

Edit: I've read the question about reading an Excel file in chunks (Reading a portion of a large xlsx file with python); however, read_excel no longer has a chunksize argument, and pd.ExcelFile.parse is equivalent. I am wondering if there is an alternative to the chunksize argument, or another way to create an iterable to loop over chunks while they are being read in.

  • This is a duplicate, see [this question](https://stackoverflow.com/questions/38623368/reading-a-portion-of-a-large-xlsx-file-with-python/38623545) for an answer. – VinceP Jun 26 '17 at 17:11
  • @VinceP Not a duplicate, please read the edit. Those answers do not work, as there is no chunksize argument anymore. I am looking for alternatives. – jsxgd Jun 26 '17 at 17:26
  • I've never seen an Excel file larger than memory. Out of curiosity, what's the use-case? – Andrew L Jun 26 '17 at 20:35
  • @AndrewL My code is being used to automate some basic recurring ETL functions for work. It pulls in a book of medical claims that need to be worked, as well as some other data to filter, analyze, segment, etc. before the claims get touched by a human being. The Excel file has 105 columns and is about 2000 rows, so it takes about a full minute to read in. There's no option to convert to CSV to make it faster. I want to turn the import process into an iterable so I can show the user a progress bar with tqdm. read_csv lets me do this (as stated in the question) but read_excel does not. – jsxgd Jun 27 '17 at 02:11
  • More of a UX comfort than a necessity; if I were just pulling the data in for an analysis it would be one thing, but my non-technical colleagues are more annoyed by the wait than I am, particularly because this is a twice-weekly process for them. – jsxgd Jun 27 '17 at 02:15
  • I could not find an easy solution. [This](https://openpyxl.readthedocs.io/en/default/optimized.html) might help you (it uses `openpyxl`), but I am guessing the solution is not going to be as pythonic as having a `chunksize` argument. – VinceP Jun 28 '17 at 09:02

3 Answers


If you want to add a progress indicator, you could use the .tell() method of file objects. That's of course not quite accurate, but maybe it gives your users enough accuracy to estimate how long a coffee break they can take :-)

So here is the plan: basically, open your Excel file with open and pass the resulting file object to pd.read_excel. According to the docs this should be possible, and I just verified it with a simple example for an xlsx file.

At the beginning you evaluate the size of the file, e.g.:

import io

fp = open(path, 'rb')        # path is the location of your xlsx file
fp.seek(0, io.SEEK_END)      # set the file cursor to the end of the file
fp_len = fp.tell()           # the cursor position is now the file size in bytes
fp.seek(0, io.SEEK_SET)      # set the file cursor back to the beginning

With this setup, you have two possibilities:

  1. Either you create a thread that updates the progress bar from time to time by calling fp.tell() on the file object you opened for the xlsx file (see the sketch right after this list), or
  2. create your own wrapper that provides the methods pandas needs to read the data (at least a read method) and updates the progress bar synchronously, so you don't need an extra thread. Your class would just need to pass the method calls through to the actual file object; in that sense you could compare it with a proxy object. A sketch of this follows further below.
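
Here is a minimal sketch of option 1. It is not from the original answer: the variable path, the 0.2 s polling interval, and the bar settings are illustrative assumptions.

import io
import threading
import time

import pandas as pd
from tqdm import tqdm

fp = open(path, 'rb')            # path: hypothetical location of the xlsx file
fp.seek(0, io.SEEK_END)
fp_len = fp.tell()               # file size in bytes
fp.seek(0, io.SEEK_SET)

bar = tqdm(total=fp_len, unit='B', unit_scale=True)

def poll_progress():
    # Poll the file cursor until the file is closed.
    while not fp.closed:
        try:
            bar.n = fp.tell()    # jump the bar to the current cursor position
            bar.refresh()
        except ValueError:       # file was closed between the check and tell()
            break
        time.sleep(0.2)

t = threading.Thread(target=poll_progress, daemon=True)
t.start()

df = pd.read_excel(fp)           # pandas accepts an open file object here
fp.close()                       # this also ends the polling loop
t.join()

bar.n = fp_len                   # snap the bar to 100 % once the read is done
bar.refresh()
bar.close()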

I have to admit that option 2 is kind of dirty. But I'm convinced that both methods would work, because I just verified that pd.read_excel really can read from a file object (io.BufferedReader), including xlsx files, which are afaik zipped files. The method would just not be very accurate, because the file pointer might not move linearly with time, depending on things like fluctuations in the compression rate (some parts of the file might be compressible at a higher rate than others).
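
For option 2, a minimal sketch of such a proxy could look like the following; the class name and the path variable are hypothetical, not part of the original answer. It tracks the file cursor instead of summing the bytes returned, because the reader may seek around inside the zipped file, and depending on how pandas reads the stream the bar may advance in large jumps:

import io

import pandas as pd
from tqdm import tqdm

class TqdmFileProxy:
    """Proxy around a file object that updates a tqdm bar on every read()."""

    def __init__(self, fp, total):
        self._fp = fp
        self._bar = tqdm(total=total, unit='B', unit_scale=True)

    def read(self, size=-1):
        data = self._fp.read(size)
        self._bar.n = self._fp.tell()   # the cursor may jump around in a zip
        self._bar.refresh()
        return data

    def __getattr__(self, name):
        # Delegate everything else (seek, tell, close, ...) to the real file.
        return getattr(self._fp, name)

fp = open(path, 'rb')                   # path: hypothetical xlsx location
fp.seek(0, io.SEEK_END)
fp_len = fp.tell()
fp.seek(0, io.SEEK_SET)

df = pd.read_excel(TqdmFileProxy(fp, fp_len))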

jottbe

The best you can do is use pandas.read_excel with the skiprows (skips rows from the top of the file) and skip_footer (skips rows from the bottom; spelled skipfooter in newer pandas versions) arguments. This, however, will load the whole file into memory first and then parse only the required rows.
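
A hedged sketch of how that could be combined with tqdm, assuming hypothetical path, chunk, and n_rows variables and a pandas version where the argument is spelled skipfooter:

import pandas as pd
from tqdm import tqdm

chunk = 500      # hypothetical chunk size in rows
n_rows = 2000    # total number of data rows, known or counted beforehand

parts = []
for start in tqdm(range(0, n_rows, chunk)):
    stop = min(start + chunk, n_rows)
    part = pd.read_excel(path,
                         skiprows=range(1, start + 1),  # skip data rows already read, keep the header
                         skipfooter=n_rows - stop)      # skip everything after this chunk
    parts.append(part)

df = pd.concat(parts, ignore_index=True)

Note that every iteration parses the whole workbook again, so the total runtime goes up; the only gain is the progress display.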

sara

That parameter was there, but it never did anything, so it was removed. See this issue on GitHub.

You need to take a different approach to do that, as the others have pointed out.

Qusai Alothman