
I want to get the number of rows each time a new one is created while I load a .csv file into a dataframe:

import pandas as pd

def file_len(fname):
    # count the lines in the file by iterating over them
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

csv_path = "C:/...."
max_length = file_len(csv_path)

data = pd.read_csv(csv_path, sep=';', encoding='utf-8')

With that code I get the total number of rows, but I don't know how to get the number of rows in the dataframe each time a new one is added. I want to use these counts to build a 0-100% progress bar.


1 Answer


You can't do this - you would have to modify the read_csv function, and maybe other functions in pandas.


EDIT:

It seems it can be done now with chunksize=rows_number.

Using only iterator=True didn't work for me - or maybe it needed more rows.

Thanks to Jeff

Try this:

import pandas as pd
from io import StringIO

data = """A,B,C
foo,1,2,3
bar,4,5,6
baz,7,8,9
"""

# read one row per chunk; the first column becomes the index
reader = pd.read_csv(StringIO(data), chunksize=1)

for x in reader:
    print(x)
    print('--- next data ---')

result:

     A  B  C
foo  1  2  3
--- next data ---
     A  B  C
bar  4  5  6
--- next data ---
     A  B  C
baz  7  8  9
--- next data ---
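
To get the 0-100% progress bar asked about in the question, one approach is to count the lines up front with file_len and then update the percentage as each chunk arrives. A minimal sketch, reusing file_len and csv_path from the question (the chunk size of 1000 is an arbitrary choice):

import pandas as pd

total_rows = file_len(csv_path) - 1   # minus 1 for the header line

chunks = []
rows_read = 0
for chunk in pd.read_csv(csv_path, sep=';', encoding='utf-8', chunksize=1000):
    chunks.append(chunk)
    rows_read += len(chunk)
    print('progress: {:.0f}%'.format(100 * rows_read / total_rows))

data = pd.concat(chunks)   # same dataframe as a plain read_csv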
  • well, you sort of could, by iterating over ``read_csv(...chunksize=10)`` (and doing something in each iteration); not efficient though – Jeff Jul 14 '14 at 14:16
  • @Jeff well, I didn't try it with `chunksize=10` - only with `iterator=True`. But maybe you are right. I'll test it. – furas Jul 14 '14 at 14:20
  • those are essentially equivalent (``iterator=True`` implies ``chunksize=1``) – Jeff Jul 14 '14 at 14:24
  • well, I tried with `iterator=True` but it gave me all rows at once (see the `get_chunk` sketch after these comments). I even found an issue from 2013 about it. – furas Jul 14 '14 at 14:26
  • imho, pandas is good for exploring small (up to 16 GB, or whatever your RAM size is) samples of big data (as of now), and then when you formulate a hypothesis you run a Spark job. So if your read_csv takes more than 30 minutes you need to start thinking of doing something else – devssh Jul 04 '18 at 10:14
  • @devssh your comment has nothing to do with the original question and answer. – furas Jul 04 '18 at 12:19
  • Ok, so a bit of context on the comment. I was going over progress bars for `pandas.DataFrame.progress_apply` via `tqdm` (https://stackoverflow.com/questions/18603270/progress-indicator-during-pandas-operations-python) and I searched a lot for something like that for `read_csv`. As of now, we have to build our own progress bar in `read_csv` while iterating over chunks. And usually progress bars are meant to indicate how much time is left. The reason the OP and people seeing this post want a progress bar is most likely that the csv is too big, rather than aesthetics – devssh Jul 05 '18 at 05:47
  • @devssh if you have a question then create a new post. This is not the place for discussing different problems. Stack Overflow is not a forum. – furas Jul 05 '18 at 13:53
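
For reference, here is a small sketch of the `iterator=True` variant discussed in the comments, using the same toy data as the answer (the get_chunk sizes are arbitrary). With iterator=True alone, read_csv returns a TextFileReader and you pull rows explicitly with get_chunk instead of looping over fixed-size chunks:

import pandas as pd
from io import StringIO

data = """A,B,C
foo,1,2,3
bar,4,5,6
baz,7,8,9
"""

# iterator=True returns a TextFileReader; rows are pulled on demand
reader = pd.read_csv(StringIO(data), iterator=True)
print(reader.get_chunk(2))   # first two rows
print(reader.get_chunk(1))   # the next row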