7

I am working with a large number of CSV files, each of which contain a large amount of rows. My goal is to take the data line by line and write it to a database using Python. However because there is a large amount of data I would like tot keep track of how much data has been written. For this I have counted the amount of files being queued and keep on adding one every time a file is complete.

I would like to do something similar for the CSV files and show what row I am on, and how many rows there are in total (for example: Currently on row 1 of X). I can easily get he current row by starting at one and then doing something like: currentRow += 1, however I am unsure how to get the total with out going though the time consuming process of reading line.

Additionally because my CSV files are all stored in zip archives I am currently reading them using the ZipFile module like this:

#The Zip archive and the csv files share the same name
with zipArchive.open(fileName[:-4] + '.csv', 'r') as csvFile:
    lines = (line.decode('ascii') for line in csvFile)
    currentRow = 1

    for row in csv.reader(lines):
        print(row)
        currentRow += 1

Any ideas on how I can quickly get a total row count of a CSV file?

ng150716
  • 2,195
  • 5
  • 40
  • 61
  • Possible duplicate of [Row count in a csv file](http://stackoverflow.com/questions/27504056/row-count-in-a-csv-file) – Rohan Khude Aug 18 '16 at 18:41
  • 1
    Are you trying to avoid reading the file? If not you can use `lines = len(open(csv).readlines())` – double_j Aug 18 '16 at 18:42
  • @double_j that is correct, I am trying to avoid looping through every line in order to figure out the total row count. However because I am reading from CSV files stored in zip archive your solution may not be applicable. I have updated my code to explain my process. – ng150716 Aug 18 '16 at 19:14
  • 2
    That's literally the only (reasonable) way you can get a count of the number of rows in a csv file - read it with a csv reader. – Wayne Werner Aug 18 '16 at 19:19
  • I think you should check out [getting line count cheaply in Python](http://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python). This has a lot of relevant good answers. – Sahil M Aug 18 '16 at 19:29

3 Answers3

6

You can't count the lines in a file without opening it and counting the lines.

If your files are so large that counting lines with row_count = sum(1 for row in file_handle) is not practical, and reading the whole file into memory is a non-starter, a different approach may be needed.

It is quite easy to get the length of a file in bytes (How to check file size in python?). If you then count the length (in bytes) of each line as you read it, you can then report "Currently on byte 13927 of 4972397 (2.8%)"

For files stored in zip, Zipfile.getinfo(name).file_size is the size of the uncompressed file.

Community
  • 1
  • 1
James K
  • 3,692
  • 1
  • 28
  • 36
  • 2
    Technically, for a csv file this could produce a wrong error, if there's a newline inside a quoted string. – Wayne Werner Aug 18 '16 at 19:20
  • @WayneWerner It could, however there is no way to get the number of csv lines in a file without reading it. This is an attempt at a next best thing. The error would probably not be critical – James K Aug 18 '16 at 19:23
  • It's probably not much of a win. I'd be curious to see what the timeit difference is for small and large files - having that comparison would make this answer more useful – Wayne Werner Aug 18 '16 at 19:26
5

If you just want to show some progress, you could try using tqdm.

from tqdm import tqdm

with zipArchive.open(fileName[:-4] + '.csv', 'r') as csvFile:
    lines = [line.decode('ascii') for line in csvFile]
    currentRow = 1

    for row in tqdm(csv.reader(lines), total=len(lines)):
        print(row)
        currentRow += 1

This should give you a sleek progress bar with virtually no effort on your part.

Lily Mara
  • 3,859
  • 4
  • 29
  • 48
  • Wow, this is really simple. However when I try to use it the way you mentioned above (`for row in tqdm(csv.reader(lines))`) it does not show a bar but rather a series of numbers going up: like this: `3589382it [00:20, 171812.07it/s]`. I think it is because it is using the variable `lines` which is a generator, and I doubt it can do something like `len(lines`, any ideas on how to solve this? – ng150716 Aug 18 '16 at 20:27
  • @ng150716 It looks like tqdm will not give you a progress bar unless it know the total length. For that reason, we need to store the lines as a list and get its length. There may be a more optimal way to do this, but if you want to know the total length of a generator, you _must_ run all the way through it. – Lily Mara Aug 18 '16 at 20:39
  • Ok, so I decided to rewrite a bit of my program and instead of reading the csv directly from the zip archive. I extract it first then, read it which allows me to create a list from the reader, which creates a count. – ng150716 Aug 18 '16 at 22:25
  • The more optimal way is to just walk through your whole CSV file line by line first (you can still show tqdm but without a total, just to show how the preprocessing is progressing), and then with that total you can now walk through your CSV file again to insert each line into your database and with a tqdm bar with a correct total (thus showing a real progress bar). – gaborous Sep 02 '16 at 18:59
3

Following works for me:

  • install tqdm: pip install tqdm
  • code to read a csv and show progress:
    with open('./data.csv') as csv_file:
        lines = len(csv_file.readlines())

    with open('./data.csv') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        header = next(csv_reader)

        for row in tqdm(csv_reader, total=lines):
            print(row)
Manuel Schmitzberger
  • 5,162
  • 3
  • 36
  • 45