
I'm working with huge CSV files (20-25 million rows) and don't want to split them into smaller pieces, for a number of reasons.

My script reads the file row by row using the csv module. I need to know the position (byte offset) of the line that will be read on the next iteration (or that was just read).

I tried

>>> import csv
>>> f = open("uscompany.csv","rU")
>>> reader = csv.reader(f)
>>> reader.next()
....
>>> f.tell()
8230

But it seems the csv module reads the file in blocks, since as I keep iterating I get the same position:

>>> reader.next()
....
>>> f.tell()
8230

Any suggestions? Please advise.

Maksym Polshcha
  • I suspect this isn't possible with `csv`. Depending on the structure of your file, you might be able to just `str.split` on the delimiter and use a regular file object... – mgilson Aug 24 '12 at 12:59
  • @mgilson Unfortunately I can't use *split* since items contain standard CSV delimiters: quotes, line breaks, etc. – Maksym Polshcha Aug 24 '12 at 13:01

4 Answers


If by "byte position" you mean the byte position as if you had read the file in as a normal text file, then my suggestion is to do just that. Read in the file line by line as text, and get the position within the line that way. You can still parse the CSV data row by row yourself using the csv module:

for line in myfile:
  row = csv.reader([line]).next()

I think it is perfectly good design for the CSV reader to not provide a byte position of this kind, because it really doesn't make much sense in a CSV context. After all, "data" and data are the exact same four bytes of data as far as CSV is concerned, but the d might be the 2nd byte or the 1st byte depending on whether the optional surrounding quotes were used.
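A minimal sketch of that approach (my own code, not the answerer's; Python 3, with a made-up sample file standing in for the question's uscompany.csv). It tracks offsets by hand rather than relying on f.tell(), and only works when no field contains an embedded newline:

```python
import csv

# Hypothetical sample file standing in for the question's uscompany.csv.
with open("sample.csv", "w", newline="") as f:
    f.write("id,name\n1,Acme\n2,Bolt\n")

offsets = []  # byte offset at which each row starts
rows = []
with open("sample.csv", "rb") as f:
    pos = 0
    for line in f:
        offsets.append(pos)  # position of the row being parsed
        pos += len(line)     # track by hand instead of relying on f.tell()
        rows.append(next(csv.reader([line.decode()])))

print(offsets)  # [0, 8, 15]
```

Any of those offsets can later be passed to f.seek() to jump straight back to a row, as long as the no-embedded-newlines assumption holds.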

John Y
  • This won't work since items contain standard CSV delimiters: quotes, line breaks, etc. I need byte positions to have fast access to some data within CSV files by *f.seek* rather than reading it line by line using *csv* module. – Maksym Polshcha Aug 24 '12 at 13:23
  • My point is that you can read the file in the normal way *as a text file*. If that involves using `seek()` instead of the line iterator idiom, that's fine. In either case, you can still use the `csv` module to parse just the parts that you need to parse. – John Y Aug 24 '12 at 13:27
  • Actually I got you. The problem is that every part must be a valid CSV. I mean it can't end in the middle of a row for a column. When I posted the question I looked for a very simple solution and didn't want to code extra parsers etc. – Maksym Polshcha Aug 24 '12 at 13:31

Short answer: not possible. The byte position is not exposed through the csv reader API.


The csv module does indeed read in blocks using a read-ahead buffer as suggested in responses to this post:

file.tell() inconsistency

I had a similar need and generalized my solution for anyone else who might be doing similar things:

https://github.com/loisaidasam/csv-position-reader

Hope it helps!

  • The csv-position-reader relies on readline to initially read the file line by line and then instantiates a new csv reader to then parse each line. This will fail where the data contains line breaks. It is also a lot of overhead for large files. – dcalde Aug 20 '18 at 22:55
  • You're right, I didn't think about line breaks. I've added an issue/tests to fix that at some point: https://github.com/loisaidasam/csv-position-reader/issues/1 I don't see this implementation as "a lot of overhead" unless the "large files" you mentioned have VERY long rows. It iterates through line by line, as you said, only keeping track of the current row and the current file position. I know it's not an optimized solution, but it should work for general use where persisting file position is important. – Loisaida Sam Sandberg Aug 21 '18 at 14:40

Unfortunately, https://github.com/loisaidasam/csv-position-reader no longer works in 2022 (Python 3, etc.).

I got hit by the same issue. It turns out CSV is not that difficult to parse for this limited use case. In case more future people come to this, here is a solution (it assumes the Excel way of escaping quotes):

from typing import Iterator, Tuple

def index_csv(
    input_path: str, read_size: int = 16 * 1024, eol_char: str = "\n"
) -> Iterator[Tuple[int, int]]:
    """Yield (row_number, end_byte_offset) for each CSV row."""
    QUOTE_CHAR = ord('"')
    NEW_LINE_CHAR = ord(eol_char)
    in_quote = False
    byte_count = 0
    row_number = 1
    last_output_byte_count = 0
    with open(input_path, "rb") as csvf:
        while True:
            chunk = csvf.read(read_size)
            if not chunk:
                break
            for c in chunk:  # iterating over bytes yields ints
                byte_count += 1
                if in_quote:
                    if c == QUOTE_CHAR:
                        # Closes the field; a doubled quote ("") simply
                        # re-enters quoted state on the next byte.
                        in_quote = False
                else:
                    if c == NEW_LINE_CHAR:
                        # Newline outside quotes: a row ends here.
                        yield (row_number, byte_count)
                        last_output_byte_count = byte_count
                        row_number += 1
                    elif c == QUOTE_CHAR:
                        in_quote = True
    # The last row might not end with a newline
    if last_output_byte_count != byte_count:
        yield (row_number, byte_count)

Tested on a 10 GB CSV file with 100k rows (one column is huge). It took about 5 minutes to index.
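To show the payoff of such an index, here is a self-contained sketch (my own compact variant of the same quote-aware scan, not the answer's exact code) that records where each row *starts* and then jumps straight to a row with f.seek(), even when a field contains an embedded newline:

```python
import csv
import io
import tempfile

def row_offsets(path, quote=b'"', eol=b"\n"):
    """Return the byte offset at which each CSV row starts
    (same quote-aware scan as index_csv above, compacted)."""
    offsets = [0]
    in_quote = False
    pos = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(64 * 1024)
            if not chunk:
                break
            for byte in chunk:  # ints in Python 3
                pos += 1
                if in_quote:
                    if byte == quote[0]:
                        in_quote = False
                elif byte == quote[0]:
                    in_quote = True
                elif byte == eol[0]:  # newline outside quotes: row boundary
                    offsets.append(pos)
    return offsets

# Demo: the embedded newline in the quoted field must not split row 1.
data = 'a,"line1\nline2",c\nd,e,f\n'
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write(data)
    path = tmp.name

offsets = row_offsets(path)  # [0, 18, 24]; 24 is just past the final row
with open(path, "rb") as f:
    f.seek(offsets[1])       # jump straight to the second row
    row = next(csv.reader(io.TextIOWrapper(f, newline="")))
print(row)  # ['d', 'e', 'f']
```

The scan is byte-oriented, so the offsets are valid seek targets regardless of how long individual rows are.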

ivankr