6

I want to read only a certain number of rows, starting from a certain row in a CSV file, without iterating over the whole file to reach that point.

Let's say I have a CSV file with 100 rows and I want to read only rows 50 to 60. I don't want to iterate from row 1 to 49 to reach row 50 and start reading there. Can I somehow achieve this with seek()?

For example: seek to row 50, then read rows 50 to 60.

Next time: seek to row 27, then read rows 27 to 34, and so on.

So not only seeking continuously forward through the file, but also backwards.

Thank you a lot

Horst
  • can you use an external library? – Espoir Murhabazi May 22 '19 at 13:54
  • Since it's a csv I assume it's characters. Is each row always the same number of characters? Do you always have the same row delimiter? – Matt May 22 '19 at 13:55
  • 2
    Possible duplicate of [Reading specific lines only](https://stackoverflow.com/questions/2081836/reading-specific-lines-only) – Divyanshu Srivastava May 22 '19 at 13:55
  • One row has multiple columns with different numbers of characters, but they have the same delimiter. – Horst May 22 '19 at 14:16
  • 4
    To skip reading the whole file and read a specific line, the lines must be fixed length (see the sketch after these comments). How else would you know what line you are on without counting every newline in the file up to the line you want, and that means reading the whole file from the beginning. CSV is the wrong format for a 20 GB file. Convert it to a database. – Mark Tolonen May 22 '19 at 14:22
  • 1
    maybe i'm missing something but isn't it impossible to FIND the line you want without reading the parts of the file that come before? EDIT: unless as the above comment says, the lines are of known length ahead of time. which is not the case here. – Rick May 22 '19 at 14:23
  • @RickTeachey Or they could be indexed in some way, in which case you could use a binary search or hashmap. But at this point you're just building a database. – Swier May 22 '19 at 14:28
  • Unfortunately I can't change the format. The format is given. – Horst May 22 '19 at 14:30
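A minimal sketch of the fixed-length-record idea from the comment above, assuming every row occupies exactly ROW_LEN bytes including the newline (a hypothetical value; the rows here have variable length, so this only illustrates why fixed lengths matter):

ROW_LEN = 120                      # hypothetical fixed record length in bytes
with open('data.csv', 'rb') as f:
    f.seek(49 * ROW_LEN)           # jump straight to the start of row 50 (1-indexed)
    rows = [f.readline().decode() for _ in range(11)]   # rows 50 to 60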

5 Answers

7

An option would be to use Pandas. For example:

import pandas as pd
# Select file 
infile = r'path/file'
# Use skiprows to choose starting point and nrows to choose number of rows
data = pd.read_csv(infile, skiprows=50, nrows=10)
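Note that with the default header handling, skiprows=50 makes pandas treat the first non-skipped line as the header. A hedged variant that treats every selected line purely as data (column names then default to integers):

data = pd.read_csv(infile, skiprows=50, nrows=11, header=None)  # rows 50-60 as data only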
Dylan_w
  • 1
    I tried exactly this. Unfortunately, using skiprows, pandas somehow iterates through the CSV as well, or at least reads the whole file to build a list of indices. And that's not what I want, since the file is 20 GB. – Horst May 22 '19 at 14:02
  • 1
    check my answer @Horst – Espoir Murhabazi May 22 '19 at 14:05
  • 1
    Pandas will somehow have to track the number of lines it's already skipped. If your lines are of constant length, or if they are indexed, you could get around this. Otherwise, the only way is to count the number of newlines which requires reading the file. – Swier May 22 '19 at 14:25
  • I have a group of CSV files that always start at line 9, but the last line isn't always the same. How can I set the end row dynamically? There's not always an empty line between the last line I want and the data description I don't want. – DataGirl Nov 18 '21 at 18:41
5

You can use `chunksize`:

import pandas as pd

chunksize = 10 ** 6   # number of rows per chunk
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)    # process() stands in for whatever you do with each chunk
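If only a specific row range is wanted (say rows 50 to 60, as in the question), a rough sketch of that idea is to walk the chunks and keep only the slice that overlaps the wanted range; pandas still reads past the earlier chunks, but it never holds more than one chunk in memory (the row numbers and chunk size below are placeholders):

import pandas as pd

start, stop = 50, 61               # wanted row range (stop exclusive), counted over data rows
chunksize = 10
wanted = []
for i, chunk in enumerate(pd.read_csv(filename, chunksize=chunksize)):
    first = i * chunksize          # global index of this chunk's first row
    if first >= stop:
        break                      # past the wanted range: stop reading the file
    if first + len(chunk) > start:
        wanted.append(chunk.iloc[max(start - first, 0):stop - first])
data = pd.concat(wanted)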
Demont Zhang
  • 1
    Then you still read the whole file. For 100 lines not a problem but when you read files up to 10000 lines with multiple columns ... – Dylan_w May 22 '19 at 14:01
  • 1
  • Normally the size of the file isn't a problem; it depends on your computer's memory. What error do you get when you run the code, and what is the file size? – Demont Zhang May 22 '19 at 14:07
  • 1
    Pandas creates a dataframe for the range I have specified. When I want to read from another range not in this dataframe, I have to create a new dataframe. That's not optimal. Also, pandas only allows reading forward to get the next chunk, not going backwards. For example, first I read rows 70 to 80, then next time rows 30 to 40. That's not possible with pandas. – Horst May 22 '19 at 14:07
  • 1
  • Since the file is 20 GB, the program just crashes – Horst May 22 '19 at 14:09
4

If the number of columns/line lengths is variable, it isn't possible to find the line you want without "reading" (i.e., processing) every character of the file that comes before it and counting the line terminators. And the fastest way to process them in Python is to use iteration.

As to the fastest way to do that with a large file, I do not know whether it is faster to iterate by line this way:

with open(file_name) as f:
    # burn through the first 50 lines
    for line, _ in zip(f, range(50)):
        pass
    # then collect the next 10
    lines = [line for line, _ in zip(f, range(10))]

...or to read a character at a time using seek, and count new line characters. But it is certainly MUCH more convenient to do the first.

However if the file gets read a lot, iterating over the lines will be slow over time. If the file contents do not change, you could instead accomplish this by reading the whole thing once and building a dict of the line lengths ahead of time:

from itertools import accumulate
# Binary mode so len(line) counts bytes (which is what seek() expects); cum_lens[n]
# is the byte offset at which line n starts (1-indexed; initial= needs Python 3.8+).
with open(file_name, 'rb') as f:
    cum_lens = dict(enumerate(accumulate((len(line) for line in f), initial=0), 1))

This would allow you to seek to any line number in the file without processing the whole thing ever again:

def seek_line(path, line_num, cum_lens):
    with open(path, 'rb') as f:          # binary mode to match the byte offsets above
        f.seek(cum_lens[line_num], 0)    # jump straight to the start of that line
        return f.readline().decode()

class LineX:
    """A file reading object that can quickly obtain any line number."""
    def __init__(self, path, cum_lens):
        self.cum_lens = cum_lens
        self.path = path
    def __getitem__(self, i):
        return seek_line(self.path, i, self.cum_lens)

linex = LineX(file_name, cum_lens)
line50 = linex[50]

But at this point, you might be better off loading the file contents into some kind of database. It depends on what you're trying to do, and what kind of data the file contains.
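A rough sketch of that database idea using only the standard library (the database file and table name here are made up): load the CSV into SQLite once, after which any row range can be fetched by rowid without rescanning the file. It assumes the first line is a header and every row has the same number of fields:

import csv
import sqlite3

conn = sqlite3.connect('rows.db')          # hypothetical database file
with open(file_name, newline='') as f:
    reader = csv.reader(f)
    header = next(reader)                  # assumes the first line names the columns
    cols = ', '.join(f'"{c}"' for c in header)
    marks = ', '.join(['?'] * len(header))
    conn.execute(f'CREATE TABLE IF NOT EXISTS rows ({cols})')
    conn.executemany(f'INSERT INTO rows VALUES ({marks})', reader)
    conn.commit()

# rowid follows insertion order (while nothing is deleted), so rows 50-60 are:
rows = conn.execute('SELECT * FROM rows WHERE rowid BETWEEN 50 AND 60').fetchall()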

Rick
3

As others are saying, the most obvious solution is to use pandas read_csv! The method has a parameter called skiprows:

From the docs, this is what is said:

skiprows : list-like, int or callable, optional Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].

You can have something like this:

import pandas as pd
data = pd.read_csv('path/to/your/file', skiprows=lambda x: x not in range(50, 60))
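A variant sketch (the row numbers are placeholders) combines the callable with nrows, so pandas stops parsing once it has collected the wanted rows; it still has to scan the earlier lines to count them, but it does not parse the rest of a 20 GB file:

import pandas as pd

data = pd.read_csv(
    'path/to/your/file',
    skiprows=lambda x: x < 50,   # skip every row index before 50 (0-indexed)
    nrows=11,                    # then read rows 50-60 inclusive
    header=None,                 # treat the selected lines as data, not as a header
)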

Since you specify that memory is your problem, you can use the chunksize parameter, as described in this tutorial

The tutorial says:

The parameter essentially means the number of rows to be read into a dataframe at any single time in order to fit into the local memory. Since the data consists of more than 70 millions of rows, I specified the chunksize as 1 million rows each time that broke the large data set into many smaller pieces.

df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)

You can try this and iterate over the chunks to retrieve only the rows you are looking for.

Note that the skiprows callable should return True for the row numbers you want to skip, i.e. those not in the specified range.

Espoir Murhabazi
  • 1
    I tried exactly this. Unfortunately, using skiprows, pandas somehow iterates through the CSV as well, or at least reads the whole file to build a list of indices. And that's not what I want, since the file is 20 GB. – Horst May 22 '19 at 14:05
  • 1
    Shouldn't your function be `lambda x: x not in range(50,60)` if @Horst only wants rows 50-60? – m13op22 May 22 '19 at 14:19
  • 1
    Specifying chunksize would generally work, but it does not let me go back and forth, only forward to get the next chunk. But in the application I have to plot the data, and the user should be able to display the next chunk and also the previous chunk, basically from the beginning to the end and from the end back to the beginning. The user should also be able to define how big a chunk they want to see at once. – Horst May 22 '19 at 14:26
1

It's that easy:

with open("file.csv", "r") as file:
    print(file.readlines()[50:60])
  • 3
    No it isn't, as it reads the whole file first, and only after it has read the whole file does it return lines [50:60]. And that's what I don't want with a 20 GB file. – Horst Jun 03 '19 at 16:46
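As the comment points out, readlines() loads the entire file into memory. A lighter sketch with itertools.islice still scans the lines before row 50, but only keeps the eleven wanted lines:

from itertools import islice

with open("file.csv", "r") as file:
    rows = list(islice(file, 49, 60))   # lines 50 to 60, counting from 1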