
I have an input file like this:

a
1,100
2,200
3,300
b
1,100,200
2,200,300
3,300,400
c
...

I want to read the file into multiple data frames, with code like this (to simplify the problem, we assume the number of rows for each table is fixed):

import pandas as pd

with open("file.csv", "r") as f:
    while True:
        table_name = f.readline()
        if table_name:
            table_df = pd.read_csv(f, nrows=3)
            # Do other stuff
        else:
            break

My initial expectation was that pd.read_csv(f, nrows=3) would consume only a limited number of rows from the input stream, and that the next f.readline() call would continue from there. However, it turns out that after the first read_csv call, the stream position of f is set to the end of the file, and I can no longer read from the same stream f. My pandas version is 0.25.0. Is this a bug or expected behaviour? Is there any way to reuse the same input stream to read multiple data frames?
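
The behaviour is easy to observe by checking the stream position (a minimal repro with the file above):

import pandas as pd

with open("file.csv", "r") as f:
    f.readline()                 # consume the table name "a"
    pd.read_csv(f, nrows=3)      # parse only the first table...
    print(f.tell())              # ...yet the stream is already at end of file
    print(repr(f.readline()))    # '' -- nothing left for the next iteration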

GZ0
  • Have you tried this answer? https://stackoverflow.com/questions/23853553/python-pandas-how-to-read-only-first-n-rows-of-csv-files-in – RootTwo Aug 13 '19 at 14:02
  • Why would you want to reuse the same input stream? Is your file huge? If not, you could do `lines = f.readlines()` and simply play with `slicing` and `range(start, [stop, [step]])` – IMCoins Aug 13 '19 at 14:04
  • @RootTwo Thanks. IIUC, each `read_csv` call in that solution creates a new stream and reader, which leads to two problems: (1) it would not work if the stream can only be consumed once (e.g. `sys.stdin`); (2) even with `skiprows`, every time the reader needs to consume the stream from the beginning to count the number of rows to skip, which would lead to quite some overhead. – GZ0 Aug 13 '19 at 14:25
  • @IMCoins The input is not very small but can still fit into the memory for now. It is just that some extra manipulation is needed (as shown in an answer below) to create wrapped streams to be fed into the `pd.read_csv` method. I was trying to see if there is a better way to do that. – GZ0 Aug 13 '19 at 14:55

3 Answers


pandas.read_csv creates a file reader object from the filepath_or_buffer argument at once, and the nrows= parameter only takes a slice from that reader; it cannot re-instantiate a new reader from the same file object. From the pandas documentation:

By file-like object, we refer to objects with a read() method, such as a file handler (e.g. via builtin open function) or StringIO.

Based on your input file format, let's assume that a table name is a row containing a single string with no , separator (i.e. a, b). You can achieve the needed result by manually passing a slice of rows to read_csv:

import pandas as pd
import io
from itertools import islice

with open("file.csv", "r") as f:
    dfs = []
    while True:
        table_name = f.readline().strip()
        if table_name and ',' not in table_name:
            # pull exactly 3 data lines off the stream and parse them in isolation
            data = ''.join(islice(f, 3)).strip()
            table_df = pd.read_csv(io.StringIO(data), sep=',', header=None)
            dfs.append([table_name, table_df])
        else:
            break

# check results
for t_name, df in dfs:
    print('---', t_name)
    print(df) 

Sample output:

--- a
   0    1
0  1  100
1  2  200
2  3  300
--- b
   0    1    2
0  1  100  200
1  2  200  300
2  3  300  400
RomanPerekhrest
  • *pandas.read_csv will consume all the content from `filepath_or_buffer` argument and nrows= param just gives an ability to get a slice from entirely read data.* Is this documented? I can't seem to find it in pandas documentation. – GZ0 Aug 13 '19 at 14:58
  • @GZ0, don't worry, I've updated the explanation but the issue of initial approach is still in that `read_csv` can not re-instantiate a new reader from the same file-object – RomanPerekhrest Aug 13 '19 at 15:31
  • Is there any other convenient way to feed the `islice` iterable to `read_csv` without creating a full concatenation -- which doubles the amount of memory needed for loading a table? – GZ0 Aug 13 '19 at 22:58
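
For what it's worth, one way to avoid the full concatenation is a small file-like wrapper that feeds read_csv at most n lines lazily. This is only a sketch (the SliceReader class is hypothetical, not part of pandas; per the docs quoted above, read_csv only needs an object with a read() method), and the C parser may still buffer large chunks internally:

import pandas as pd
from itertools import islice

class SliceReader:
    """Serve at most n lines of f through read(), without joining them up front."""
    def __init__(self, f, n):
        self._it = islice(f, n)   # pulls lines from f lazily
        self._buf = ""

    def read(self, size=-1):
        # accumulate lines until the requested number of characters is available
        while size < 0 or len(self._buf) < size:
            line = next(self._it, None)
            if line is None:
                break
            self._buf += line
        if size < 0:
            out, self._buf = self._buf, ""
        else:
            out, self._buf = self._buf[:size], self._buf[size:]
        return out

# usage sketch:
with open("file.csv", "r") as f:
    table_name = f.readline().strip()
    table_df = pd.read_csv(SliceReader(f, 3), header=None)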

Use the csv module from the Python standard library and send() to tell a generator function how many rows you want:

import csv
import pandas as pd

def csvreader(filename):
    with open(filename) as csvfile:
        # skipinitialspace handles the blanks after the commas in testfile.csv
        reader = csv.DictReader(csvfile, skipinitialspace=True)

        count = yield  # wait for the first send()

        while True:
            rows = []
            for n, row in enumerate(reader):
                rows.append(row)
                if n == count - 1:  # stop once `count` rows are collected
                    break

            count = yield pd.DataFrame(rows)

testfile.csv:

i, j, k
1, 2, 4
2, 4, 8
3, 6, 12
4, 8, 16
. . . 

Set up the generator:

x = csvreader('testfile.csv')
next(x)  # prime the generator so it is ready to receive send()

Request next 2 rows:

x.send(2)
               #returned DataFrame
                    i   j   k
               0    1   2   4
               1    2   4   8

Request next 3 rows:

x.send(3)
               #returned DataFrame
                    i   j   k
               0    3   6   12
               1    4   8   16
               2    5   10  20

Note the index starts over each time. This could be fixed by specifying a column as the index (add a running counter to each row if needed):

count = yield pd.DataFrame(rows).set_index(<some column name>)
RootTwo
  • Thanks a lot for the reply. But parsing the input using the `csv` module directly would lose the rich preprocessing functionalities provided by `pd.read_csv`. I prefer a solution that can reuse `pd.read_csv` or similar pandas methods. – GZ0 Aug 13 '19 at 22:45

Don't know why I didn't think of this earlier. Setting iterator=True returns an iterator over the csv file. Then use get_chunk() to select how many rows to read:

reader = pd.read_csv(f, iterator=True)

reader.get_chunk(2)
               #returned DataFrame
                    i   j   k
               0    1   2   4
               1    2   4   8

reader.get_chunk(3)
               #returned DataFrame
                    i   j   k
               2    3   6   12
               3    4   8   16
               4    5   10  20
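
For completeness, a small loop built on the same API (assuming the testfile.csv from the previous answer); get_chunk() raises StopIteration once the reader is exhausted:

import pandas as pd

reader = pd.read_csv("testfile.csv", iterator=True)
while True:
    try:
        chunk = reader.get_chunk(2)   # next 2 rows; the index keeps counting up
    except StopIteration:             # raised when the file is exhausted
        break
    print(chunk)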
RootTwo
  • IIUC, this seems to work only if the input can be seen as a single homogeneous table. The input in my main post is heterogeneous, and there is also other information (e.g. table names) between the tables. – GZ0 Aug 13 '19 at 23:13