
I was handling a large CSV file and came across this problem: I am reading the CSV file in chunks and want to extract sub-dataframes based on the values of a particular column.

To explain the problem, here is a minimal version:

The CSV (save it as test1.csv, for example):

1,10
1,11
1,12
2,13
2,14
2,15
2,16
3,17
3,18
3,19
3,20
4,21
4,22
4,23
4,24

Now, as you can see, if I read the CSV in chunks of 5 rows, the first column's values will be split across chunks. What I want is to load into memory only the rows for a particular value.

I achieved it using the following:

import pandas as pd

list_of_ids = dict()  # this will contain all "id"s and the start and end row index for each id

# read the csv in chunks of 5 rows
for df_chunk in pd.read_csv('test1.csv', chunksize=5, names=['id','val'], iterator=True):
    #print(df_chunk)

    # In each chunk, get the unique id values and add to the list
    for i in df_chunk['id'].unique().tolist():
        if i not in list_of_ids:
            list_of_ids[i] = []  # initially new values do not have the start and end row index

    for i in list_of_ids.keys():        # ---------MARKER 1-----------
        idx = df_chunk[df_chunk['id'] == i].index    # get row index for particular value of id
        
        if len(idx) != 0:     # if id is in this chunk
            if len(list_of_ids[i]) == 0:      # if the id is new in the final dictionary
                list_of_ids[i].append(idx.tolist()[0])     # start
                list_of_ids[i].append(idx.tolist()[-1])    # end
            else:                             # if the id appeared in a previous chunk
                list_of_ids[i] = [list_of_ids[i][0], idx.tolist()[-1]]    # keep old start, add new end
            
            #print(df_chunk.iloc[idx, :])
            #print(df_chunk.iloc[list_of_ids[i][0]:list_of_ids[i][-1], :])

print(list_of_ids)

skip = None
rows = None

# Now from the file, I will read only particular id group using following
#      I can again use chunksize argument to read the particular group in pieces
for id_, se in list_of_ids.items():   # id_ avoids shadowing the built-in id()
    print('Data for id: {}'.format(id_))
    skip, rows = se[0], (se[-1] - se[0] + 1)
    for df_chunk in pd.read_csv('test1.csv', chunksize=2, nrows=rows, skiprows=skip, names=['id','val'], iterator=True):
        print(df_chunk)

Truncated output from my code:

{1: [0, 2], 2: [3, 6], 3: [7, 10], 4: [11, 14]}
Data for id: 1
   id  val
0   1   10
1   1   11
   id  val
2   1   12
Data for id: 2
   id  val
0   2   13
1   2   14
   id  val
2   2   15
3   2   16
Data for id: 3
   id  val
0   3   17
1   3   18

What I want to ask is: do we have a better way of doing this? If you consider MARKER 1 in the code, it is bound to become inefficient as the data grows, because every chunk is checked against every id seen so far. I did save on memory usage, but time remains a problem. Is there an existing method for this?
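
For illustration, here is a trimmed sketch of the same bookkeeping that loops only over the ids present in the current chunk instead of every id seen so far; it produces the same dictionary on test1.csv, but I have not timed it on larger data:

import pandas as pd

list_of_ids = {}  # id -> [start_row, end_row]

for df_chunk in pd.read_csv('test1.csv', chunksize=5, names=['id', 'val']):
    # only touch the ids that actually appear in this chunk
    for i in df_chunk['id'].unique():
        idx = df_chunk[df_chunk['id'] == i].index
        if i in list_of_ids:
            list_of_ids[i][1] = idx[-1]         # keep old start, extend the end
        else:
            list_of_ids[i] = [idx[0], idx[-1]]  # first occurrence of this id

print(list_of_ids)  # {1: [0, 2], 2: [3, 6], 3: [7, 10], 4: [11, 14]}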

(I am looking for complete code in the answer.)

anurag
  • So you want to first read all the ones, then all the twos, and so on? Also, what is MARKER 1? – Dani Mesejo Jan 08 '21 at 11:37
  • Yes, in the actual dataset, all `1`s (and others) may have many rows. I want to use a limited chunksize. **MARKER 1** is in the code I shared: `for i in list_of_ids.keys()` – anurag Jan 08 '21 at 11:39
  • So do you want only the first 5 rows (of 1s), or all the rows (of 1s), to be loaded in memory? – Dani Mesejo Jan 08 '21 at 11:41
  • To confirm: even when reading all `1`s etc., I might need to use chunked reading, but I want to ensure that for a particular id I can read all the rows associated with it! – anurag Jan 08 '21 at 11:42

1 Answer


I suggest you use itertools for this, as follows:

import pandas as pd
import csv
import io

from itertools import groupby, islice
from operator import itemgetter


def chunker(n, iterable):
    """
    From answer: https://stackoverflow.com/a/31185097/4001592
    >>> list(chunker(3, 'ABCDEFG'))
    [['A', 'B', 'C'], ['D', 'E', 'F'], ['G']]
    """
    iterable = iter(iterable)
    return iter(lambda: list(islice(iterable, n)), [])


chunk_size = 5
with open('test1.csv') as csv_file:
    reader = csv.reader(csv_file)
    # group consecutive rows that share the same value in the first column
    for _, group in groupby(reader, itemgetter(0)):
        # read each group in pieces of at most chunk_size rows
        for chunk in chunker(chunk_size, group):
            g = [','.join(e) for e in chunk]
            df = pd.read_csv(io.StringIO('\n'.join(g)), header=None)
            print(df)
            print('---')

Output (partial)

   0   1
0  1  10
1  1  11
2  1  12
---
   0   1
0  2  13
1  2  14
2  2  15
3  2  16
---
   0   1
0  3  17
1  3  18
2  3  19
3  3  20
---
...

This approach first reads the rows in groups keyed on the first column:

for _, group in groupby(reader, itemgetter(0)):

and each group will be read in chunks of 5 rows (this can be changed via chunk_size):

for chunk in chunker(chunk_size, group):

The last part:

g = [','.join(e) for e in chunk]
df = pd.read_csv(io.StringIO('\n'.join(g)), header=None)
print(df)
print('---')

creates a suitable CSV string to pass to pandas.
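
One caveat worth spelling out: itertools.groupby only groups consecutive rows with equal keys, so this approach assumes the file is already ordered by the first column (which test1.csv is). A minimal demonstration:

from itertools import groupby
from operator import itemgetter

# groupby merges only *adjacent* rows with equal keys; in an unsorted file
# the same id would come back as several separate groups
rows = [['1', '10'], ['2', '13'], ['1', '12']]
for key, group in groupby(rows, itemgetter(0)):
    print(key, list(group))
# 1 [['1', '10']]
# 2 [['2', '13']]
# 1 [['1', '12']]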
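
If the string round-trip through io.StringIO ever becomes a bottleneck, each chunk could also be handed to pandas directly; a sketch (the column names here are illustrative, not part of the original data):

import pandas as pd

chunk = [['1', '10'], ['1', '11'], ['1', '12']]  # one chunk as produced by chunker
df = pd.DataFrame(chunk, columns=['id', 'val']).apply(pd.to_numeric)
print(df)
#    id  val
# 0   1   10
# 1   1   11
# 2   1   12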

Dani Mesejo
  • Another part I just updated in my question: is it possible to read the groups themselves in chunks? Have a look at the last loop in my code to see what I mean. Also, `itemgetter(0)` is used to select the column, right? – anurag Jan 08 '21 at 12:04
  • 1
  • @anurag yes, itemgetter(0) selects the column to group by, and chunker is already reading the groups in chunks. It cannot be seen in the example output because the chunk size is 5 and no group in the sample has more than 4 rows – Dani Mesejo Jan 08 '21 at 12:05
  • The function is explained in the second part of the answer [here](https://stackoverflow.com/a/31170795/14739759) – anurag Jan 08 '21 at 13:20