1

I am reading a large CSV file in chunks because I don't have enough memory to load it all at once. I would like to read its first 10 rows (rows 0 to 9), skip the next 10 rows (10 to 19), read the next 10 rows (20 to 29), skip the next 10 (30 to 39), read rows 40 to 49, and so on. This is the code I am using:

# initializing the n1 and n2 variables
n1 = 1
n2 = 2
# reading data in chunks
for chunk in pd.read_csv('../input/train.csv', chunksize=10, dtype=dtypes, skiprows=list(range((n1 * 10) + 1, (n2 * 10) + 1))):
    sample_chunk = chunk
    # displaying the sample_chunk
    print(sample_chunk)
    # incrementing n1
    n1 = n1 + 2
    # incrementing n2
    n2 = n2 + 2

However, the code does not work the way I intended. It only skips rows 10 to 19 (i.e. it reads rows 0 to 9, skips 10 to 19, reads 20 to 29, then also reads 30 to 39, 40 to 49, and keeps on reading all the remaining rows). Please help me identify what I am doing wrong.

Noor

2 Answers

1

With your method, you need to define all the rows to skip at the time pd.read_csv is called, which you can do like this:

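rowskips = [i for x in range(1, lengthOfFile // 10 + 1, 2) for i in range(x*10 + 1, (x+1)*10 + 1)]

with lengthOfFile being the number of data rows in the file. The + 1 accounts for the header, which is line 0 of the file, so data row r sits on file line r + 1.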

Then pass it to pd.read_csv:

pd.read_csv('../input/train.csv', chunksize=10, dtype=dtypes, skiprows=rowskips)
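The list above needs lengthOfFile, and the file is too large to load just to count its rows. A minimal sketch of one way to get the count by streaming over the lines; the helper name count_data_rows and the in-memory sample are hypothetical, and it assumes no quoted field contains an embedded newline:

```python
import io


def count_data_rows(handle, has_header=True):
    """Count data rows in an open text file by streaming over its lines."""
    total = sum(1 for _ in handle)
    # Drop the header line from the count when the file has one.
    return max(total - 1, 0) if has_header else total


# Small in-memory stand-in for the real CSV file (hypothetical contents).
sample = io.StringIO("a,b\n" + "".join(f"{i},{i * 2}\n" for i in range(25)))
lengthOfFile = count_data_rows(sample)
print(lengthOfFile)  # 25
```

For the real file, the same helper can be called inside `with open('../input/train.csv') as f:`.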

From the documentation:

skiprows : list-like, int or callable, optional

    Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

    If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].

So you can pass a list, an int or a callable:

int -> skips that many lines at the start of the file
list -> skips the line numbers given in the list
callable -> is evaluated against each line number and decides whether to skip that line or not.
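The three forms can be compared side by side on a tiny in-memory file. This is only an illustrative sketch: the helper read_values and the sample data are made up, and the sample deliberately has no header line (header=None) so that line numbers and row numbers coincide. With a header, line 0 is the header itself.

```python
import io

import pandas as pd

# Six lines holding the values 0..5, with no header line.
CSV_TEXT = "".join(f"{i}\n" for i in range(6))


def read_values(skiprows):
    """Read the sample with the given skiprows and return the values kept."""
    frame = pd.read_csv(
        io.StringIO(CSV_TEXT), header=None, names=["value"], skiprows=skiprows
    )
    return frame["value"].tolist()


as_int = read_values(2)                            # skip the first two lines
as_list = read_values([1, 3])                      # skip lines 1 and 3
as_callable = read_values(lambda i: i % 2 == 1)    # skip every odd line

print(as_int)       # [2, 3, 4, 5]
print(as_list)      # [0, 2, 4, 5]
print(as_callable)  # [0, 2, 4]
```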

You were passing a list, which is fixed at the moment pd.read_csv is called. Changing n1 and n2 inside the loop has no effect, because the list has already been built and cannot be updated afterwards. Another way might be to pass a callable, lambda x: x in rowskips, which is evaluated for each line to decide whether it should be skipped.
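A callable can also compute the answer from the line number directly, so neither the file length nor a prebuilt list is needed. A minimal sketch, assuming the file has a header on line 0; the function name skip_odd_blocks and the in-memory sample are hypothetical:

```python
import io

import pandas as pd


def skip_odd_blocks(line_number, block=10):
    """Return True for lines in the 2nd, 4th, ... block of data rows.

    Line 0 is the header, so data row r sits on file line r + 1.
    """
    if line_number == 0:
        return False  # always keep the header
    return ((line_number - 1) // block) % 2 == 1


# In-memory stand-in for the real file: a header plus data rows 0..49.
csv_text = "row\n" + "".join(f"{i}\n" for i in range(50))

reader = pd.read_csv(io.StringIO(csv_text), chunksize=10, skiprows=skip_odd_blocks)
kept = [chunk["row"].tolist() for chunk in reader]

print([(rows[0], rows[-1]) for rows in kept])  # [(0, 9), (20, 29), (40, 49)]
```

The explicit check for line 0 matters: in Python, (0 - 1) // 10 is -1 and -1 % 2 is 1, so without it the header would be skipped as well.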

jkhadka
  • your program only keeps row from 0-9 and skips all other – Nihal Feb 19 '19 at 11:46
  • @Nihal yeah i missed the `2` in `range` – jkhadka Feb 19 '19 at 11:47
  • still wrong, lets say i have `length=400` then it will go till `4000` – Nihal Feb 19 '19 at 11:49
  • @Nihal thanks for that, yeah, now this should be fine. i overlooked the `*10` in the second `for` – jkhadka Feb 19 '19 at 11:53
  • @hadik Thanks. Can you please explain why do we need to define all skiprows in the time of initialising the pd.read_csv, I mean I could not comprehend your point in the light of the documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). – Noor Feb 19 '19 at 12:14
  • @Noor added the explanation. – jkhadka Feb 19 '19 at 13:37
1

code:

# block boundaries: 0, 10, 20, ...
ro = list(range(0, lengthOfFile + 10, 10))
# every second block is skipped; j + 1 accounts for the header on line 0
d = [j + 1 for i in range(1, len(ro) - 1, 2) for j in range(ro[i], ro[i + 1])]
print(d)

pd.read_csv('../input/train.csv', chunksize=10, dtype=dtypes, skiprows=d)

for example, the data rows selected for skipping (before the + 1 header offset is applied):

lengthOfFile = 100
ro = list(range(0, lengthOfFile + 10, 10))
d = [j for i in range(1, len(ro) - 1, 2) for j in range(ro[i], ro[i + 1])]
print(d)

output: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]

Nihal
  • Thanks @Nihal . I have implemented your code, however, it reads rows 0,1,2,3,4,5,6,7,8,**19** instead of 0-9, then reads rows 20,21,22,23,24,25,26,27,28,**39** instead of 20-29, then reads 40,41,42,43,44,45,46,47,48,**59** instead of 40-49 and so on. – Noor Feb 19 '19 at 12:42
  • updated my answer, just use `j + 1` for creating `d` – Nihal Feb 19 '19 at 12:51