
I have a very large CSV file, and I would like to load one row out of every 3 into a DataFrame. It is more or less like resampling the CSV.

Let's say I have a csv file like this:

4  5
9  2
3  7
1  5
2  4
9  10

And I want my dataframe to be:

4  5
1  5

If I read the whole csv and then filter it down to 1 row out of every 3, it is useless because it takes too much time. Does anyone have an idea? :) (By the way, I am using Python)

Cheers

Ani
    Relevant topic with proposed solutions, and why they (don't) work: https://stackoverflow.com/questions/56258448/only-read-certain-rows-in-a-csv-file-with-python – user_cr Apr 20 '22 at 12:13
  • https://stackoverflow.com/questions/620367/how-to-jump-to-a-particular-line-in-a-huge-text-file#620492 Have you tried some of the propositions in this thread? – LittlePanic404 Apr 20 '22 at 13:58

2 Answers


If I understood correctly, you want to cut your read time to (at most) 1/3 of the total. Pandas has many options to customize a csv read, but none of them will avoid reading (and then discarding) the whole file, since it is stored in contiguous blocks on your disk.

What I think is that if your constraint is time (and not memory), a 1/3 reduction is not going to be enough in any case, for any file size. What you can do is:

  • read the whole csv
  • filter it keeping just 1 row each 3
  • store the result in another file
  • on following runs, read the filtered csv (see the sketch after this list)
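
A minimal sketch of that one-off preprocessing step, assuming the file has no header row and using placeholder file names ('file.csv', 'file_filtered.csv'):

import pandas as pd

# One-off run: pay the full read cost once, keep every 3rd row
# starting from row 0, and persist the small result.
df = pd.read_csv('file.csv', header=None)
df.iloc[::3].to_csv('file_filtered.csv', index=False, header=False)

# On all following runs, only the filtered file is read:
df = pd.read_csv('file_filtered.csv', header=None)

You read the big file exactly once; every later run only touches the reduced file.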
rikyeah

You need to create a csv reader object first, then create a generator which yields only every nth item from the iterator, and then use it as the DataFrame source. Done this way, you avoid excessive memory usage.

import csv
import pandas as pd

with open('file.csv', newline='') as f:
    reader = csv.reader(f)
    # keep only rows 0, 3, 6, ... without materializing the rest
    data = (x for i, x in enumerate(reader) if i % 3 == 0)
    # the generator is lazy, so the DataFrame must be built while
    # the file is still open
    df = pd.DataFrame(data)
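
One caveat with this approach: csv.reader yields lists of strings, so every column of the resulting DataFrame will have object dtype. If the file is numeric, something like df.apply(pd.to_numeric) should restore proper types.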

It looks like there is also a simpler way: passing a lambda to the skiprows argument of read_csv:

import pandas as pd

# skiprows is called with each row index; returning True skips the row,
# so this keeps only rows 0, 3, 6, ...
fn = lambda x: x % 3 != 0
df = pd.read_csv('file.csv', skiprows=fn)
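
Keep in mind that pandas still has to scan the entire file either way; the skiprows callable only keeps the skipped rows out of the resulting DataFrame, so it mainly helps with memory rather than read time.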
svfat
  • Hi, thank you a lot for your help! Actually, yes, I want to avoid massive file storage. I also need to work on this file for data analysis, but it is too big; it would take me a lifetime. I have tried this for resampling: with open(filepath) as csv_file: csv_reader = csv.reader(csv_file); line_count = 0; for row in csv_reader: if line_count % 3 == 0: df.append(row); line_count += 1 else: line_count += 1 – Ani Apr 20 '22 at 12:50
  • It seems to work, but the process is still too long. I tried your proposition, and I got the error: TypeError: '_csv.reader' object is not subscriptable – Ani Apr 20 '22 at 12:53
  • my assumption was wrong; it looks like csv.reader doesn't support steps. I've changed the code and added a 2nd way to do the filtering – svfat Apr 20 '22 at 13:18
  • @Ani Happy to help, and welcome to Stack Overflow. If this answer or any other one solved your issue, please mark it as accepted – svfat Apr 21 '22 at 14:21