Following this question and answer, I am trying to read a large CSV file in chunks and process it. Since I'm not a native Python user, I've run into an optimization problem and am looking for a better solution here.
What my code does:
I read in the line count of my csv with
with open(file) as f:
    row_count = sum(1 for line in f)
Afterwards I "slice" my data into 30 equally sized chunks and process them as in the linked answer, with a for loop and pd.read_csv(file, chunksize=...). Since plotting 30 graphs in one figure is pretty cluttered, I only plot every 5th chunk using modulo (the step may vary). For this I use an external counter.
import pandas as pd
import matplotlib.pyplot as plt

chunksize = row_count // 30
counter = 0
for chunk in pd.read_csv(file, chunksize=chunksize):
    df = chunk
    print(counter)
    if counter % 5 == 0 or counter == 0:  # plot every 5th chunk
        plt.plot(df["Variable"])
    counter = counter + 1
plt.show()
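For clarity, the same loop can also be written with enumerate instead of the external counter; this just tidies the code and does not solve the speed problem:

for counter, chunk in enumerate(pd.read_csv(file, chunksize=chunksize)):
    print(counter)
    if counter % 5 == 0:  # counter == 0 is already covered by the modulo check
        plt.plot(chunk["Variable"])
plt.show()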
Now to my question:
It seems like this loop reads each chunk in before executing the loop body, which is reasonable. I can see this because the print(counter)
steps are also fairly slow. Since I read a few million rows of CSV, every step takes some time. Is there a way to skip the unwanted chunks in the for loop, before reading them in? I was trying something like:
wanted_plts = [1, 5, 10, 15, 20, 25, 30]
for i in wanted_plts:
    for chunk[i] in pd.read_csv(file, chunksize=chunksize):
        ...
I think my problem is that I don't understand how to manipulate the syntax of the for loop here. There should be an elegant way to fix this.
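In valid Python, the closest I have come to expressing this is iterating over all chunks and filtering by index, but as far as I can tell this still reads every chunk from disk:

wanted_plts = [1, 5, 10, 15, 20, 25, 30]
for i, chunk in enumerate(pd.read_csv(file, chunksize=chunksize)):
    if i not in wanted_plts:
        continue  # the chunk was still parsed before being skipped
    plt.plot(chunk["Variable"])
plt.show()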
Also: I found pandas' .get_chunk(x), but this seems to create just one chunk of size x.
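For example (as far as I understand that API), get_chunk is called on the reader object that pd.read_csv returns when iterator=True is passed, and each call reads the next x rows:

reader = pd.read_csv(file, iterator=True)
first = reader.get_chunk(chunksize)   # DataFrame with the first `chunksize` rows
second = reader.get_chunk(chunksize)  # the next `chunksize` rows, still read sequentially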
Another attempt of mine was to subset the reader object of pd.read_csv, like pd.read_csv(...)[0, 1, 2], but it seems that's not possible either.
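To show what I mean: the reader is an iterator, not a list, so indexing it fails (at least in my tests):

reader = pd.read_csv(file, chunksize=chunksize)
reader[0]  # raises TypeError: the reader object does not support indexing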
Amendment: I'm aware that plotting a lot of data in matplotlib is slow. I preprocess the data earlier, but to keep this code readable I removed all unnecessary parts.