I have 3 huge CSV files containing climate data, each about 5 GB. The first cell in each line is the meteorological station's number (from 0 to about 100,000). Each station has from 1 to 800 lines in each file, and the count is not necessarily the same across files. For example, station 11 has 600, 500, and 200 lines in file1, file2, and file3 respectively. I want to read all the lines of one station, do some operations on them, write the results to another file, then move on to the next station, and so on. The files are too large to load into memory at once, so I tried some solutions that read them with minimal memory load, like this post and this post, which include this method:
    with open(...) as f:
        for line in f:
            <do something with line>
The problem with this method is that it reads the file from the beginning every time, while I want to read the files as follows:
    for station in range(100798):
        with open(file1) as f1, open(file2) as f2, open(file3) as f3:
            for line in f1:
                st = int(line.split(",")[0])  # station number as int, so it can be compared with `station`
                if st == station:
                    <store this line for some analysis>
                else:
                    break  # break the for loop and go to read the next file
            for line in f2:
                ...
                <similar code to f1>
                ...
            for line in f3:
                ...
                <similar code to f1>
                ...
        <do the analysis for this station, then go to the next station>
The problem is that each time I move on to the next station, the for loop starts from the beginning of the file again, while I want it to resume from where the 'break' occurred, i.e. to continue reading the file from the nth line.
How can I do it?
Thanks in advance
Notes about the solutions below: As I mentioned below at the time I posted my answer, I implemented the answer of @DerFaizio, but I found it very slow in processing.
After I tried the generator-based answer submitted by @PM_2Ring, I found it very, very fast, maybe because it relies on generators.
The difference between the two solutions shows in the number of stations processed per minute: 2500 st/min for the generator-based solution versus 45 st/min for the Pandas-based solution, which makes the generator-based solution more than 55 times faster.
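For readers who only want the gist of the generator idea, here is a minimal sketch. It is not @PM_2Ring's actual code; the names `station_groups` and `merged_stations` are hypothetical. It assumes that the files have no header row, that the lines of each station are contiguous, and that stations appear in ascending numeric order in every file. Each file is opened once and read lazily, so reading always continues from where it stopped, and only one station's lines are held in memory at a time.

```python
import csv
import heapq
from itertools import groupby
from operator import itemgetter


def station_groups(path, file_index):
    """Lazily yield (station, file_index, rows) for one CSV file.

    Assumption: rows of a station are contiguous and stations are in
    ascending numeric order, with no header row.
    """
    with open(path, newline="") as f:
        # Skip blank lines, which csv.reader returns as empty lists.
        rows = (row for row in csv.reader(f) if row)
        for station, group in groupby(rows, key=lambda row: int(row[0])):
            # Materialize only this station's rows (at most ~800 lines).
            yield station, file_index, list(group)


def merged_stations(paths):
    """Yield (station, per_file_rows) in ascending station order.

    per_file_rows[i] holds the rows found in paths[i]; it is an empty
    list when a station is missing from that file.
    """
    streams = [station_groups(p, i) for i, p in enumerate(paths)]
    # heapq.merge consumes the already-sorted streams lazily.
    merged = heapq.merge(*streams, key=itemgetter(0))
    for station, parts in groupby(merged, key=itemgetter(0)):
        per_file = [[] for _ in paths]
        for _, idx, rows in parts:
            per_file[idx] = rows
        yield station, per_file
```

With this sketch, the main loop becomes `for station, (rows1, rows2, rows3) in merged_stations([file1, file2, file3]):` followed by the analysis and the write to the output file. Each file is read exactly once from start to end, instead of being reopened for every station.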
I will keep both implementations below for reference. Many thanks to all contributors, especially @PM_2Ring.