I have a large piece of data that is problematic to load entirely into memory, so I decided to read it row by row: pick the desired data, make the transformations, etc., then clear the variables and move on to the next row.
This works fine when I use csv.reader:
import csv

source_file = open(path_to_source_file, encoding="utf8", newline='')
spamreader = csv.reader(source_file, delimiter=";", quotechar='"')
for id, row in enumerate(spamreader):
    variable1 = row[0]
    variable2 = row[1]
    variable3 = row[2]
    variable4 = row[3]
    ...
    variable25 = row[26]
    ...
    if id % 1000000 == 0: print(f"Number of processed rows: {id}")
However, I am afraid that in the future some columns may be added/removed/swapped, and addressing them by row[id] is not a "safe" option.
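For reference, I know the csv module can also address columns by header name via csv.DictReader; a minimal sketch of what I mean (the column names below are placeholders, since the real header isn't shown):

import csv

with open(path_to_source_file, encoding="utf8", newline='') as source_file:
    spamreader = csv.DictReader(source_file, delimiter=";", quotechar='"')
    for id, row in enumerate(spamreader):
        # rows come back as dicts keyed by the header names
        variable1 = row["column0"]   # placeholder column names
        variable2 = row["column1"]
        if id % 1000000 == 0:
            print(f"Number of processed rows: {id}")

Still, I was wondering whether I could load the data with pandas instead, because a DataFrame also allows addressing columns by name: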
for id, row in df.iterrows():
    variable1 = row['column0']
    variable2 = row['column1']
    variable3 = row['column2']
    variable4 = row['column3']
    ...
    variable25 = row['column25']
    ...
    if id % 1000000 == 0: print(f"Number of processed rows: {id}")
However, I wasn't able to make it work at a comparable speed.
I tried using chunksize=1, but even a bare pass took a very long time to execute:
import pandas as pd

with pd.read_csv(path_to_source_file, sep=";", quotechar='"', chunksize=1) as reader:
    for chunk in reader:
        pass
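A rough way to compare the bare loops (the path is a placeholder, and the timings obviously depend on the file) would be something like:

import csv
import time

import pandas as pd

path_to_source_file = "data.csv"  # placeholder path

# time the bare csv.reader loop
start = time.perf_counter()
with open(path_to_source_file, encoding="utf8", newline='') as f:
    for row in csv.reader(f, delimiter=";", quotechar='"'):
        pass
print(f"csv.reader: {time.perf_counter() - start:.1f} s")

# time the pandas chunksize=1 loop
start = time.perf_counter()
with pd.read_csv(path_to_source_file, sep=";", quotechar='"', chunksize=1) as reader:
    for chunk in reader:
        pass
print(f"pandas chunksize=1: {time.perf_counter() - start:.1f} s")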
I also tried nested loops, reading bigger chunks and then iterating over the rows:
with pd.read_csv(path_to_source_file, sep=";", quotechar='"', chunksize=10000) as reader:
    for chunk in reader:
        for id, row in chunk.iterrows():
            pass
but this is also really slow compared to the csv.reader version.
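I guess something like itertuples() might keep column-name access while being faster than iterrows(), but I haven't verified that on my data; just a sketch (placeholder column names again):

with pd.read_csv(path_to_source_file, sep=";", quotechar='"', chunksize=10000) as reader:
    for chunk in reader:
        # itertuples() yields namedtuples, so columns are still accessed by name
        for row in chunk.itertuples(index=False):
            variable1 = row.column0   # placeholder column names
            variable2 = row.column1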
So, to summarize my question: is there any way to load data from a CSV file with pandas row by row at a speed comparable to csv.reader?