
I'm trying to refactor the following pseudo-code so that, instead of reading from and writing to files, it works entirely in memory using pandas. However, I'm confused by how the `with` statement operates compared to a loop over a pandas DataFrame.

This is the code I would like to refactor:

results = []
with open('data.csv', 'rt') as ins:
    next(ins)  # drop header
    a1, b1, c1 = next(ins).strip().split(',')
    for i, line in enumerate(ins, 2):
        a2, b1, c1 = line.strip().split(',')
        ...
        results.append(dummy_func(a1, b1, c1))
    else:
        results.append(dummy_func(a1, b1, c1))

Is this the in-memory equivalent? In particular, I'm not sure whether `ins` yields the lines of the file, whether I need both `itertuples` loops, and, as a side note, whether `itertuples` is the best choice here (is it faster than `iterrows`, for example)?

import pandas as pd
df = pd.read_csv('data.csv', sep=',')
results = []
for row in df.itertuples():
    a1, b1, c1 = row.a, row.b, row.c
    for row2 in df.loc[2:].itertuples():
        a1, b1, c1 = row2.a, row2.b, row2.c
        ...
        results.append(dummy_func(a1, b1, c1))
    else: 
        results.append(dummy_func(a1, b1, c1))
William Grimes
    `with...` is a context manager for opening the file. Pandas `read_csv` has built-in file context management, so there's no need to use `with`. Beyond that, it's nearly impossible to say what the best way to achieve your goal is, because you haven't described it. In general, there are few reasons you should ever need to iterate over a pandas dataframe, but that's entirely context dependent. Please see [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and provide some sample input and output for better help – G. Anderson Jul 19 '19 at 18:37
  • @G.Anderson thanks very much, please could you help me to understand what that means that `read_csv` has built in file context management, or point me to somewhere for more information on that? – William Grimes Jul 19 '19 at 18:51
  • Why are you replacing your efficient version with an inefficient pandas version? Just use the `csv` module. – juanpa.arrivillaga Jul 19 '19 at 19:11
  • https://stackoverflow.com/questions/3693771/trying-to-understand-python-with-statement-and-context-managers – G. Anderson Jul 19 '19 at 19:24
  • @juanpa.arrivillaga this is part of a pipeline and I don't want to keep reading and writing intermediate files – William Grimes Jul 19 '19 at 19:55
  • @WilliamGrimes you wouldn't have to, any more than you would with pandas. In any case, both `itertuples` and `iterrows` will be slow compared to a `csv` based solution. And less memory efficient (significantly so) – juanpa.arrivillaga Jul 19 '19 at 20:49
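For reference, the in-memory `csv`-module approach suggested in the comments might look like the sketch below. This is an assumption-laden illustration: `dummy_func` is not shown in the question, so a trivial stand-in is used, and since the variable flow in the original loop is ambiguous, the sketch simply applies `dummy_func` to each row after the first, with the `for`/`else` repeating the final call as the original does:

```python
import csv
import io

# Hypothetical stand-in for the original dummy_func, which is not shown.
def dummy_func(a, b, c):
    return (a, b, c)

# In-memory CSV text for illustration; a real pipeline could pass any
# file-like object from an upstream step instead of opening a path.
data = io.StringIO("a,b,c\n1,2,3\n4,5,6\n7,8,9\n")

reader = csv.reader(data)
next(reader)              # drop header, as in the original
rows = list(reader)       # remaining rows held in memory

results = []
a1, b1, c1 = rows[0]      # first data row
for a2, b2, c2 in rows[1:]:
    results.append(dummy_func(a2, b2, c2))
else:
    results.append(dummy_func(a2, b2, c2))  # for/else repeats the final call
```

No intermediate files are written here: the rows live in a plain list, which is also lighter-weight than a DataFrame for row-by-row processing.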

1 Answer


Okay, I misunderstood the `with` statement; this is the answer:

import pandas as pd
df = pd.read_csv('data.csv', sep=',')
results = []
first = df.iloc[0]                        # first data row
a1, b1, c1 = first.a, first.b, first.c
for row2 in df.iloc[1:].itertuples():     # remaining rows
    a1, b1, c1 = row2.a, row2.b, row2.c
    ...
    results.append(dummy_func(a1, b1, c1))
else:
    results.append(dummy_func(a1, b1, c1))
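On the side question: yes, `itertuples` is generally much faster than `iterrows`, because it yields plain namedtuples rather than constructing a `Series` for every row (`iterrows` also coerces each row to a common dtype, which can silently change types). A small sketch with made-up data, showing that both visit the same values:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# itertuples: one namedtuple per row, attribute access, no per-row Series
via_itertuples = [(row.a, row.b, row.c) for row in df.itertuples()]

# iterrows: one (index, Series) pair per row; slower and dtype-coercing
via_iterrows = [(row['a'], row['b'], row['c']) for _, row in df.iterrows()]

print(via_itertuples == via_iterrows)  # same values either way
```

If `dummy_func` can be expressed in terms of whole columns, a vectorized call such as `dummy_func(df['a'], df['b'], df['c'])` would avoid row iteration entirely, but that depends on what `dummy_func` actually does.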
William Grimes