
Related to "why should I make a copy of a data frame in pandas".

I noticed the following in the popular backtesting library, at line 631:

def __init__(self, data: pd.DataFrame)
    data = data.copy(False)

What's the purpose of such a shallow copy?

ihadanny
  • how is your question different from the one you linked to? – Chris_Rands Aug 14 '19 at 12:51
  • @Chris_Rands the question OP links to uses a deep copy, while this example uses a shallow copy. – Laurens Koppenol Aug 14 '19 at 12:54
  • You make a shallow copy when you want the underlying items to change as the original item is updated, if you want modifications in the copy to be reflected in the original, or if the work you are going to do with the copy will not impact the original and you want to save space by referencing the same underlying items. – CMMCD Aug 14 '19 at 13:06
  • @CMMCD - OK, but why would I make such a copy at the beginning of my function, the way the author of backtesting did? It's weird that he would want the modifications within the library to be reflected in the dataframe the user sent to the library, no? And even if that's the purpose, why not simply use `data`? Why bother to call a shallow copy? – ihadanny Aug 14 '19 at 18:08

1 Answer

A shallow copy allows you to:

  1. access the frame's data without copying it (memory optimization, etc.)
  2. modify the frame's structure without those changes being reflected in the original dataframe

In backtesting, the developer tries to convert the index to datetime format (line 640) and adds a new column 'Volume' filled with np.nan values if it isn't already in the dataframe. Because these are structural changes made on the shallow copy, they are not reflected in the original dataframe.

Example

>>> a = pd.DataFrame([[1, 'a'], [2, 'b']], columns=['i', 's'])
>>> b = a.copy(False)
>>> a
    i  s
 0  1  a
 1  2  b
>>> b
    i  s
 0  1  a
 1  2  b
>>> b.index = pd.to_datetime(b.index)
>>> b['volume'] = 0
>>> b
                               i  s  volume
1970-01-01 00:00:00.000000000  1  a       0
1970-01-01 00:00:00.000000001  2  b       0
>>> a
    i  s
 0  1  a
 1  2  b

Of course, if you don't create a shallow copy, those changes to the dataframe's structure will be reflected in the original one.
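To make the contrast concrete, here is a minimal sketch of the same steps with and without the shallow copy. The helper names `prepare_with_copy` and `prepare_without_copy`, the 'Close' column and the sample dates are made up for illustration; this is not the library's actual code.

```python
import numpy as np
import pandas as pd


def prepare_with_copy(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: normalise a frame without touching the caller's object."""
    data = data.copy(deep=False)  # new DataFrame object, no data duplicated up front
    data.index = pd.to_datetime(data.index)
    if "Volume" not in data:
        data["Volume"] = np.nan
    return data


def prepare_without_copy(data: pd.DataFrame) -> pd.DataFrame:
    """Same steps, applied directly to the DataFrame the caller passed in."""
    data.index = pd.to_datetime(data.index)
    if "Volume" not in data:
        data["Volume"] = np.nan
    return data


user_df = pd.DataFrame({"Close": [1.0, 2.0]}, index=["2019-08-14", "2019-08-15"])

# With the shallow copy, the caller's frame keeps its columns and index.
prepared = prepare_with_copy(user_df)
cols_after_copy = list(user_df.columns)
index_untouched = not isinstance(user_df.index, pd.DatetimeIndex)
print(cols_after_copy)         # ['Close']
print(index_untouched)         # True
print(list(prepared.columns))  # ['Close', 'Volume']

# Without it, the function works on the caller's object, so the changes leak out.
leaked = prepare_without_copy(user_df)
cols_after_no_copy = list(user_df.columns)
print(leaked is user_df)       # True
print(cols_after_no_copy)      # ['Close', 'Volume']
```

The shallow copy only protects the frame's structure (index, set of columns). Whether in-place writes to existing values propagate to the original depends on the pandas version and on whether Copy-on-Write is enabled.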

Viacheslav Zhukov
  • Great explanation! Just for completeness, if you do `b['i'] += 1` it **will** reflect on the original dataframe `a`. – ihadanny Aug 15 '19 at 15:22
  • I think you confused shallow copies with deep copies. A shallow copy stores references, allowing you to modify the original by modifying the shallow copy. A deep copy makes a xerox of the original that is completely separate. – Clever7- Nov 27 '22 at 14:21