
My DataFrame is very large. Is there a nice way (without a for loop) to modify some values within the DF and save a copy every N steps? e.g.

def modifier(x):
    x = x.split()  # more complex logic is applied here
    return x

df['new_col'] = df.old_col.apply(modifier)

Is there a nice way to add some code to the modifier function so that, every 10,000 rows,

df.to_pickle('make_copy.pickle')

will be called?

2 Answers


For saving every so-many rows, the issue is making sure the edge case is handled properly, since the last section might not be a full-size section. Using an approach discussed here, you could do something along the following lines. Although there is a loop, it only iterates once per section.

Note that if you save every section you need a mechanism for saving each one under a new name (or else append the sections to a list of DFs and save that). The approach is efficient because it uses the default numerical index for splitting, so that index needs to be in place (or restored using reset_index). If that is not available, or you want to split into chunks without looping, you could explore numpy's array_split; but the same looping would still be required to save each chunk to a file.

from more_itertools import sliced   # this module might need to be installed using pip
SLICE_SIZE = 10000

slices = sliced(range(len(df)), SLICE_SIZE)

for index in slices:
    df_slice = df.iloc[index]
    print(df_slice)          # or do anything you want with the section of the DF such as save it as required
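
As mentioned above, saving every section needs each file to get its own name. A minimal sketch of that idea, assuming a DataFrame df with an old_col column as in the question; the modifier body and the make_copy_<i>.pickle file-name pattern are only illustrative:

from more_itertools import sliced

SLICE_SIZE = 10000

def modifier(x):
    return x.split()   # stand-in for the real row-level logic

slices = sliced(range(len(df)), SLICE_SIZE)

for i, index in enumerate(slices):
    df_slice = df.iloc[index].copy()                         # positional slice of the DF
    df_slice['new_col'] = df_slice.old_col.apply(modifier)   # transform only this section
    df_slice.to_pickle(f'make_copy_{i:05d}.pickle')          # each section saved under a new name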
user19077881
  • Thanks, but I was hoping for a solution without a for loop... is there something that can be called from inside the `modifier` function? – TommyLeeJones Feb 08 '23 at 13:39
  • It only loops once every 10,000 rows and the time penalty will be negligible. It is not clear to me how you expect the modifier function to work with .apply; this applies a function along a DF axis so modifier would be sent each DF row in turn; what is your split supposed to do with the single row when it gets it? Could you provide a reduced example with just a few rows to be split into chunks of say 2 rows and show what you expect to end up with? – user19077881 Feb 08 '23 at 13:57
  • I was thinking whether there is something like self.to_pickle() inside the modifier. – TommyLeeJones Feb 08 '23 at 15:09
  • But the modifier function only gets one row at a time and that's not what you want to pickle. You need to section the DF as in my code and then do whatever you want to modify the values and then pickle that section with a new file name generated within the loop. There is no need or value in avoiding a loop; anyway almost all time will be used writing the files. Try it! – user19077881 Feb 08 '23 at 15:43
  • "But the modifier function only gets one" - I know.. that's why I am posting it on Stack... Loop is a problem if you need to do it thousands of time. – TommyLeeJones Feb 09 '23 at 16:27
  • I don't see the loop as a problem as it only loops once for each 10,000 rows. If you have say 100M rows then that is 10k loops which takes a few ms **IN TOTAL** not each loop; it's whats inside the loop that will take the time and that's writing out each 10,000 section using pickle and that's inevitable as it's what you want to do anyway. I've suggested an efficient way to do it but if you have something better then good luck with it. – user19077881 Feb 09 '23 at 16:54

I wanted to achieve something similar - here is my approach:

# Import packages
from more_itertools import sliced
import pandas as pd

# Create DataFrame
df = pd.DataFrame(data={'col_a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'col_b': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]})

# Split/slice DataFrame into chunks of 4
slices = sliced(seq=range(len(df)), n=4)

# Create empty DataFrame
data = pd.DataFrame()

# Apply function and save every N steps
for index in slices:
    chunk = df.iloc[index].copy()
    chunk['new_column'] = index # Apply function/transformation here
    data = pd.concat([data, chunk], axis=0, ignore_index=True, sort=False)
    data.to_pickle(path='df.pkl')
User
  • Thank you for your effort. My constraint is that, e.g., I have 512GB of RAM and the DataFrame is ~450GB. In addition, it takes a week to process the entire DataFrame. That's why I wanted to save every N steps: in case of a server failure I don't have to restart the job. – TommyLeeJones Apr 20 '23 at 10:18
  • @TommyLeeJones in my case I am running a transformation that is really slow to execute, so I split it into chunks of 50 rows and save after each one. If I want to continue another day, I keep running the code, adding logic to skip the rows where it has already been executed. – User Apr 20 '23 at 12:18
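
A minimal sketch of the resume-from-checkpoint idea described in the comments above, assuming each processed chunk is written to its own numbered pickle so a restarted job can skip work that is already on disk; the file names, SLICE_SIZE and modifier are only illustrative:

import os
from more_itertools import sliced
import pandas as pd

SLICE_SIZE = 10000

for i, index in enumerate(sliced(range(len(df)), SLICE_SIZE)):
    out_path = f'chunk_{i:05d}.pkl'
    if os.path.exists(out_path):       # chunk already processed before the failure; skip it
        continue
    chunk = df.iloc[index].copy()
    chunk['new_col'] = chunk.old_col.apply(modifier)   # the slow transformation
    chunk.to_pickle(out_path)          # checkpoint this chunk

# afterwards the pieces can be stitched back together, e.g.
# import glob
# result = pd.concat([pd.read_pickle(p) for p in sorted(glob.glob('chunk_*.pkl'))], ignore_index=True)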