3

Initial Question:

I'm looping through a couple of thousand pickle files with Python Pandas DataFrames in it which vary in the number of rows (between aprox. 600 and 1300) but not in the number of collumns (636 to be exact). Then I transform them (exactly the same tranformations to each) and append them to a csv file using the DataFrame.to_csv() method.

The to_csv code excerpt:

if picklefile == '0000.p':
    dftemp.to_csv(finalnormCSVFile)
else:
    dftemp.to_csv(finalnormCSVFile, mode='a', header=False)

What bothers me is that it starts off pretty fast but performance decreases exponentially, I kept a processing time log:

start: 2015-03-24 03:26:36.958058

2015-03-24 03:26:36.958058
count = 0
time: 0:00:00

2015-03-24 03:30:53.254755
count = 100
time: 0:04:16.296697

2015-03-24 03:39:16.149883
count = 200
time: 0:08:22.895128

2015-03-24 03:51:12.247342
count = 300
time: 0:11:56.097459

2015-03-24 04:06:45.099034
count = 400
time: 0:15:32.851692

2015-03-24 04:26:09.411652
count = 500
time: 0:19:24.312618

2015-03-24 04:49:14.519529
count = 600
time: 0:23:05.107877

2015-03-24 05:16:30.175175
count = 700
time: 0:27:15.655646

2015-03-24 05:47:04.792289
count = 800
time: 0:30:34.617114

2015-03-24 06:21:35.137891
count = 900
time: 0:34:30.345602

2015-03-24 06:59:53.313468
count = 1000
time: 0:38:18.175577

2015-03-24 07:39:29.805270
count = 1100
time: 0:39:36.491802

2015-03-24 08:20:30.852613
count = 1200
time: 0:41:01.047343

2015-03-24 09:04:14.613948
count = 1300
time: 0:43:43.761335

2015-03-24 09:51:45.502538
count = 1400
time: 0:47:30.888590

2015-03-24 11:09:48.366950
count = 1500
time: 1:18:02.864412

2015-03-24 13:02:33.152289
count = 1600
time: 1:52:44.785339

2015-03-24 15:30:58.534493
count = 1700
time: 2:28:25.382204

2015-03-24 18:09:40.391639
count = 1800
time: 2:38:41.857146

2015-03-24 21:03:19.204587
count = 1900
time: 2:53:38.812948

2015-03-25 00:00:05.855970
count = 2000
time: 2:56:46.651383

2015-03-25 03:53:05.020944
count = 2100
time: 3:52:59.164974

2015-03-25 05:02:16.534149
count = 2200
time: 1:09:11.513205

2015-03-25 06:07:32.446801
count = 2300
time: 1:05:15.912652

2015-03-25 07:13:45.075216
count = 2400
time: 1:06:12.628415

2015-03-25 08:20:17.927286
count = 2500
time: 1:06:32.852070

2015-03-25 09:27:20.676520
count = 2600
time: 1:07:02.749234

2015-03-25 10:35:01.657199
count = 2700
time: 1:07:40.980679

2015-03-25 11:43:20.788178
count = 2800
time: 1:08:19.130979

2015-03-25 12:53:57.734390
count = 2900
time: 1:10:36.946212

2015-03-25 14:07:20.936314
count = 3000
time: 1:13:23.201924

2015-03-25 15:22:47.076786
count = 3100
time: 1:15:26.140472

2015-03-25 19:51:10.776342
count = 3200
time: 4:28:23.699556

2015-03-26 03:06:47.372698
count = 3300
time: 7:15:36.596356

count = 3324
end of cycle: 2015-03-26 03:59:54.161842

end: 2015-03-26 03:59:54.161842
total duration: 2 days, 0:33:17.203784

Update #1:

I did as you suggested @Alexander but it has certainly to do with the to_csv() mehod:

start: 2015-03-26 05:18:25.948410

2015-03-26 05:18:25.948410
count = 0
time: 0:00:00

2015-03-26 05:20:30.425041
count = 100
time: 0:02:04.476631

2015-03-26 05:22:27.680582
count = 200
time: 0:01:57.255541

2015-03-26 05:24:26.012598
count = 300
time: 0:01:58.332016

2015-03-26 05:26:16.542835
count = 400
time: 0:01:50.530237

2015-03-26 05:27:58.063196
count = 500
time: 0:01:41.520361

2015-03-26 05:29:45.769580
count = 600
time: 0:01:47.706384

2015-03-26 05:31:44.537213
count = 700
time: 0:01:58.767633

2015-03-26 05:33:41.591837
count = 800
time: 0:01:57.054624

2015-03-26 05:35:43.963843
count = 900
time: 0:02:02.372006

2015-03-26 05:37:46.171643
count = 1000
time: 0:02:02.207800

2015-03-26 05:38:36.493399
count = 1100
time: 0:00:50.321756

2015-03-26 05:39:42.123395
count = 1200
time: 0:01:05.629996

2015-03-26 05:41:13.122048
count = 1300
time: 0:01:30.998653

2015-03-26 05:42:41.885513
count = 1400
time: 0:01:28.763465

2015-03-26 05:44:20.937519
count = 1500
time: 0:01:39.052006

2015-03-26 05:46:16.012842
count = 1600
time: 0:01:55.075323

2015-03-26 05:48:14.727444
count = 1700
time: 0:01:58.714602

2015-03-26 05:50:15.792909
count = 1800
time: 0:02:01.065465

2015-03-26 05:51:48.228601
count = 1900
time: 0:01:32.435692

2015-03-26 05:52:22.755937
count = 2000
time: 0:00:34.527336

2015-03-26 05:52:58.289474
count = 2100
time: 0:00:35.533537

2015-03-26 05:53:39.406794
count = 2200
time: 0:00:41.117320

2015-03-26 05:54:11.348939
count = 2300
time: 0:00:31.942145

2015-03-26 05:54:43.057281
count = 2400
time: 0:00:31.708342

2015-03-26 05:55:19.483600
count = 2500
time: 0:00:36.426319

2015-03-26 05:55:52.216424
count = 2600
time: 0:00:32.732824

2015-03-26 05:56:27.409991
count = 2700
time: 0:00:35.193567

2015-03-26 05:57:00.810139
count = 2800
time: 0:00:33.400148

2015-03-26 05:58:17.109425
count = 2900
time: 0:01:16.299286

2015-03-26 05:59:31.021719
count = 3000
time: 0:01:13.912294

2015-03-26 06:00:49.200303
count = 3100
time: 0:01:18.178584

2015-03-26 06:02:07.732028
count = 3200
time: 0:01:18.531725

2015-03-26 06:03:28.518541
count = 3300
time: 0:01:20.786513

count = 3324
end of cycle: 2015-03-26 06:03:47.321182

end: 2015-03-26 06:03:47.321182
total duration: 0:45:21.372772

And as requested, the source code:

import pickle
import pandas as pd
import numpy as np
from os import listdir
from os.path import isfile, join
from datetime import datetime

# Defining function to deep copy pandas data frame:
def very_deep_copy(self):
    return pd.DataFrame(self.values.copy(), self.index.copy(), self.columns.copy())

# Adding function to Dataframe module:    
pd.DataFrame.very_deep_copy = very_deep_copy

#Define Data Frame Header:
head = [
    'ConcatIndex', 'Concatenated String Index', 'FileID', ..., 'Attribute<autosave>', 'Attribute<bgcolor>'
    ]
exclude = [
    'ConcatIndex', 'Concatenated String Index', 'FileID', ... , 'Real URL Array'
    ]

path = "./dataset_final/"
pickleFiles = [ f for f in listdir(path) if isfile(join(path,f)) ]
finalnormCSVFile = 'finalNormalizedDataFrame2.csv'

count = 0
start_time = datetime.now()
t1 = start_time
print("start: " + str(start_time) + "\n")


for picklefile in pickleFiles: 
    if count%100 == 0:
        t2 = datetime.now()
        print(str(t2))
        print('count = ' + str(count))
        print('time: ' + str(t2 - t1) + '\n')
        t1 = t2

    #DataFrame Manipulation:
    df = pd.read_pickle(path + picklefile)

    df['ConcatIndex'] = 100000*df.FileID + df.ID
    for i in range(0, len(df)):
        df.loc[i, 'Concatenated String Index'] = str(df['ConcatIndex'][i]).zfill(10)
    df.index = df.ConcatIndex


    #DataFrame Normalization:
    dftemp = df.very_deep_copy()
    for string in head:
        if string in exclude:
            if string != 'ConcatIndex':
                dftemp.drop(string, axis=1, inplace=True)
        else:
            if 'Real ' in string:
                max = pd.DataFrame.max(df[string.strip('Real ')])
            elif 'child' in string:
                max = pd.DataFrame.max(df[string.strip('child')+'desc'])
            else:
                max = pd.DataFrame.max(df[string])

            if max != 0:
                dftemp[string] = dftemp[string]/max

    dftemp.drop('ConcatIndex', axis=1, inplace=True)

    #Saving DataFrame in CSV:
    if picklefile == '0000.p':
        dftemp.to_csv(finalnormCSVFile)
    else:
        dftemp.to_csv(finalnormCSVFile, mode='a', header=False)

    count += 1

print('count = ' + str(count))
cycle_end_time = datetime.now()
print("end of cycle: " + str(cycle_end_time) + "\n")

end_time = datetime.now()
print("end: " + str(end_time))
print('total duration: ' + str(end_time - start_time) + '\n')

Update #2:

As suggested I executed the command %prun %run "./DataSetNormalization.py" for the first couple of hundred picklefiles and the result is as followed:

   136373640 function calls (136342619 primitive calls) in 1018.769 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      220  667.069    3.032  667.069    3.032 {method 'close' of '_io.TextIOWrapper' objects}
     1540   42.046    0.027   46.341    0.030 {pandas.lib.write_csv_rows}
      219   34.886    0.159   34.886    0.159 {built-in method collect}
     3520   16.782    0.005   16.782    0.005 {pandas.algos.take_2d_axis1_object_object}
    78323    9.948    0.000    9.948    0.000 {built-in method empty}
 25336892    9.645    0.000   12.635    0.000 {built-in method isinstance}
  1433941    9.344    0.000    9.363    0.000 generic.py:1845(__setattr__)
221051/220831    7.387    0.000  119.767    0.001 indexing.py:194(_setitem_with_indexer)
   723540    7.312    0.000    7.312    0.000 {method 'reduce' of 'numpy.ufunc' objects}
   273414    7.137    0.000   20.642    0.000 internals.py:2656(set)
   604245    6.846    0.000    6.850    0.000 {method 'copy' of 'numpy.ndarray' objects}
     1760    6.566    0.004    6.566    0.004 {pandas.lib.isnullobj}
   276274    5.315    0.000    5.315    0.000 {method 'ravel' of 'numpy.ndarray' objects}
  1719244    5.264    0.000    5.266    0.000 {built-in method array}
  1102450    5.070    0.000   29.543    0.000 internals.py:1804(make_block)
  1045687    5.056    0.000   10.209    0.000 index.py:709(__getitem__)
        1    4.718    4.718 1018.727 1018.727 DataSetNormalization.py:6(<module>)
   602485    4.575    0.000   15.087    0.000 internals.py:2586(iget)
   441662    4.562    0.000   33.386    0.000 internals.py:2129(apply)
   272754    4.550    0.000    4.550    0.000 internals.py:1291(set)
   220883    4.073    0.000    4.073    0.000 {built-in method charmap_encode}
  4781222    3.805    0.000    4.349    0.000 {built-in method getattr}
    52143    3.673    0.000    3.673    0.000 {built-in method truediv}
  1920486    3.671    0.000    3.672    0.000 {method 'get_loc' of 'pandas.index.IndexEngine' objects}
  1096730    3.513    0.000    8.370    0.000 internals.py:3035(__init__)
   875899    3.508    0.000   14.458    0.000 series.py:134(__init__)
   334357    3.420    0.000    3.439    0.000 {pandas.lib.infer_dtype}
  2581268    3.419    0.000    4.774    0.000 {pandas.lib.values_from_object}
  1102450    3.036    0.000    6.110    0.000 internals.py:59(__init__)
   824856    2.888    0.000   45.749    0.000 generic.py:1047(_get_item_cache)
  2424185    2.657    0.000    3.870    0.000 numeric.py:1910(isscalar)
   273414    2.505    0.000    9.332    0.000 frame.py:2113(_sanitize_column)
  1646198    2.491    0.000    2.880    0.000 index.py:698(__contains__)
   879639    2.461    0.000    2.461    0.000 generic.py:87(__init__)
   552988    2.385    0.000    4.451    0.000 internals.py:3565(_get_blkno_placements)
   824856    2.349    0.000   51.282    0.000 frame.py:1655(__getitem__)
   220831    2.224    0.000   21.670    0.000 internals.py:460(setitem)
   326437    2.183    0.000   11.352    0.000 common.py:1862(_possibly_infer_to_datetimelike)
   602485    2.167    0.000   16.974    0.000 frame.py:1982(_box_item_values)
   602485    2.087    0.000   23.202    0.000 internals.py:2558(get)
   770739    2.036    0.000    6.471    0.000 internals.py:1238(__init__)
   276494    1.966    0.000    1.966    0.000 {pandas.lib.get_blkno_indexers}
10903876/10873076    1.935    0.000    1.972    0.000 {built-in method len}
   220831    1.924    0.000   76.647    0.000 indexing.py:372(setter)
      220    1.893    0.009    1.995    0.009 {built-in method load}
  1920486    1.855    0.000    8.198    0.000 index.py:1173(get_loc)
   112860    1.828    0.000    9.607    0.000 common.py:202(_isnull_ndarraylike)
   602485    1.707    0.000    8.903    0.000 series.py:238(from_array)
   875899    1.688    0.000    2.493    0.000 series.py:263(_set_axis)
     3300    1.661    0.001    1.661    0.001 {method 'tolist' of 'numpy.ndarray' objects}
  1102670    1.609    0.000    2.024    0.000 internals.py:108(mgr_locs)
  4211850    1.593    0.000    1.593    0.000 {built-in method issubclass}
  1335546    1.501    0.000    2.253    0.000 generic.py:297(_get_axis_name)
   273414    1.411    0.000   37.866    0.000 frame.py:1994(__setitem__)
   441662    1.356    0.000    7.884    0.000 indexing.py:982(_convert_to_indexer)
   220831    1.349    0.000  131.331    0.001 indexing.py:95(__setitem__)
   273414    1.329    0.000   23.170    0.000 generic.py:1138(_set_item)
   326437    1.276    0.000    6.203    0.000 fromnumeric.py:2259(prod)
   274734    1.271    0.000    2.113    0.000 shape_base.py:60(atleast_2d)
   273414    1.242    0.000   34.396    0.000 frame.py:2072(_set_item)
   602485    1.183    0.000    1.979    0.000 generic.py:1061(_set_as_cached)
   934422    1.175    0.000    1.894    0.000 {method 'view' of 'numpy.ndarray'objects}
     1540    1.144    0.001   58.217    0.038 format.py:1409(_save_chunk)
   220831    1.144    0.000    9.198    0.000 indexing.py:139(_convert_tuple)
   441662    1.137    0.000    3.036    0.000 indexing.py:154(_convert_scalar_indexer)
   220831    1.087    0.000    1.281    0.000 arrayprint.py:343(array2string)
  1332026    1.056    0.000    3.997    0.000 generic.py:310(_get_axis)
   602485    1.046    0.000    9.949    0.000 frame.py:1989(_box_col_values)
      220    1.029    0.005    1.644    0.007 internals.py:2429(_interleave)
   824856    1.025    0.000   46.777    0.000 frame.py:1680(_getitem_column)
  1491578    1.022    0.000    2.990    0.000 common.py:58(_check)
   782616    1.010    0.000    3.513    0.000 numeric.py:394(asarray)
   290354    0.988    0.000    1.386    0.000 internals.py:1950(shape)
   220831    0.958    0.000   15.392    0.000 generic.py:2101(copy)
   273414    0.940    0.000    1.796    0.000 indexing.py:1520(_convert_to_index_sliceable)
   220831    0.920    0.000    1.558    0.000 common.py:1110(_possibly_downcast_to_dtype)
   220611    0.914    0.000    0.914    0.000 {pandas.lib.is_bool_array}
   498646    0.906    0.000    0.906    0.000 {method 'clear' of 'dict' objects}
   715345    0.848    0.000   13.083    0.000 common.py:132(_isnull_new)
   452882    0.824    0.000    1.653    0.000 index.py:256(__array_finalize__)
   602485    0.801    0.000    0.801    0.000 internals.py:208(iget)
    52583    0.748    0.000    2.038    0.000 common.py:1223(_fill_zeros)
   606005    0.736    0.000    6.755    0.000 internals.py:95(make_block_same_class)
   708971    0.732    0.000    2.156    0.000 internals.py:3165(values)
  1760378    0.724    0.000    0.724    0.000 internals.py:2025(_get_items)
   109560    0.720    0.000    6.140    0.000 nanops.py:152(_get_values)
   220831    0.718    0.000   11.017    0.000 internals.py:2395(copy)
   924669    0.712    0.000    1.298    0.000 common.py:2248(_get_dtype_type)
  1515796    0.698    0.000    0.868    0.000 {built-in method hasattr}
   220831    0.670    0.000    4.299    0.000 internals.py:435(copy)
   875899    0.661    0.000    0.661    0.000 series.py:285(_set_subtyp)
   220831    0.648    0.000    0.649    0.000 {method 'get_value' of 'pandas.index.IndexEngine' objects}
   452882    0.640    0.000    0.640    0.000 index.py:218(_reset_identity)
   715345    0.634    0.000    1.886    0.000 {pandas.lib.isscalar}
     1980    0.626    0.000    1.172    0.001 internals.py:3497(_merge_blocks)
   220831    0.620    0.000    2.635    0.000 common.py:1933(_is_bool_indexer)
   272754    0.608    0.000    0.899    0.000 internals.py:1338(should_store)
   220831    0.599    0.000    3.463    0.000 series.py:482(__getitem__)
   498645    0.591    0.000    1.497    0.000 generic.py:1122(_clear_item_cache)
  1119390    0.584    0.000    1.171    0.000 index.py:3936(_ensure_index)
   220831    0.573    0.000    1.883    0.000 index.py:222(view)
   814797    0.555    0.000    0.905    0.000 internals.py:3086(_values)
    52583    0.543    0.000   15.545    0.000 ops.py:469(wrapper)
   220831    0.536    0.000    3.760    0.000 internals.py:371(_try_cast_result)
   228971    0.533    0.000    0.622    0.000 generic.py:1829(__getattr__)
   769651    0.528    0.000    0.528    0.000 {built-in method min}
   224351    0.509    0.000    2.030    0.000 generic.py:1099(_maybe_update_cacher)
   ...

I will rerun it for confirmation but looks like it certainly has something to do with pandas' to_csv() method, because most of the run time is used on io and the csv writer. Why is it having this effect? Any suggestions?

Update #3:

Well, I did a full %prun test and indeed almost 90% of the time spent is used on {method 'close' of '_io.TextIOWrapper' objects}. So I guess here's the problem... What do you guys think?

My questions here are:

  1. What originates here the decrease in performance?
  2. Does pandas.DataFrames.to_csv() append mode load the whole file each time it writes to it?
  3. Is there a way to enhance the process?
CMPSoares
  • 4,175
  • 3
  • 24
  • 42
  • 1
    It's going to be hard to debug this without sample code that actually demonstrates the problem while writing to a file. – BrenBarn Mar 26 '15 at 04:54
  • 1
    Are you sure the time lag is from the .to_csv write and not the DataFrame process (for which you didn't provide any code). Try skipping the .csv write and just print the time to see if you still have the same performance issue. – Alexander Mar 26 '15 at 04:56
  • @Alexander added full source code will test your suggestion in a couple of minutes. – CMPSoares Mar 26 '15 at 05:15
  • 1
    Depending on memory constraints, have you tried to concatenate in memory, for example, every 100 DataFrames and then save the batch DataFrame to .csv? For example, if you appended the results of every data frame to a list and then concatenated the list to a data frame before exporting. – Alexander Mar 26 '15 at 07:30
  • Yes @ Alexander I did. But would take even longer. The final csv has almost 6GB in size. – CMPSoares Mar 26 '15 at 16:51

2 Answers2

3

In these kind of situation you should profile your code (to see which function calls are taking the most time), that way you can check empirically that it is indeed slow in the read_csv rather than elsewhere...

From looking at your code: Firstly there's a lot of copying here and a lot of looping (not enough vectorization)... everytime you see looping look for a way to remove it. Secondly, when you use things like zfill, I wonder if you want to_fwf (fixed width format) rather than to_csv?

Some sanity testing: Are some files are significantly bigger than others (which could lead to you hitting swap)? Are you sure the largest files are only 1200 rows?? Have your checked this? e.g. using wc -l.

IMO I think it unlikely to be garbage collection.. (as was suggested in the other answer).


Here are a few improvements on your code, which should improve the runtime.

Columns are fixed I would extract the column calculations and vectorize the real, child and other normalizations. Use apply rather than iterating (for zfill).

columns_to_drop = set(head) & set(exclude)  # maybe also - ['ConcatIndex']
remaining_cols = set(head) - set(exclude)
real_cols = [r for r in remaining_cols if 'Real ' in r]
real_cols_suffix = [r.strip('Real ') for r in real]
remaining_cols = remaining_cols - real_cols
child_cols = [r for r in remaining_cols if 'child' in r]
child_cols_desc = [r.strip('child'+'desc') for r in real]
remaining_cols = remaining_cols - child_cols

for count, picklefile in enumerate(pickleFiles):
    if count % 100 == 0:
        t2 = datetime.now()
        print(str(t2))
        print('count = ' + str(count))
        print('time: ' + str(t2 - t1) + '\n')
        t1 = t2

    #DataFrame Manipulation:
    df = pd.read_pickle(path + picklefile)

    df['ConcatIndex'] = 100000*df.FileID + df.ID
    # use apply here rather than iterating
    df['Concatenated String Index'] = df['ConcatIndex'].apply(lambda x: str(x).zfill(10))
    df.index = df.ConcatIndex

    #DataFrame Normalization:
    dftemp = df.very_deep_copy()  # don't *think* you need this

    # drop all excludes
    dftemp.drop(columns_to_drop), axis=1, inplace=True)

    # normalize real cols
    m = dftemp[real_cols_suffix].max()
    m.index = real_cols
    dftemp[real_cols] = dftemp[real_cols] / m

    # normalize child cols
    m = dftemp[child_cols_desc].max()
    m.index = child_cols
    dftemp[child_cols] = dftemp[child_cols] / m

    # normalize remaining
    remaining = list(remaining - child)
    dftemp[remaining] = dftemp[remaining] / dftemp[remaining].max()

    # if this case is important then discard the rows of m with .max() is 0
    #if max != 0:
    #    dftemp[string] = dftemp[string]/max

    # this is dropped earlier, if you need it, then subtract ['ConcatIndex'] from columns_to_drop
    # dftemp.drop('ConcatIndex', axis=1, inplace=True)

    #Saving DataFrame in CSV:
    if picklefile == '0000.p':
        dftemp.to_csv(finalnormCSVFile)
    else:
        dftemp.to_csv(finalnormCSVFile, mode='a', header=False)

As a point of style I would probably choose to wrap each of these parts into functions, this will also mean more things can be gc'd if that really was the issue...


Another options which would be faster is to use pytables (HDF5Store) if you didn't need to resulting output to be csv (but I expect you do)...

The best thing to do by far is to profile your code. e.g. with %prun in ipython e.g. see http://pynash.org/2013/03/06/timing-and-profiling.html. Then you can see it definitely is read_csv and specifically where (which line of your code and which lines of pandas code).


Ah ha, I'd missed that you are appending all these to a single csv file. And in your prun it shows most of the time is spent in close, so let's keep the file open:

# outside of the for loop (so the file is opened and closed only once)
f = open(finalnormCSVFile, 'w')

...
for picklefile in ...

    if picklefile == '0000.p':
        dftemp.to_csv(f)
    else:
        dftemp.to_csv(f, mode='a', header=False)
...

f.close()

Each time the file is opened before it can append to, it needs to seek to the end before writing, it could be that this is the expensive (I don't see why this should be that bad, but keeping it open removes the need to do this).

Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • Hello @AndyHayden, Thank you for your suggestions. All-though, as I'm not used to vectorization and your optimizations might help a lot in performance, the big hit comes when I start using more memory and this exceeds my RAM size, so it's starts swapping (it's worst that you think, I'm using Windows). So how do you explain the fact that when I remove the line with `df.to_csv()` the execution times and memory usage become almost constant (check update #1)? – CMPSoares Mar 31 '15 at 23:38
  • 1
    @CMPSoares Thanks, did you check the sizes of all the files (is there a big one?). It is very surprising if this a memory issue unless these files are large (but I guess it could be!)... How much RAM do you have?? – Andy Hayden Apr 01 '15 at 03:00
  • Hello @AndyHayden, files aren't differing to much, smallest is 1 MB and the biggest one is 7MB. But they're aren't ordered by size and by eye I would say most of the smaller ones are at the end. So it wouldn't be size related. I only have 8GB of RAM. The total size of all pickle files is 10.4GB. But as I go through each of them separately it shouldn't give such a hit on memory. I did this purposely because I tried doing it all in memory before and my pc crashed. – CMPSoares Apr 01 '15 at 03:12
  • Btw. @AndyHayden I added the results from the `% prun` command. – CMPSoares Apr 01 '15 at 03:13
  • 1
    @CMPSoares hmmm since so much of the time is spent in close, how's about you open each file once, thinking about it this means that it is written to in a single pass. – Andy Hayden Apr 02 '15 at 16:34
  • Looks like it did in first test @AndyHayden. I'm doing some more tests for confirmation so I can apply the bounty with a fair judgement. – CMPSoares Apr 04 '15 at 17:11
  • 1
    Well I have to give this one to you because although Geeklhem's answer helped a bit in performance it only resulted in a approx. of 5% in execution time reduction. In your case with the last edit in your answer execution time got reduced in approx. 98.3% (from 2 days and 30 minutes to 50 minutes). Thank you very much for all your help! It has been very insightful! Happy Easter! – CMPSoares Apr 04 '15 at 19:11
  • 1
    @CMPSoares awesome, glad I could help! That's a staggering difference, I feel this should be included in pandas performance tips (not sure where that would be). Wowza. – Andy Hayden Apr 04 '15 at 19:31
  • I totally agree @AndyHayden because it's a little tweak I haven't read it anywhere else before. – CMPSoares Apr 04 '15 at 19:34
  • 1
    @CMPSoares one last follow up (to see if I can recreate this / add to performance tips), what was the file size of the resulting csv? – Andy Hayden Apr 04 '15 at 19:51
  • The siz is 3,67 GB, and it's dimensions are 2118819 rows x 621 collumns. After that please send me the link of the performance tips. – CMPSoares Apr 04 '15 at 22:40
  • Hey @AndyHayden any luck with the replication? – CMPSoares Apr 05 '15 at 17:02
  • @CMPSoares not got much time atm, but it's on my todo list (will ping when I've put something together!). – Andy Hayden Apr 06 '15 at 02:35
  • I've you want I can try and put something together and contribute on performance tips. But don't know exactly where that should be... – CMPSoares Apr 06 '15 at 06:53
  • @CMPSoares not sure where, certainly post on the mailing list (I'm sure people would be interested). in my brief testing I see a more modest 33% improvement. What OS are you using? (I think probably the filesystem affects this a lot, though tbh I'm not sure how this plays out: how much is allocated for a file - when it can be appended - or whether sometimes it has to be copied... this behaviour is deciding the complexity => times of this.) – Andy Hayden Apr 14 '15 at 03:59
  • Hey @AndyHayden, sorry for the late response. I'm using windows 8.1 64-bit as OS, with Anaconda's version of python 3.4.1 (Anaconda3 2.1.0 64-bit). OS does always make a difference... Specially when where talking in differences between Windows and Unix Systems... – CMPSoares Apr 20 '15 at 19:37
  • @CMPSoares please try timing this function and report back :) https://gist.github.com/hayd/22020d3fef347229728b – Andy Hayden Apr 20 '15 at 19:46
2

My guess would be that it comes from the very_deep_copy you are doing, did you check the memory usage over time ? It is possible that the memory is not freed correctly.

If that is the problem, you could do one of the following:

1) Avoid the copying altogether (better performance-wise).

2) Force a garbage collection using gc.collect() once in a while.

See "Python garbage collection" for a probably related issue, and this article for an introduction about garbage collection in python.

Edit:

A solution to remove the copy would be to:

1) store the normalizing constant for each column before normalizing.

2) drop the columns you do not need after the normalization.

# Get the normalizing constant for each column.
max = {}

for string in head:
    if string not in exclude:
        if 'Real ' in string:
           max[string] = df[string.strip('Real ')].max()
        elif 'child' in string:
           max[string] = df[string.strip('child')+'desc'].max()
        else:
           max[string] = df[string].max()

# Actual normalization, each column is divided by
# its constant if possible. 
for key,value in max.items():
    if value != 0:
        df[key] /= value

# Drop the excluded columns 
df.drop(exclude, axis=1, inplace=True)
Community
  • 1
  • 1
Geeklhem
  • 689
  • 7
  • 12
  • 1
    What leads you to think it's gc? – Andy Hayden Mar 30 '15 at 07:00
  • Well, I tested (first 300 picklefiles) the difference between using `gc.collect()` at every iteration and before, and it does indeed make as huge difference in memory usage, all-though it is a little slower (at least for the first 300). Will do a complete test and give more feedback later. – CMPSoares Mar 31 '15 at 22:57
  • 1
    Indeed, `collect` can be slow, you do not want to run it at every single iteration. Maybe you could try to call it in your `if count%100 == 0:` block. – Geeklhem Mar 31 '15 at 23:36
  • Yes I did (for the first 300) and it gets somewhere in between with performance, but memory usage would still increase, all-though a lot less... But I think that this might slow it down in the beginning but compensate later when normally my system starts swapping... – CMPSoares Mar 31 '15 at 23:42
  • 1
    Did you try to remove the `very_deep_copy` altogether (e.g as I suggested in my answer) ? I think this is your best bet performance-wise. – Geeklhem Mar 31 '15 at 23:52
  • Can't, very_deep_copy was created to get correct data. other methods wheren't copying the dataframe but pointing it to the new value. In which certain normalizations wouldn't execute correctly. – CMPSoares Apr 01 '15 at 00:41
  • But I'm checking, and even with the garbage collection execution time grows, but when I remove the `to_csv()` block, keeping the rest of the code intact, it's constant. It must have something to do with the `pandas.DataFrame.to_csv()` append method. but I tried reading the code but this python is to advanced for my experience with it. – CMPSoares Apr 01 '15 at 00:45
  • 1
    If I understand correctly, your normalizations consist in dividing each column by the maximum of another column. You do the copy because you want to get the maximum of other columns __before__ they are normalized. A simple workaround would be to store all normalizing constants (the maximums) in a first time and modify the Dataframe afterward. I updated the proposed snippet in my answer. – Geeklhem Apr 01 '15 at 01:53
  • 1
    As for the `to_csv()` issue, I have to admit that I am having trouble finding what could cause this, as I use it quite often. At is core it use the [`csv`](https://docs.python.org/2/library/csv.html?highlight=csv%20writer#csv.writer) module of python's standard library (with a [chunk](https://github.com/pydata/pandas/blob/master/pandas/core/format.py#L1516) division and a little bit of [cython](https://github.com/pydata/pandas/blob/master/pandas/lib.pyx#L936) to speed things up). Maybe there's an issue with big files and windows ? Did you try to save into different files rather than append ? – Geeklhem Apr 01 '15 at 02:22
  • I know that there are possible workarounds to the normalization, but as it's used for one or to executions it's not a problem. I posted the question because I noticed that when using the `to_csv` method time would gradually increase. As if it's loading the whole file each time it appends another dataframe. Which wouldn't be all too efficient. And yes I tried for curiosity writing it to different files and execution is more constant and faster, although it occupies a lot more disk space. – CMPSoares Apr 01 '15 at 17:55