
I am trying to speed up a function that is applied to a dataframe. I originally used iterrows, but that is definitely slower, so I switched to apply. It definitely improved, but I would like to use np.vectorize for better performance. My problem is how to pass the columns of the dataframe to the function (assume the number of columns or the column names can vary). How can I pass the columns and then iterate through them? I guess I can use *args in the function parameters, but how do I pass in and unpack the columns of the df? I also don't want to make copies of the df (assume the df is extremely large). I hope my question is clear.

For example, let's say I just want to print the data in some format like so:

data = [ {"a": str(x), "b": x, "c": x} for x in range(10)]
df = pd.DataFrame(data)

def func(row):
    print(f"{row.to_json()}:", end="")

_ = df.apply(func, axis=1)

This was much faster than doing iterrows, but how can I improve it further? Also assume, as in this example, that the number of columns and their names can vary.
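
Roughly, this is the np.vectorize / *args pattern I have in mind, reusing the df from the example above (vec_func and the output format are just placeholders, and I am not sure this is the right approach or actually faster):

import numpy as np

def vec_func(*values):
    # receives one scalar per column for each row, in df.columns order
    return str(values)

# unpack each column as a separate positional argument; otypes avoids the
# extra call np.vectorize makes on the first row to infer the output dtype
vectorized = np.vectorize(vec_func, otypes=[object])
out = vectorized(*[df[col] for col in df.columns])
print(":".join(out), end=":")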

user1179317
    from the docs: "The [vectorize](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html) function is provided primarily for convenience, not for performance. The implementation is essentially a for loop." – acushner Aug 14 '20 at 02:06
  • it really depends on _what exactly_ you want to do as to whether or not you can speed it up using built-in numpy functions that are vectorized at the machine level. – acushner Aug 14 '20 at 02:07
  • You can use list comprehension `[func(*x) for x in zip(df[col1],df[col2]...)]` – Henry Yik Aug 14 '20 at 02:09
  • if you can use numpy/pandas functions, then go ahead. You've got numba and cython for performance improvement. If they are string functions however, then running within python will likely give you faster speed than using Pandas string methods. – sammywemmy Aug 14 '20 at 02:11
  • here is a long blog post on performance in pandas, by Tom Augspurger, a pandas core contributor: https://tomaugspurger.github.io/modern-4-performance – jsmart Aug 14 '20 at 03:09
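
Expanding Henry Yik's comment above into a hedged sketch that works when the column names are not known in advance (row_func and the output format are placeholders; whether this actually beats apply depends on the work done per row):

def row_func(*values):
    # one scalar per column, in df.columns order
    return str(values)

# zip the columns together so each iteration yields one row's values
out = [row_func(*values) for values in zip(*(df[col] for col in df.columns))]
print(":".join(out), end=":")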

1 Answer


In general, in order to improve performance, you need to use numpy / pandas built-in methods as much as possible and resort to apply or numpy.vectorize only when it is absolutely necessary. For more information, please refer to this post from cs95.

For example, the functionality in your post can be implemented using pandas.DataFrame.to_json:

Method 1:

print(df.to_json(orient="records", lines=True).replace("\n", ':'), end=':')

Method 2:

print(':'.join(df.to_json(orient="records", lines=True).split()), end=':')

Output:

{"a":"0","b":0,"c":0}:{"a":"1","b":1,"c":1}:{"a":"2","b":2,"c":2}:{"a":"3","b":3,"c":3}:{"a":"4","b":4,"c":4}:{"a":"5","b":5,"c":5}:{"a":"6","b":6,"c":6}:{"a":"7","b":7,"c":7}:{"a":"8","b":8,"c":8}:{"a":"9","b":9,"c":9}:

The performance test code and results are shown below. For comparison purposes, I also added variants of your original solution that print only once.

import sys, os
import pandas as pd

data = [ {"a": str(x), "b": x, "c": x} for x in range(10)]
df = pd.DataFrame(data)

def func(row, file=sys.stdout):
    print(f"{row.to_json()}:", end="", file=file)

def func2(row):
    return row.to_json()

null_output = open(os.devnull, 'w')

%timeit df.apply(func, axis=1, file=null_output)
%timeit print(':'.join(df.apply(func2, axis=1)), end=':', file=null_output)
%timeit print(':'.join(df.apply(pd.Series.to_json, axis=1)), end=':', file=null_output)
%timeit print(df.to_json(orient="records", lines=True).replace("\n", ':'), end=':', file=null_output)
%timeit print(':'.join(df.to_json(orient="records", lines=True).split()), end=':', file=null_output)

Output:

1.67 ms ± 13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.67 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.68 ms ± 42.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
51.8 µs ± 1.61 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
51.4 µs ± 728 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
GZ0
  • Yes, I am actually already using the built-in df.to_json function. The problem is that my df is quite large, let's say 4GB in memory; unfortunately, df.to_json will more than double the memory use, so I am trying to iterate through the df and just print it out. – user1179317 Aug 14 '20 at 03:27
  • It would be better that you mentioned this in your OP. In that case, you can split your dataframe into several chunks rather than iterating row by row. – GZ0 Aug 14 '20 at 03:50
  • Yes I should've. Would splitting the dataframe into chunks make a copy of the dataframe though? – user1179317 Aug 14 '20 at 03:58
  • As far as I can recall, slicing on rows would make a copy in most cases unless the columns of the slice are of the same *numerical* `dtype`. Here are some possible options: (1) set the chunk size to make the copy fit in the memory; (2) save the output into a file by passing the output path to `df.to_json()` and then operate on the file to get your desired output format; (3) if the dataframe is directly obtained from a file, you could consider reading it in chunks (see [this](https://pythonspeed.com/articles/chunking-pandas/)). – GZ0 Aug 14 '20 at 04:23
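
A minimal sketch of the chunking idea from the comments above (the chunk_size value is arbitrary and would need tuning so that one chunk's JSON string comfortably fits in memory):

import sys
import pandas as pd

data = [{"a": str(x), "b": x, "c": x} for x in range(10)]
df = pd.DataFrame(data)

chunk_size = 4  # tune this for the real dataframe
for start in range(0, len(df), chunk_size):
    # iloc slicing may copy, but only chunk_size rows at a time
    chunk = df.iloc[start:start + chunk_size]
    # to_json(lines=True) separates records with "\n"; convert to the ":"-delimited format
    sys.stdout.write(chunk.to_json(orient="records", lines=True).replace("\n", ":") + ":")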