6

I would like to calculate the square of a column A 1,2,3,4, process it with other calculation store it in column C

using CSV, DataFrames
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
df.C = ((((df.A./2).^2).*3.14)./1000)

Is there an easier way to write it?

DV82XL
  • 5,350
  • 5
  • 30
  • 59
ecjb
  • 5,169
  • 12
  • 43
  • 79

2 Answers2

4

I am not sure how much shorter you would want the formula to be, but you can write:

df.C = @. (df.A / 2) ^ 2 * 3.14 / 1000

to avoid having to write . everywhere.

Or you can use transform!, but it is not shorter (its benefit is that you can uset it in a processing pipeline, e.g. using Pipe.jl):

transform!(df, :A => ByRow(a -> (a / 2) ^ 2 * 3.14 / 1000) => :C)
Bogumił Kamiński
  • 66,844
  • 3
  • 80
  • 107
  • Many thanks to both of you @Bogumił Kamiński and Przemyslaw Szufel! As both your answers are equally good, I upvote both of them – ecjb Jul 09 '20 at 20:08
3

Try this:

df.D = .5df.A .^2 * 0.00314

Explanation:

  • not so many parentheses needed
  • multiplying scalar by vector is here as good as the vectorization for short vectors (up two something like 100 elements)

A simple benchmark using BenchmarkTools:

julia> @btime $df.E = .5*$df.A .^2 * 0.00314;
  592.085 ns (9 allocations: 496 bytes)

julia> @btime $df.F = @. ($df.A / 2) ^ 2 * 0.00314;
  875.490 ns (11 allocations: 448 bytes)

The fastest is however a longer version where you provide the type information @. (df.A::Vector{Int} / 2) ^ 2 * 0.00314 (again this matters rather for short DataFrames and note that here the Z column must exist so we create it here):

julia> @btime begin $df.Z = Vector{Float64}(undef, nrow(df));@. $df.Z = ($df.A::Vector{Int} / 2.0) ^ 2.0 * 0.00314; end;
  162.564 ns (3 allocations: 208 bytes)
Przemyslaw Szufel
  • 40,002
  • 3
  • 32
  • 62
  • 1
    Not doing broadcasting, as you propose is actually much slower than broadcasting everything, as you are allocating 3 times not 1 time. – Bogumił Kamiński Jul 09 '20 at 15:36
  • Actually it depends on the length of the vector (number of rows of the `DataFrame`) for short vectors algebraic multiplication is faster. The break-even is around 100 rows. I edited the answer. – Przemyslaw Szufel Jul 09 '20 at 15:58
  • But this is only due to type instability of the code in the examples you benchmark. If you made the operation type stable broadcasting everything is faster. – Bogumił Kamiński Jul 09 '20 at 16:07
  • I added the type stable version to the benchmarks and then indeed for such short vector this is many times faster but the code is kind of long :-) – Przemyslaw Szufel Jul 09 '20 at 16:28
  • Many thanks to both of you @Przemyslaw Szufel and Bogumił Kamiński. As both your answers are equally good, I upvote both of them – ecjb Jul 09 '20 at 20:08