Manipulating data in DataFrame: how to calculate the square of a column

Question

I would like to calculate the square of a column A 1,2,3,4, process it with other calculation store it in column C

using CSV, DataFrames
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
df.C = ((((df.A./2).^2).*3.14)./1000)

Is there an easier way to write it?

score 4 · Answer 1 · answered Jul 09 '20 at 15:33

4

I am not sure how much shorter you would want the formula to be, but you can write:

df.C = @. (df.A / 2) ^ 2 * 3.14 / 1000

to avoid having to write . everywhere.

Or you can use transform!, but it is not shorter (its benefit is that you can uset it in a processing pipeline, e.g. using Pipe.jl):

transform!(df, :A => ByRow(a -> (a / 2) ^ 2 * 3.14 / 1000) => :C)

answered Jul 09 '20 at 15:33

Bogumił Kamiński

66,844
3
80
107

Many thanks to both of you @Bogumił Kamiński and Przemyslaw Szufel! As both your answers are equally good, I upvote both of them – ecjb Jul 09 '20 at 20:08

Przemyslaw Szufel · Answer 2 · 2020-07-09T16:28:30.120

3

Try this:

df.D = .5df.A .^2 * 0.00314

Explanation:

not so many parentheses needed
multiplying scalar by vector is here as good as the vectorization for short vectors (up two something like 100 elements)

A simple benchmark using BenchmarkTools:

julia> @btime $df.E = .5*$df.A .^2 * 0.00314;
  592.085 ns (9 allocations: 496 bytes)

julia> @btime $df.F = @. ($df.A / 2) ^ 2 * 0.00314;
  875.490 ns (11 allocations: 448 bytes)

The fastest is however a longer version where you provide the type information @. (df.A::Vector{Int} / 2) ^ 2 * 0.00314 (again this matters rather for short DataFrames and note that here the Z column must exist so we create it here):

julia> @btime begin $df.Z = Vector{Float64}(undef, nrow(df));@. $df.Z = ($df.A::Vector{Int} / 2.0) ^ 2.0 * 0.00314; end;
  162.564 ns (3 allocations: 208 bytes)

edited Jul 09 '20 at 16:28

answered Jul 09 '20 at 15:32

Przemyslaw Szufel

40,002
3
32
62

1

Not doing broadcasting, as you propose is actually much slower than broadcasting everything, as you are allocating 3 times not 1 time. – Bogumił Kamiński Jul 09 '20 at 15:36
Actually it depends on the length of the vector (number of rows of the `DataFrame`) for short vectors algebraic multiplication is faster. The break-even is around 100 rows. I edited the answer. – Przemyslaw Szufel Jul 09 '20 at 15:58
But this is only due to type instability of the code in the examples you benchmark. If you made the operation type stable broadcasting everything is faster. – Bogumił Kamiński Jul 09 '20 at 16:07
I added the type stable version to the benchmarks and then indeed for such short vector this is many times faster but the code is kind of long :-) – Przemyslaw Szufel Jul 09 '20 at 16:28
Many thanks to both of you @Przemyslaw Szufel and Bogumił Kamiński. As both your answers are equally good, I upvote both of them – ecjb Jul 09 '20 at 20:08

Manipulating data in DataFrame: how to calculate the square of a column

2 Answers2