
I currently have a Julia dataframe of the form

A B
"[1,2]" "[3,4]"

and would like to make it of the form

A1 A2 B1 B2
1 2 3 4

or of the following form, where the vectors are no longer strings:

| A | B |
|---|---|
|[1,2]|[3,4]|

Is there any way to do this? I have already looked at a few posts where people convert vectors of the form ["1", "2"] to the form [1,2], but nothing along the lines of what I have.

Thanks for the help.

2 Answers


Here is one way to do it:

julia> using DataFrames

julia> df = DataFrame(A="[1,2]", B="[3,4]")
1×2 DataFrame
 Row │ A       B
     │ String  String
─────┼────────────────
   1 │ [1,2]   [3,4]

julia> select(df, [:A, :B] .=>
                  ByRow(x -> parse.(Int, split(chop(x, head=1, tail=1), ','))) .=>
                  [[:A1, :A2], [:B1, :B2]])
1×4 DataFrame
 Row │ A1     A2     B1     B2
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      2      3      4

If anything needs explaining, please ask in a comment.
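To unpack the transformation in the `select` call: the anonymous function passed to `ByRow` strips the surrounding brackets, splits on commas, and parses each piece as an `Int`. The sketch below runs the same steps on a single string outside the data frame. The name `parse_int_list` is a hypothetical helper introduced here for illustration; it is not part of the answer or of DataFrames.jl, and it assumes every cell is a non-empty, bracketed, comma-separated list of integers.

```julia
# Hypothetical helper: the same steps as the anonymous function in the answer.
function parse_int_list(s::AbstractString)
    inner = chop(s, head=1, tail=1)   # drop the leading '[' and the trailing ']'
    tokens = split(inner, ',')        # split the remainder on commas
    return parse.(Int, tokens)        # convert each piece to an Int
end

parse_int_list("[1,2]")   # returns [1, 2]
```

Because `parse` only accepts integer literals here, malformed input raises an error rather than being evaluated.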

Bogumił Kamiński

Define

# WARNING: `parsedf` parses/evaluates input text and
# is therefore an infosec weakness. Should be used only
# with properly vetted input.

parsedf(df) = DataFrame([c=>eval.(Meta.parse.(df[:,c])) 
                         for c in names(df)])

spreaddf(df) = DataFrame([c*"$i" => get.(df[:, c],i,missing) 
                          for (c, i) in vcat([[(names(df)[i],j)
                          for j=1:L] 
                          for (i,L) in enumerate([maximum(length.(df[:,i])) 
                          for i in 1:ncol(df)])]...)]...)

Now,

julia> df = DataFrame(A=["[1,2]"],B=["[3,4]"])
1×2 DataFrame
 Row │ A       B      
     │ String  String 
─────┼────────────────
   1 │ [1,2]   [3,4]

julia> spreaddf(parsedf(df))
1×4 DataFrame
 Row │ A1     A2     B1     B2    
     │ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 │     1      2      3      4

seems to do it.

Also,

julia> spreaddf(parsedf(DataFrame(A=["[1,2]","[5,6]"], B=["[3,4]","[7,8,9]"])))
2×5 DataFrame
 Row │ A1     A2     B1     B2     B3      
     │ Int64  Int64  Int64  Int64  Int64?  
─────┼─────────────────────────────────────
   1 │     1      2      3      4  missing 
   2 │     5      6      7      8        9

seems appropriate.

Dan Getz
  • Note that the OP has strings, not vectors, in the cells of the source data frame. Most likely the OP saved a vector of vectors to a data frame in CSV and then read it back. It would be better to use e.g. Arrow.jl for saving such data. – Bogumił Kamiński Jul 19 '22 at 22:12
  • Thanks @BogumiłKamiński for pointing out the nuance I missed in the OP's question. Fixed it somehow (quite dangerously in terms of infosec, but that is not part of the request). – Dan Getz Jul 19 '22 at 22:27
  • It is worth amplifying @DanGetz comment about infosec. The code here calls `Meta.parse`. This is almost certainly just what the OP was asking for, but it is most certainly not what they *should* be asking for. Parsing user input with a general parser is incredibly dangerous, even in one-off situations (since that kind of code tends to out-live expectations). See the recent log4j debacle for a clear example of how innocuous it can seem to let this sort of thing into a codebase. – Ted Dunning Jul 20 '22 at 02:32
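
Following up on the comments about `Meta.parse`: when the cells are known to hold bracketed lists of integers, a narrow parser avoids evaluating arbitrary text. The sketch below produces the second form asked for in the question (vectors in the cells instead of strings). `parse_int_list` is a hypothetical helper name, and the input format is an assumption; anything that is not an integer makes `parse` throw instead of being executed.

```julia
using DataFrames

# Hypothetical helper: strip the brackets, split on commas, parse each piece as Int.
parse_int_list(s::AbstractString) = parse.(Int, split(chop(s, head=1, tail=1), ','))

df = DataFrame(A=["[1,2]", "[5,6]"], B=["[3,4]", "[7,8,9]"])

# Apply the helper to every column row by row; renamecols=false keeps the names A and B.
parsed = transform(df, names(df) .=> ByRow(parse_int_list); renamecols=false)
```

Each cell of `parsed` then holds a `Vector{Int}`, so rows with different lengths (such as `[7,8,9]`) need no padding with `missing`.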