I have a df that contains nothing, NaN, and missing values. To remove the rows that contain missing I can use dropmissing. Are there any methods to deal with NaN and nothing?

Sample df:

│ Row │ x       │ y    │
│     │ Union…? │ Char │
├─────┼─────────┼──────┤
│ 1   │ 1.0     │ 'a'  │
│ 2   │ missing │ 'b'  │
│ 3   │ 3.0     │ 'c'  │
│ 4   │         │ 'd'  │
│ 5   │ 5.0     │ 'e'  │
│ 6   │ NaN     │ 'f'  │

Expected output:

│ Row │ x   │ y    │
│     │ Any │ Char │
├─────┼─────┼──────┤
│ 1   │ 1.0 │ 'a'  │
│ 2   │ 3.0 │ 'c'  │
│ 3   │ 5.0 │ 'e'  │

Here is what I have tried so far, based on my knowledge of Julia:

df.x = replace(df.x, NaN=>"something", missing=>"something", nothing=>"something")
print(df[df."x".!="something", :])

My code works as expected, but I feel it is an inefficient way of solving this issue. Is there a dedicated method for dealing with nothing and NaN?

Mohamed Thasin ah

1 Answer

You can do this, for example:

julia> df = DataFrame(x=[1,missing,3,nothing,5,NaN], y='a':'f')
6×2 DataFrame
│ Row │ x       │ y    │
│     │ Union…? │ Char │
├─────┼─────────┼──────┤
│ 1   │ 1.0     │ 'a'  │
│ 2   │ missing │ 'b'  │
│ 3   │ 3.0     │ 'c'  │
│ 4   │         │ 'd'  │
│ 5   │ 5.0     │ 'e'  │
│ 6   │ NaN     │ 'f'  │

julia> filter(:x => x -> !any(f -> f(x), (ismissing, isnothing, isnan)), df)
3×2 DataFrame
│ Row │ x       │ y    │
│     │ Union…? │ Char │
├─────┼─────────┼──────┤
│ 1   │ 1.0     │ 'a'  │
│ 2   │ 3.0     │ 'c'  │
│ 3   │ 5.0     │ 'e'  │

Note that the order of the checks matters here: isnan should come last, because isnan(nothing) throws a MethodError and isnan(missing) returns missing rather than a Bool.
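A quick way to see why isnan has to come last is to call it on each value directly. The sketch below uses only Base Julia. The helper is_valid_number is a hypothetical name, not something from the answer or from DataFrames.jl; it is shown as one possible way to make the check independent of ordering by testing the type first.

```julia
# isnan propagates missing instead of returning true or false.
nan_on_missing = isnan(missing)

# isnan has no method for `nothing`, so the call throws a MethodError.
nan_on_nothing_throws = try
    isnan(nothing)
    false
catch err
    err isa MethodError
end

# Hypothetical helper: the type test runs first, so isnan is only
# ever called on real numbers and the order problem goes away.
is_valid_number(x) = x isa Real && !isnan(x)

kept = filter(is_valid_number, Any[1.0, missing, 3.0, nothing, 5.0, NaN])
println(kept)
```

Note that this helper also rejects any non-numeric value, which may or may not be what you want for a mixed column.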

You could also have written it more directly as:

julia> filter(:x => x -> !(ismissing(x) || isnothing(x) || isnan(x)), df)
3×2 DataFrame
│ Row │ x       │ y    │
│     │ Union…? │ Char │
├─────┼─────────┼──────┤
│ 1   │ 1.0     │ 'a'  │
│ 2   │ 3.0     │ 'c'  │
│ 3   │ 5.0     │ 'e'  │

but I felt that the example with any is more extensible (you can then store the list of predicates to check in a variable).
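As a minimal sketch of what storing the predicates in a variable could look like, the snippet below applies them to a plain vector with the same values as the x column, so it runs without any packages. The names unwanted_checks and is_clean are hypothetical and chosen only for illustration.

```julia
# Hypothetical names; neither is provided by Base or DataFrames.jl.
# isnan stays last because it has no method for `nothing`.
unwanted_checks = (ismissing, isnothing, isnan)

# A value is clean when none of the stored predicates matches it.
is_clean(x) = !any(f -> f(x), unwanted_checks)

xs = [1, missing, 3, nothing, 5, NaN]   # same values as the x column
kept = filter(is_clean, xs)
println(kept)
```

With DataFrames loaded, the same predicate should plug into the call from the answer as filter(:x => is_clean, df). Adding another check then only means adding another function to the tuple.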

The reason DataFrames.jl only provides a function for removing missing is that missing is what data science pipelines normally treat as a valid value that you nevertheless often want to remove.

In Julia, when you see nothing or NaN you normally want to handle them differently from missing, because they most likely signal an error in the data or in its processing (whereas missing signals that the data was simply not collected).
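Since these three values tend to mean different things, it can help to count each kind separately before deciding what to drop. This is only a sketch on a plain vector using Base functions; the variable names are made up for the example.

```julia
xs = [1, missing, 3, nothing, 5, NaN]   # same values as the x column

n_missing = count(ismissing, xs)
n_nothing = count(isnothing, xs)
# Guard with a type test so isnan is never called on missing or nothing.
n_nan = count(x -> x isa AbstractFloat && isnan(x), xs)

println("missing: ", n_missing, ", nothing: ", n_nothing, ", NaN: ", n_nan)
```

A non-zero count of nothing or NaN is then a prompt to look at where those values came from, rather than a reason to filter them out silently.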

Bogumił Kamiński