Questions tagged [missing-data]

For questions relating to missing-data problems, which can involve special data structures, algorithms, statistical methods, modeling techniques, and visualization, among other considerations.

When working with data in regular data structures (e.g. tables, matrices, arrays, tensors), some values may be unobserved, corrupted, or not yet observed. Such values must be explicitly marked as missing, and deciding how to impute them or otherwise use them in standard analyses raises methodological questions. The problem is especially acute in data-intensive contexts, such as large statistical analyses of databases.
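As a minimal sketch of the two concerns above (marking values as missing and imputing them), the following standard-library Python example uses `None` as the missing-value marker and fills gaps with the mean of the observed values. The function name `impute_with_mean` and the sample `readings` are hypothetical, and mean imputation is only one simple choice among many; it is not appropriate for every analysis.

```python
from statistics import mean

def impute_with_mean(values):
    """Return (filled, was_missing) for a list in which None marks a missing value.

    Hypothetical helper for illustration only.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    # Keep a mask so that imputed values remain distinguishable from observed ones.
    was_missing = [v is None for v in values]
    filled = [fill if v is None else v for v in values]
    return filled, was_missing

readings = [2.0, None, 4.0, None, 6.0]  # made-up sample data
filled, was_missing = impute_with_mean(readings)
print(filled)       # [2.0, 4.0, 4.0, 4.0, 6.0]
print(was_missing)  # [False, True, False, True, False]
```

Keeping the mask alongside the filled values lets later steps tell which entries were imputed rather than observed.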

Missing data occur in many fields, from survey data to industrial data. There are many underlying missing-data mechanisms (reasons why the data are missing). In survey data, for example, values might be missing because of drop-out, or because people answering the survey ran out of time.

Rubin classified missing data into three types:

  1. missing completely at random;
  2. missing at random;
  3. missing not at random.

Note that some statistical analyses are valid only under certain of these classes.
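The three classes can be illustrated with a small simulation. This is a sketch under stated assumptions, not a definitive implementation: the variables `age` and `income`, the cutoffs, and the missingness probabilities are all hypothetical, and `None` marks a missing value.

```python
import random

def simulate_missingness(n=1000, seed=0):
    """Simulate one fully observed variable (age) and one partly missing variable (income).

    All names, cutoffs, and probabilities here are hypothetical.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(20, 70)
        income = 1000 + 50 * age + rng.gauss(0, 500)
        rows.append((age, income))

    # Missing completely at random: the same 30% chance for every value.
    mcar = [None if rng.random() < 0.3 else inc for _, inc in rows]
    # Missing at random: the chance depends only on the observed variable (age).
    mar = [None if rng.random() < (0.6 if age > 45 else 0.1) else inc
           for age, inc in rows]
    # Missing not at random: the chance depends on the unobserved value itself.
    mnar = [None if rng.random() < (0.6 if inc > 3250 else 0.1) else inc
            for _, inc in rows]
    return rows, mcar, mar, mnar

def observed_mean(values):
    """Mean of the non-missing entries."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)

rows, mcar, mar, mnar = simulate_missingness()
full_mean = sum(inc for _, inc in rows) / len(rows)
print("full data:", round(full_mean))
print("MCAR, observed only:", round(observed_mean(mcar)))
print("MAR, observed only:", round(observed_mean(mar)))
print("MNAR, observed only:", round(observed_mean(mnar)))
```

In this setup the mean of the observed values stays close to the full-data mean under the first mechanism. Under the other two it drifts downward, because higher incomes are more likely to be missing. Under missing at random the drift can in principle be corrected by conditioning on age, whereas under missing not at random it cannot be corrected from the observed data alone.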

2809 questions
1068 votes
20 answers

Remove rows with all or some NAs (missing values) in data.frame

I'd like to remove the lines in this data frame that: a) contain NAs across all columns. Below is my example data frame. gene hsap mmul mmus rnor cfam 1 ENSG00000208234 0 NA NA NA NA 2 ENSG00000199674 0 2 2 2 …
Benoit B.
  • 11,854
  • 8
  • 26
  • 29
149 votes
10 answers

How to lowercase a pandas dataframe string column if it has missing values?

The following code does not work. import pandas as pd import numpy as np df=pd.DataFrame(['ONE','Two', np.nan],columns=['x']) xLower = df["x"].map(lambda x: x.lower()) How should I tweak it to get xLower = ['one','two',np.nan] ? Efficiency is…
P.Escondido
  • 3,373
  • 6
  • 23
  • 29
94 votes
2 answers

str.format() raises KeyError

The following code raises a KeyError exception: addr_list_formatted = [] addr_list_idx = 0 for addr in addr_list: # addr_list is a list addr_list_idx = addr_list_idx + 1 addr_list_formatted.append(""" "{0}" { …
Dor
  • 7,344
  • 4
  • 32
  • 45
86 votes
14 answers

Elegant way to report missing values in a data.frame

Here's a little piece of code I wrote to report variables with missing values from a data frame. I'm trying to think of a more elegant way to do this, one that perhaps returns a data.frame, but I'm stuck: for (Var in names(airquality)) { …
Zach
  • 29,791
  • 35
  • 142
  • 201
84 votes
7 answers

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced

I am trying to convert a csv into numpy array. In the numpy array, I am replacing a few elements with NaN. Then, I wanted to find the indices of the NaN elements in the numpy array. The code is: import pandas as pd import matplotlib.pyplot as…
Thedeadman619
  • 881
  • 1
  • 7
  • 14
78 votes
5 answers

Delete rows with blank values in one particular column

I am working on a large dataset, with some rows with NAs and others with blanks: df <- data.frame(ID = c(1:7), home_pc = c("","CB4 2DT", "NE5 7TH", "BY5 8IB", "DH4 6PB","MP9 7GH","KN4 5GH"), …
KT_1
  • 8,194
  • 15
  • 56
  • 68
75 votes
9 answers

Format string unused named arguments

Let's say I have: action = '{bond}, {james} {bond}'.format(bond='bond', james='james') this will output: 'bond, james bond' Next we have: action = '{bond}, {james} {bond}'.format(bond='bond') this will output: KeyError: 'james' Is there some…
nelsonvarela
  • 2,310
  • 7
  • 27
  • 43
71 votes
6 answers

Python, Pandas : Return only those rows which have missing values

While working in Pandas in Python... I'm working with a dataset that contains some missing values, and I'd like to return a dataframe which contains only those rows which have missing data. Is there a nice way to do this? (My current method to do…
user2487726
67 votes
14 answers

Replace missing values with column mean

I am not sure how to loop over each column to replace the NA values with the column mean. When I am trying to replace for one column using the following, it works well. Column1[is.na(Column1)] <- round(mean(Column1, na.rm = TRUE)) The code for…
Nikita
  • 907
  • 2
  • 11
  • 14
67 votes
1 answer

Include levels of zero count in result of table()

I have a vector 'y' and I count the different values using table: y <- c(0, 0, 1, 3, 4, 4) table(y) # y # 0 1 3 4 # 2 1 1 2 However, I also want the result to include the fact that there are zero 2's and zero 5's. Can I use table() for…
Christopher DuBois
  • 42,350
  • 23
  • 71
  • 93
60 votes
3 answers

What is the difference between <NA> and NA?

I have a factor named SMOKE with levels "Y" and "N". Missing values were replaced with NA (from the initial level "NULL"). However when I view the factor I get something like this: head(SMOKE) # N N Y Y N # Levels: Y N Why is R displaying NA…
oort
  • 1,840
  • 2
  • 20
  • 29
59 votes
9 answers

Insert rows for missing dates/times

I am new to R but have turned to it to solve a problem with a large data set I am trying to process. Currently I have 4 columns of data (Y values) set against minute-interval timestamps (month/day/year hour:min) (X values) as below: timestamp …
James A
  • 655
  • 2
  • 7
  • 8
57 votes
5 answers

Replace NA with previous or next value, by group, using dplyr

I have a data frame which is arranged by descending order of date. ps1 = data.frame(userID = c(21,21,21,22,22,22,23,23,23), color = c(NA,'blue','red','blue',NA,NA,'red',NA,'gold'), age =…
Tarak
  • 1,035
  • 2
  • 8
  • 14
50 votes
10 answers

How do I get a summary count of missing/NaN data by column in 'pandas'?

In R I can quickly see a count of missing data using the summary command, but the equivalent pandas DataFrame method, describe does not report these values. I gather I can do something like len(mydata.index) - mydata.count() to compute the number…
orome
  • 45,163
  • 57
  • 202
  • 418
49 votes
3 answers

Convert NA into a factor level

I have a vector with NA values that I would like to replace by a new factor level NA. a = as.factor(as.character(c(1, 1, 2, 2, 3, NA))) a [1] 1 1 2 2 3 <NA> Levels: 1 2 3 This works, but it seems like a strange way to do it. a =…
marbel
  • 7,560
  • 6
  • 49
  • 68