
I am reading a CSV file and unfortunately my data frame has many missing values. A small snippet follows:

dataframe

df <- data.frame(Size= c(800, 850, 1100, 1200, 1000), 
                 Value= c(900, NA, 1300, 1100, NA),
                 Location= c(NA, 'midcity', 'uptown', NA, 'Lakeview'),
                 Num1 = c(2, NA, 3, 2, NA),
                 Num2 = c(2,3,3,1,2),
                 Rent= c('y', 'y', 'n', 'y', 'n'))

I want to predict some of the results using Weka, but I can't do that when multiple attributes are missing. I know that I should be using the function is.na, but I am not sure how, because so far I have only used it for summing and counting.

Edit: For example, in this file I have missing values in 4 of the 5 instances. Instances 2 and 5 share the same missing attributes (B and D, i.e. Value and Num1), while instances 1 and 4 share the same missing attribute as well (C, i.e. Location). What I'd like to get is one data frame per group of such instances, so I can export them to files and run the analysis on each file individually. An example of an output could be

> A

A

> B

B

Edit 2:

I want to save the splits and so far I tried this:

write.csv(split(temp, index), file = "C:/Users/Nikita/Desktop/splits.csv", row.names=FALSE)

But it writes all the splits in one line. Is there a way to separate them by a line?

Edit 3:

My steps are:

data <- read.csv("location")
index <- apply(is.na(data)*1, 1,paste, collapse = "")
s <- split(data, index)
lapply(s, function(x) {names(x) <- names(data);x})
big.data <- do.call(rbind, s)
write.csv(big.data, file = "location", row.names=FALSE)

Am I missing something?

A.J
  • Please explain well your problem and if possible add a reproducible example or at least a desired output. Help users to help you. – SabDeM Jun 15 '15 at 14:10
  • Question unclear http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Pierre L Jun 15 '15 at 14:10

2 Answers

df[!is.na(df$Value), ]
  Size Value Location Num1 Num2 Rent
1  800   900     <NA>    2    2    y
3 1100  1300   uptown    3    3    n
4 1200  1100     <NA>    2    1    y

And

df[is.na(df$Value), ]
  Size Value Location Num1 Num2 Rent
2  850    NA  midcity   NA    3    y
5 1000    NA Lakeview   NA    2    n
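
The two calls above filter on a single column. As a hedged extension that is not part of the original answer, base R's `complete.cases()` checks every column at once. The sketch below assumes the `df` from the question; `complete_rows`, `incomplete_rows` and `na_per_column` are illustrative names.

```r
# Example data from the question
df <- data.frame(Size = c(800, 850, 1100, 1200, 1000),
                 Value = c(900, NA, 1300, 1100, NA),
                 Location = c(NA, 'midcity', 'uptown', NA, 'Lakeview'),
                 Num1 = c(2, NA, 3, 2, NA),
                 Num2 = c(2, 3, 3, 1, 2),
                 Rent = c('y', 'y', 'n', 'y', 'n'))

# TRUE for rows that have no NA in any column
ok <- complete.cases(df)

complete_rows   <- df[ok, ]    # rows without any missing value
incomplete_rows <- df[!ok, ]   # rows with at least one missing value

# Number of missing values per column, using is.na as in the question
na_per_column <- colSums(is.na(df))
```

This only separates complete from incomplete rows; it does not distinguish between different patterns of missing columns.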

In the future, please create a reproducible example so that users do not have to create a data frame by hand from your question. Pictures are not as helpful.

Data

df <- data.frame(Size= c(800, 850, 1100, 1200, 1000), 
                 Value= c(900, NA, 1300, 1100, NA),
                 Location= c(NA, 'midcity', 'uptown', NA, 'Lakeview'),
                 Num1 = c(2, NA, 3, 2, NA),
                 Num2 = c(2,3,3,1,2),
                 Rent= c('y', 'y', 'n', 'y', 'n'))

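Since split creates a list, you can use lapply to write each split to its own file. Each call needs a different file name; with a single file name, every split overwrites the previous one:

s <- split(temp, index)
invisible(lapply(names(s), function(nm) write.csv(s[[nm]], file = paste0("C:/Users/Nikita/Desktop/", nm, "splits.csv"), row.names=FALSE)))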

With a for loop:

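s <- split(temp, index)
for (i in seq_along(s)) {
  # s[[i]] extracts the data frame; s[i] is a one-element list, which adds the split label to every column name
  write.csv(s[[i]], file = paste0("C:/Users/Nikita/Desktop/", i, "splits.csv"), row.names=FALSE)
}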
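
A minimal sketch of how single and double brackets behave on the list returned by split, which appears to be where prefixed column names in the output files come from. It assumes the `df` from the question; `wrapped_names`, `plain_names` and the temporary output path are illustrative, not part of the original answer.

```r
# Example data from the question
df <- data.frame(Size = c(800, 850, 1100, 1200, 1000),
                 Value = c(900, NA, 1300, 1100, NA),
                 Location = c(NA, 'midcity', 'uptown', NA, 'Lakeview'),
                 Num1 = c(2, NA, 3, 2, NA),
                 Num2 = c(2, 3, 3, 1, 2),
                 Rent = c('y', 'y', 'n', 'y', 'n'))

index <- apply(is.na(df) * 1, 1, paste, collapse = "")
s <- split(df, index)

# Single brackets keep the list wrapper: a list of length 1
class(s[1])      # "list"
# Double brackets return the data frame stored in that element
class(s[[1]])    # "data.frame"

# Coercing the one-element list to a data frame prefixes each column
# with the element's name, so the original names are no longer exact
wrapped_names <- names(data.frame(s[1]))
plain_names   <- names(s[[1]])

# Writing the extracted data frame keeps the original header
out_file <- file.path(tempdir(), "split_1.csv")   # hypothetical output path
write.csv(s[[1]], file = out_file, row.names = FALSE)
header <- names(read.csv(out_file))
```

Comparing `wrapped_names` with `plain_names` shows the difference directly.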
Pierre L
  • Thank you for the answer and for the provided code! – A.J Jun 15 '15 at 15:14
  • It saves only the first split. Do I need to use a loop or am I doing something wrong? – A.J Jun 15 '15 at 19:14
  • I forgot that there is one file being fed into the function, so it will try to write all of the splits into that file. Try a for loop. I edited my answer again. – Pierre L Jun 15 '15 at 19:21
  • Worked perfectly! Thank you. – A.J Jun 15 '15 at 20:03
  • Just a quick follow-up question: Is there a way to save the files with the original column names, or it must contain the combination number every single column? – A.J Jun 16 '15 at 14:38
  • try `lapply(s, function(x) {names(x) <- names();x}` – Pierre L Jun 16 '15 at 14:48
  • When I run it `lapply(s, function(x) {names(x) <- names();x}`, the next error appears `Error: unexpected '<' in "lapply(s, function(x) {names(x) <- names(<"`. Syntax problem with the `<` I assume but no idea why. – A.J Jun 16 '15 at 15:15
  • lol User. I put those greater-than and less-than symbols there to tell you to enter the name of the data frame. Don't actually type them. For example, if your original data frame with all the column names that you want was called 'mydf', you would enter `lapply(s, function(x) {names(x) <- names(mydf);x})` – Pierre L Jun 16 '15 at 15:18
  • Well now I feel stupid ha. And it worked but I still don't understand how I transform it into the one big file. I am terrible at this. – A.J Jun 16 '15 at 15:51
  • `big.df <- do.call(rbind, s)` then you can write it to file with `write.csv(big.df)`. Is that what you are referring to ? – Pierre L Jun 16 '15 at 15:58
  • For some reason it doesn't split it but instead saves the whole dataframe. I added my steps as edit 3. – A.J Jun 16 '15 at 17:31
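
One possible reading of the steps in Edit 3, offered as a sketch and not a confirmed diagnosis: `lapply` returns a new list and leaves `s` unchanged unless the result is assigned, and `do.call(rbind, s)` stacks every split back into a single data frame, so the combined file contains all rows again. The example below assumes the question's `df`; the `na_pattern` column and the temporary output path are illustrative additions that record which split each row came from.

```r
# Example data from the question
df <- data.frame(Size = c(800, 850, 1100, 1200, 1000),
                 Value = c(900, NA, 1300, 1100, NA),
                 Location = c(NA, 'midcity', 'uptown', NA, 'Lakeview'),
                 Num1 = c(2, NA, 3, 2, NA),
                 Num2 = c(2, 3, 3, 1, 2),
                 Rent = c('y', 'y', 'n', 'y', 'n'))

index <- apply(is.na(df) * 1, 1, paste, collapse = "")
s <- split(df, index)

# lapply does not modify s in place; assign the result to keep the change
s <- lapply(s, function(x) { names(x) <- names(df); x })

# Tag each split with its NA pattern before stacking (illustrative column)
tagged <- Map(function(x, pattern) cbind(na_pattern = pattern, x), s, names(s))

# rbind puts all splits back together, ordered by pattern
big.df <- do.call(rbind, tagged)
rownames(big.df) <- NULL

# Hypothetical output location; replace with a real path
write.csv(big.df, file = file.path(tempdir(), "all_splits.csv"), row.names = FALSE)
```

If separate files are the goal, writing each element of `s` in a loop avoids the `rbind` step entirely.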

Recreating your example data:

df <- data.frame(Size= c(800, 850, 1100, 1200, 1000), 
                 Value= c(900, NA, 1300, 1100, NA),
                 Location= c(NA, 'midcity', 'uptown', NA, 'Lakeview'),
                 Num1 = c(2, NA, 3, 2, NA),
                 Num2 = c(2,3,3,1,2),
                 Rent= c('y', 'y', 'n', 'y', 'n'))

Now, splitting your data according to the pattern of NA as you want:

# This builds one pattern string per row: 1 where the column is NA, 0 otherwise
index <- apply(is.na(df)*1, 1, paste, collapse = "")

# This splits the data.frame according to the index
split(df, index)
$`000000`
  Size Value Location Num1 Num2 Rent
3 1100  1300   uptown    3    3    n

$`001000`
  Size Value Location Num1 Num2 Rent
1  800   900     <NA>    2    2    y
4 1200  1100     <NA>    2    1    y

$`010100`
  Size Value Location Num1 Num2 Rent
2  850    NA  midcity   NA    3    y
5 1000    NA Lakeview   NA    2    n

Notice that the first element, "000000", holds all the complete cases (observations with no missing values). Then "001000" holds all observations where only column 3 (Location) is missing, "010100" holds those where columns 2 and 4 (Value and Num1) are missing, and so on.
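
To make the pattern labels easier to read, the sketch below translates each label back into the names of the missing columns. It assumes the `df` and `index` defined above; `group_sizes` and `missing_cols` are illustrative names, not part of the original answer.

```r
# Example data from the question
df <- data.frame(Size = c(800, 850, 1100, 1200, 1000),
                 Value = c(900, NA, 1300, 1100, NA),
                 Location = c(NA, 'midcity', 'uptown', NA, 'Lakeview'),
                 Num1 = c(2, NA, 3, 2, NA),
                 Num2 = c(2, 3, 3, 1, 2),
                 Rent = c('y', 'y', 'n', 'y', 'n'))

index <- apply(is.na(df) * 1, 1, paste, collapse = "")
s <- split(df, index)

# Number of rows in each NA pattern
group_sizes <- sapply(s, nrow)

# Translate each pattern label into the names of the columns that are NA
missing_cols <- lapply(names(s), function(pattern) {
  flags <- strsplit(pattern, "")[[1]] == "1"
  names(df)[flags]
})
names(missing_cols) <- names(s)
```

With many attributes, a lookup like `missing_cols` may be more practical than reading long strings of zeros and ones.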

Carlos Cinelli
  • It works perfectly. Thank you! Another quick question though. This was a really small part of my data. My whole data frame consists of 244 attributes. Is there a way to automatically output the results into a file or files? – A.J Jun 15 '15 at 15:15
  • @User you can save all the results in an object `results <- split(df, index)` and then save the results in csv files `for(i in 1:length(results)) write.csv(results[[i]], file = paste0("C:/Users/Nikita/Desktop/", "splits", i, ".csv"), row.names=FALSE)`. – Carlos Cinelli Jun 15 '15 at 22:19
  • Thank you! I used plafort's approach because he answered earlier but thanks anyway. A quick question: is there a way to save the files with the original column names, or it must contain the combination number every single column? – A.J Jun 16 '15 at 14:39