I would like to count islands along the rows of a .csv. By "islands" I mean runs of consecutive non-blank entries in a row of the .csv. Three consecutive non-blank entries in a row should be counted as 1 island. Anything less than three consecutive entries in a row counts as 1 "non-island". I would then like to write the output to a data frame:

Name,,,,,,,,,,,,,
Michael,,,1,1,1,,,,,,,,
Peter,,,,1,1,,,,,,,,,
John,,,,,1,,,,,,,,,

Desired dataframe output:

Name,island,nonisland,
Michael,1,0,
Peter,0,1,
John,0,1,
agrobins

1 Answer

You could use rle like this:

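# Assumes the names are row names and blank cells are NA.
# Run rle() on each row, then count the runs of length 3 or more.
output <- stack(sapply(apply(df, 1, rle), function(x) sum(x$lengths >= 3)))
names(output) <- c("island", "name")

# A row with no island counts as one non-island.
output$nonisland <- 0
output$nonisland[output$island == 0] <- 1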
#  island    name nonisland
#1      1 Michael         0
#2      0   Peter         1
#3      0    John         1

Here, rle is run across each row of your data frame, and the runs with a length of 3 or more are counted. Because rle treats each NA as its own run of length 1, blank cells never form a long run.
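To see what rle returns for a single row, here is a minimal sketch. The vector `row` is made up for illustration, and it assumes blanks are stored as NA:

```r
# Hypothetical row: NA marks a blank cell, 1 marks a filled cell.
row <- c(NA, NA, 1, 1, 1, NA, NA, 1, 1, NA)

runs <- rle(row)

# Each NA is its own run of length 1; the filled cells form runs of 3 and 2.
runs$lengths

# Count the runs of three or more ("islands") in this row.
n_islands <- sum(runs$lengths >= 3)
n_islands
```

Only the run of three filled cells passes the `>= 3` check, so this row would contribute one island.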

Note that this solution assumes all islands are made up of the same value (i.e. all 1's, as in your example). If that is not the case, you would need to convert all the non-empty entries to the same value before rle is appropriate, for example with `df[!is.na(df)] <- 1`.
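As a rough illustration of why that conversion matters, the sketch below uses a made-up data frame with mixed values. The helper name `count_islands` and the row names are assumptions for the example, and it uses a slightly more compact `apply` call than the code above:

```r
# Hypothetical data: filled cells hold different values, NA marks a blank.
df <- data.frame(a = c(NA, 4), b = c(2, NA), c = c(5, 7), d = c(1, 8), e = c(NA, NA),
                 row.names = c("Ann", "Bob"))

# Illustrative helper: islands per row = number of runs of length >= 3.
count_islands <- function(d) apply(d, 1, function(r) sum(rle(r)$lengths >= 3))

# Differing values break every run, so no islands are found yet.
before <- count_islands(df)

# Make every filled cell identical, then count again.
df[!is.na(df)] <- 1
after <- count_islands(df)
```

Before the conversion neither row has an island; afterwards Ann's three adjacent filled cells form one, while Bob's longest run is still only two cells.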

Jota
  • Thanks Frank! You are correct, I need to convert all non-empty entries to the same value. However, this replaces the "names" entries with NA (no Michael, John, Peter). Is there a solution to this? – agrobins Jun 04 '15 at 22:25
  • If you provide the `dput` of your data (e.g. `dput(df)`) that shows this issue, I'll be able to understand what you mean. That said, without seeing what you're talking about, I would guess that you should make the names column into row names, either upon import (see `?read.csv` and the `row.names` argument) or after. – Jota Jun 04 '15 at 22:39
  • Figured it out from the `read.csv` man page. Thank you so much, this was very helpful! – agrobins Jun 04 '15 at 23:03
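For reference, a minimal sketch of the row-names-on-import approach suggested in the comments. The CSV text, the column names `c1` to `c6`, and the output column layout are assumptions made for the example, not the asker's actual file:

```r
# Hypothetical CSV text mirroring the question; blank cells are read as NA.
csv_text <- "Name,c1,c2,c3,c4,c5,c6
Michael,,,1,1,1,
Peter,,,,1,1,
John,,,,,1,"

# row.names = 1 turns the first column into row names instead of data.
df <- read.csv(text = csv_text, row.names = 1)

# The names are no longer cells of df, so this conversion leaves them alone.
df[!is.na(df)] <- 1

# Count runs of three or more filled cells in each row.
island <- apply(df, 1, function(r) sum(rle(r)$lengths >= 3))

# Following the answer, a row without an island counts as one non-island.
output <- data.frame(Name = rownames(df),
                     island = unname(island),
                     nonisland = as.integer(island == 0))
output
```

With a real file, the same call would take a file path in place of the `text` argument.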