-1

I have an ordered dataframe with many variables, and am looking to extract the data from all columns associated with the longest sequence of non-NA rows for one particular column. Is there an easy way to do this? I have tried the na.contiguous() function but my data is not formatted as a time series.

My intuition is to create a running counter which determines whether a row has NA or not, and then will determine the count for the number of consecutive rows without an NA. I would then put this in an if statement to keep restarting every time an NA is encountered, outputting a dataframe with the lengths of every sequence of non-NAs, which I could use to find the longest such sequence. This seems very inefficient so I'm wondering if there is a better way!

jpsmith
  • 11,023
  • 5
  • 15
  • 36
  • 1
    Welcome to SO, bobmahone13! This sounds like run-length encoding using R's [`rle` function](https://stat.ethz.ch/R-manual/R-devel/library/base/html/rle.html). If you need more than that, please make this question *reproducible*. This includes sample code you've attempted (including listing non-base R packages, and any errors/warnings received), sample *unambiguous* data (e.g., `data.frame(x=...,y=...)` or the output from `dput(head(x))`), and intended output given that input. Refs: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. – r2evans Feb 07 '23 at 20:50
  • Please provide enough code so others can better understand or reproduce the problem. – Community Feb 07 '23 at 21:10

1 Answers1

0

If I understand this phrase correctly:

[I] am looking to extract the data from all columns associated with the longest sequence of non-NA rows for one particular column

You have a column of interest, call it WANT, and are looking to isolate all columns from the single row of data with the highest consecutive non-NA values in WANT.

Example data

df <- data.frame(A = LETTERS[1:10],
                 B = LETTERS[1:10],
                 C = LETTERS[1:10],
                 WANT = LETTERS[1:10],
                 E = LETTERS[1:10])

set.seed(123)
df[sample(1:nrow(df), 2), 4] <- NA

#   A B C WANT E
#1  A A A    A A
#2  B B B    B B
#3  C C C <NA> C
#4  D D D    D D
#5  E E E    E E
#6  F F F    F F
#7  G G G    G G
#8  H H H    H H
#9  I I I    I I # want to isolate this row (#9) since most non-NA in WANT
#10 J J J <NA> J

Here you would want all I values as it is the row with the longest running non-NA values in WANT.

If my interpretation of your question is correct, we can extend the excellent answer found here to your situation. This creates a data frame with a running tally of consecutive non-NA values for each column.

The benefit of using this is that it will count consecutive non-NA runs across all columns (of any type, ie character, numeric), then you can index on whatever column you want using which.max()

# from @jay.sf at https://stackoverflow.com/questions/61841400/count-consecutive-non-na-items
res <- as.data.frame(lapply(lapply(df, is.na), function(x) {
  r <- rle(x)
  s <- sapply(r$lengths, seq_len)
  s[r$values] <- lapply(s[r$values], `*`, 0)
  unlist(s)
}))

# index using which.max()
want_data <- df[which.max(res$WANT), ]

#> want_data
#  A B C WANT E
#9 I I I    I I

If this isn't correct, please edit your question for clarity.

jpsmith
  • 11,023
  • 5
  • 15
  • 36