3

enter image description hereI have merged two data frames using bind_rows. I have a situation where I have two rows of data as for example below:

Page Path                           Page Title             Byline      Pageviews 
/facilities/when-lighting-strikes      NA                    NA           668
/facilities/when-lighting-strikes   When Lighting Strikes  Tom Jones       NA

When I have these type of duplicate page paths I'd like to merge the identical page paths, eliminate the two NA's in the first row keeping the page title (When Lighting Strikes) and Byline (Tom Jones) and then keep the pageviews result of 668 from the first row. Somehow it seems that I need

  1. to identify the duplicate pages paths
  2. look to see if there are different titles and bylines; remove NAs
  3. keep the row with the pageview result; remove the NA row

Is there a way I can do this in R dplyr? Or is there a better way?

  • Here's a related Q&A: https://stackoverflow.com/questions/28509462/how-to-collapse-many-records-into-one-while-removing-na-values – talat May 31 '16 at 14:35

5 Answers5

3

A simple solution:

library(dplyr)

df %>% group_by(PagePath) %>% summarise_each(funs(na.omit))
# Source: local data frame [1 x 4]
# 
#                            PagePath             PageTitle    Byline Pageviews
#                              (fctr)                (fctr)    (fctr)     (int)
# 1 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones       668

If your data is more complicated, you may need a more robust approach.


Data

df <- structure(list(PagePath = structure(c(1L, 1L), .Label = "/facilities/when-lighting-strikes", class = "factor"), 
        PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"), 
        Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"), 
        Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle", 
    "Byline", "Pageviews"), class = "data.frame", row.names = c(NA, 
    -2L))
alistaire
  • 42,459
  • 4
  • 77
  • 117
  • 1
    I should have noted that my data contains multiple rows. I'm getting a message about "expecting a single value" when I attempt to use summarise_each(funs(na.omit)). – Deniz Mehmed May 31 '16 at 19:56
  • If you post a representative subset of data, I might be able to help. If it's just that you've got some rows actually filled in correctly already, you could wrap in `unique` to collapse: `summarise_each(funs(unique(na.omit(.))))`. Otherwise you may need to group by another variable if you're shooting for more than one result row for each group. – alistaire May 31 '16 at 20:58
  • I've posted a more respresentative data sample to the original question and have highlighted two rows. So for instance I would like to merge the two identical pagePaths /facilities/management-operators/when-lighting-strikes. In merging I would like to preserve the pageTitle and byline from the second row and the pageviews from the top row. – Deniz Mehmed Jun 01 '16 at 16:34
  • Don't post images of data, because other people can't import them, instead [post the results of `dput(head(df))`](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). It looks like it's the other columns you didn't initially show that are causing you problems, though; the `unique` approach above should work fine. – alistaire Jun 01 '16 at 16:53
2
Use replace function in for loop

for(i in unique(df$Page_Path)){
  df$Pageviews[df$Page_Path==i] <- replace(df$Pageviews[df$Page_Path==i],is.na(df$Pageviews[df$Page_Path==i]),
    df$Pageviews[!is.na(df$Pageviews[df$Page_Path==i])])
}

df <- subset(df, !is.na(Page_Title))

print(df)

                          Page_Path            Page_Title    Byline Pageviews
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones       668
Arun kumar mahesh
  • 2,289
  • 2
  • 14
  • 22
0

Another way to do this (similar to a previous solutions that uses dplyr) would be:

  df %>% group_by(PagePath) %>% 
  dplyr::summarize(PageTitle = paste(na.omit(PageTitle)),
                   Byline = paste(na.omit(Byline)),
                   Pageviews =paste(na.omit(Pageviews)))
eminik
  • 145
  • 7
  • Thank you. I've tried the below but when I attempt to view I get an error: expecting a single value. I've tried the below using dplyr:: rename(byline = dimension2) %>% dplyr::rename(Site = profileName)%>% group_by(pagePath)%>%dplyr::summarize(pageTitle = paste(na.omit(pageTitle)),byline = paste(na.omit(byline)),pageviews=paste(na.omit(pageviews))). I should mention that this is Google Analytics data. – Deniz Mehmed May 31 '16 at 18:42
  • Can you provide some example data for which you get this error? It would be easier to solve this if we can reproduce the error. – eminik Jun 01 '16 at 11:08
0

Here is an option using data.table and complete.cases. We convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'PathPath', loop through the columns of the dataset (lapply(.SD, ..) and remove the NA elements with complete.cases. The complete.cases returns a logical vector and can be used for subsetting. According to this, complete.cases usage is much more faster than na.omit and coupled with data.table it would increase the efficiency.

library(data.table)
setDT(df)[, lapply(.SD, function(x) x[complete.cases(x)]), by = PagePath]
#                     PagePath             PageTitle    Byline Pageviews
#1: /facilities/when-lighting-strikes When Lighting Strikes Tom Jones       668

data

df <- structure(list(PagePath = structure(c(1L, 1L), 
 .Label = "/facilities/when-lighting-strikes", class = "factor"),   
    PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"), 
    Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"), 
    Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle", 
"Byline", "Pageviews"), class = "data.frame", row.names = c(NA, 
-2L))
Community
  • 1
  • 1
akrun
  • 874,273
  • 37
  • 540
  • 662
0

An alternative approach using fill. Using tidyverse 1.3.0+ with dplyr 0.8.5+, you can use fill to fill in missing values.

See this for more information https://tidyr.tidyverse.org/reference/fill.html

DATA Thanks Alistaire

df <- structure(list(PagePath = structure(c(1L, 1L), .Label = "/facilities/when-lighting-strikes", class = "factor"), 
        PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"), 
        Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"), 
        Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle", 
    "Byline", "Pageviews"), class = "data.frame", row.names = c(NA, 
    -2L))

# A tibble: 2 x 4
# Groups:   PagePath [1]
  PagePath                          PageTitle             Byline    Pageviews
  <fct>                             <fct>                 <fct>         <int>
1 /facilities/when-lighting-strikes NA                    NA              668
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones        NA

CODE

I just did this for PageTitle but you can repeat fill to do it for other columns. (dplyr gurus might have a smarter way to do all 3 columns at once). If you have ordered data like dates, then you can set .direction to be just down for example (look at past data).

df.new <- df %>% group_by(PagePath) 
             %>% fill(PageTitle, .direction = "updown")

which gives you

# A tibble: 2 x 4
# Groups:   PagePath [1]
  PagePath                          PageTitle             Byline    Pageviews
  <fct>                             <fct>                 <fct>         <int>
1 /facilities/when-lighting-strikes When Lighting Strikes NA              668
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones        NA

Once you have all the NAs cleaned up then you can use distinct or rank to get your final summarised dataframe.

micstr
  • 5,080
  • 8
  • 48
  • 76