How to merge data frame columns based on value in R

Question

I am trying to perform some analysis on the Online New Popularity dataset from the UCI Open Data Repo here: https://archive.ics.uci.edu/ml/datasets/online+news+popularity

The dataset has a set of 7 boolean attributes that denote the day of the week that the article was published on. For example, the column weekday_is_monday will have the value 1 if the article was published on a Monday and so on. For my analysis, I am trying to merge these fields into a single field that contains the string literal of publishing day.

So I load this dataset then go through and replace each true value with the string literal:

news <- read.csv("path_to_my_dataset",
                 header=TRUE,
                 sep=",",
                 fill=F,
                 strip.white = T,
                 stringsAsFactors=FALSE)
news$weekday_is_monday <- gsub('^1', 'Monday', news$weekday_is_monday)
news$weekday_is_tuesday <- gsub('^1', 'Tuesday', news$weekday_is_tuesday)
news$weekday_is_wednesday <- gsub('^1', 'Wednesday', news$weekday_is_wednesday)
news$weekday_is_thursday <- gsub('^1', 'Thusday', news$weekday_is_thursday)
news$weekday_is_friday <- gsub('^1', 'Friday', news$weekday_is_friday)
news$weekday_is_saturday <- gsub('^1', 'Saturday', news$weekday_is_saturday)
news$weekday_is_sunday <- gsub('^1', 'Sunday', news$weekday_is_sunday)

Next I found a solution in this thread that used the dpyler::coalesce function to merge all the fields. I adapted this to my dataset as follows:

news <- news %>% mutate_at(vars(starts_with("weekday_is")), funs(na_if(.,"0"))) %>%
  mutate(news, publishing_day = coalesce(weekday_is_monday, weekday_is_tuesday, weekday_is_wednesday, weekday_is_thursday,
                                   weekday_is_friday, weekday_is_saturday, weekday_is_sunday))
news$publishing_day <- as.factor(news$publishing_day)
summary(news$publishing_day)

However, this only merges the fields from the first column (i.e. Monday):

     0 Monday 
 32983   6661

Where am I going wrong here?

score 2 · Answer 1 · answered Mar 13 '22 at 00:44

Here's one technique that uses reshaping pivot_longer, then gsub the literal out of the weekday_ column, then joining it back in.

quux <- read.csv("OnlineNewsPopularity.csv") # from your link

library(dplyr)
library(tidyr) # pivot_longer
quux2 <- quux %>%
  select(url, starts_with("weekday_is")) %>%
  pivot_longer(-url) %>%
  dplyr::filter(value > 0) %>%
  mutate(weekday = gsub("weekday_is_", "", name)) %>%
  left_join(quux, by = "url") %>%
  select(-name, -starts_with("weekday_is_"))
quux2
# # A tibble: 39,644 x 56
#    url          value weekday timedelta n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_uniq~ num_hrefs
#    <chr>        <dbl> <chr>       <dbl>          <dbl>            <dbl>           <dbl>            <dbl>            <dbl>     <dbl>
#  1 http://mash~     1 monday        731             12              219           0.664             1.00            0.815         4
#  2 http://mash~     1 monday        731              9              255           0.605             1.00            0.792         3
#  3 http://mash~     1 monday        731              9              211           0.575             1.00            0.664         3
#  4 http://mash~     1 monday        731              9              531           0.504             1.00            0.666         9
#  5 http://mash~     1 monday        731             13             1072           0.416             1.00            0.541        19
#  6 http://mash~     1 monday        731             10              370           0.560             1.00            0.698         2
#  7 http://mash~     1 monday        731              8              960           0.418             1.00            0.550        21
#  8 http://mash~     1 monday        731             12              989           0.434             1.00            0.572        20
#  9 http://mash~     1 monday        731             11               97           0.670             1.00            0.837         2
# 10 http://mash~     1 monday        731             10              231           0.636             1.00            0.797         4
# # ... with 39,634 more rows, and 46 more variables: num_self_hrefs <dbl>, num_imgs <dbl>, num_videos <dbl>,
# #   average_token_length <dbl>, num_keywords <dbl>, data_channel_is_lifestyle <dbl>, data_channel_is_entertainment <dbl>,
# #   data_channel_is_bus <dbl>, data_channel_is_socmed <dbl>, data_channel_is_tech <dbl>, data_channel_is_world <dbl>,
# #   kw_min_min <dbl>, kw_max_min <dbl>, kw_avg_min <dbl>, kw_min_max <dbl>, kw_max_max <dbl>, kw_avg_max <dbl>, kw_min_avg <dbl>,
# #   kw_max_avg <dbl>, kw_avg_avg <dbl>, self_reference_min_shares <dbl>, self_reference_max_shares <dbl>,
# #   self_reference_avg_sharess <dbl>, is_weekend <dbl>, LDA_00 <dbl>, LDA_01 <dbl>, LDA_02 <dbl>, LDA_03 <dbl>, LDA_04 <dbl>,
# #   global_subjectivity <dbl>, global_sentiment_polarity <dbl>, global_rate_positive_words <dbl>, ...

Proof of contents:

table(quux2$weekday)
#    friday    monday  saturday    sunday  thursday   tuesday wednesday 
#      5701      6661      2453      2737      7267      7390      7435

You might consider converting that into a factor if you intend to ever arrange on the weekday, since otherwise it'll sort them lexicographically (as shown above).

... %>%
  mutate(weekday = factor(weekday, levels = c("monday", "tuesday", "wednesday", ..., "sunday")))

FYI, the only reason I assigned the pipe from quux into a new variable quux2 was so that during your testing and evaluation of this, you would not inadvertently irreversibly overwrite your main dataset. Feel free to overwrite back onto itself.

The accepted answer to this post found the issue in my code, but this is probably a more reliable (and definitely tidier) method of doing it than my mess! Thanks for the answer, I appreciate it! — AgentNo, Mar 13 '22 at 00:52
No worries, different approaches help viewing the problem differently. Glad you got the fix you needed! — r2evans, Mar 13 '22 at 00:54

score 1 · Accepted Answer · answered Mar 13 '22 at 00:48

In the pipe operation, do not repeat inputting the data on the left hand side of mutate into the right hand side of it. That is the cause of your problem. You just need to remove news in mutate(news, publishing_day = ...)

news <- news %>% mutate_at(vars(starts_with("weekday_is")), funs(na_if(.,"0"))) %>%
  mutate(publishing_day = coalesce(weekday_is_monday, weekday_is_tuesday, weekday_is_wednesday, weekday_is_thursday,
                                         weekday_is_friday, weekday_is_saturday, weekday_is_sunday))
news$publishing_day <- as.factor(news$publishing_day)
summary(news$publishing_day)
#  Friday    Monday  Saturday    Sunday   Thusday   Tuesday Wednesday 
#    5701      6661      2453      2737      7267      7390      7435

How to merge data frame columns based on value in R

2 Answers2