I am cleaning data for a sentiment analysis, working with a large dataset of news articles stored in a data frame. I need one article per row, so I am looking for a way to remove the line breaks between the first '======' and the second '======', repeated throughout the entire data frame. After the content has collapsed into a single row, I would like the publisher and date columns to remain.

df <-  matrix(c("======","NA","NA","Daily Bugle Dec 31","Daily Bugle", "Dec 31" ,"Wookies are","NA","NA",". recreationally", "NA","NA", "using drugs at a", "NA", "NA", "higher rate than", "NA", "NA","ever before.", "NA", "NA","======", "NA", "NA" ),ncol=3,byrow=TRUE)
colnames(df) <- c("content","publisher","date")
df <- as.data.frame(df)
df[ df == "NA" ] <- NA

Gives this:

content              publisher   date
======               <NA>        <NA>
Daily Bugle Dec 31   Daily Bugle Dec 31
Wookies are          <NA>        <NA>
. recreationally     <NA>        <NA>
using drugs at a     <NA>        <NA>
higher rate than     <NA>        <NA>
ever before.         <NA>        <NA>
======               <NA>        <NA>

I would like something like this:

content                                           publisher     date
======
Wookies are recreationally using drugs at a hig... Daily Bugle Dec 31           
======
Article 2
======
Article 3
======

Hope this was clear. I am relatively new to R.

  • You could improve your chances of finding help here by adding a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). Adding an MRE and an example of the desired output (in code form, not tables and pictures) makes it much easier for others to find and test an answer to your question. That way you can help others to help you! P.S. Here is [a good overview on how to ask a good question](https://stackoverflow.com/help/how-to-ask) – dario Oct 09 '21 at 12:36
  • Thanks for the tip, dario! I'm new to Stack Overflow, so all help is appreciated. I'll edit this into a better version of the question. – Joikakake Oct 09 '21 at 12:48
  • At some point you'll want to `gsub('[\\.]', '', df1$content)` as '.' won't add much to sentiment analysis. – Chris Oct 09 '21 at 15:15
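
A minimal sketch of the punctuation-stripping step suggested in the last comment. The vector `content` below is a hypothetical stand-in for the collapsed article text, and using `fixed = TRUE` instead of a regular expression is a choice made here, not something the comment prescribes.

```r
# Hypothetical collapsed article text; not taken from the real dataset.
content <- c("Wookies are . recreationally using drugs.", "No change here")

# fixed = TRUE treats "." as a literal character rather than a regex wildcard.
no_periods <- gsub(".", "", content, fixed = TRUE)

# Removing a free-standing period can leave a double space,
# so squeeze runs of spaces and trim the ends.
cleaned <- trimws(gsub(" +", " ", no_periods))
cleaned
```

Whether periods should be dropped at all depends on the sentiment tool used afterwards; some tokenizers rely on sentence boundaries.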

2 Answers

3
  • Every article starts with '===', so a running count of those lines can be used as an article number.
  • Drop the first value of content for each article (the title line).
  • Keep the first value of publisher and date.
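library(dplyr)

df %>%
  # Running count of separator lines gives each article its own number
  mutate(article_no = cumsum(grepl('===', content))) %>%
  # Drop the separator lines themselves
  filter(!grepl('===', content)) %>%
  group_by(article_no) %>%
  # content[-1] skips the title line; publisher and date come from that same line
  summarise(content = paste0(content[-1], collapse = ''), 
            publisher = publisher[1], 
            date = date[1])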

#  article_no content                                                                 publisher   date  
#       <int> <chr>                                                                   <chr>       <chr> 
#1          1 Wookies are. recreationallyusing drugs at ahigher rate thanever before. Daily Bugle Dec 31
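
In the output above the lines are joined without a separator, so words at line boundaries run together ("recreationallyusing"). A hedged variant of the same pipeline that joins with a single space is sketched below; the small `df` is a shortened, hypothetical version of the question's data, and `fixed = TRUE` is added only to make the separator match literal.

```r
library(dplyr)

# Shortened toy data in the question's layout: a "======" line opens
# and closes the article, and the title line carries publisher and date.
df <- data.frame(
  content   = c("======", "Daily Bugle Dec 31", "Wookies are", "using drugs.", "======"),
  publisher = c(NA, "Daily Bugle", NA, NA, NA),
  date      = c(NA, "Dec 31", NA, NA, NA)
)

result <- df %>%
  mutate(article_no = cumsum(grepl("===", content, fixed = TRUE))) %>%
  filter(!grepl("===", content, fixed = TRUE)) %>%
  group_by(article_no) %>%
  # paste() with collapse = " " keeps a space between the original lines
  summarise(content   = paste(content[-1], collapse = " "),
            publisher = publisher[1],
            date      = date[1])

result
```

With a space as the separator, the text stays tokenizable for the later sentiment step.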
Ronak Shah
  • For the later sentiment analysis, shouldn't `summarise` use `collapse = ' '`, i.e. a space? – Chris Oct 09 '21 at 14:46
  • I love this site! I found this to be a more efficient way of solving the problem than the one above, but both work great. Thanks so much! You and Marek are both invited to my future wedding. – Joikakake Oct 09 '21 at 22:43
1

To help you, first I need to prepare some data.

library(tidyverse)
articles = read.table(
  header = TRUE,sep = ",",text="
content,publisher,date
======,NA,NA
Daily News Dec 27,Daily News,Dec 27
Wookies are,NA,NA
. recreationally,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
ever before.,NA,NA
ever before.,NA,NA
ever before.,NA,NA
======,NA,NA
======,NA,NA
Daily News Dec 28,Daily News,Dec 28
Wookies are,NA,NA
. recreationally,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
ever before.,NA,NA
ever before.,NA,NA
ever before.,NA,NA
======,NA,NA
======,NA,NA
Daily News Dec 30,Daily News,Dec 30
Wookies are,NA,NA
. recreationally,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
ever before.,NA,NA
======,NA,NA
======,NA,NA
Daily Bugle Dec 31,Daily Bugle,Dec 31
Wookies are,NA,NA
. recreationally,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
======,NA,NA
======,NA,NA
Weekly News Dec 31,Weekly News,Dec 31
Wookies are,NA,NA
. recreationally,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
======,NA,NA") %>%
  # read.table() already reads the bare NA fields as missing (na.strings = "NA"),
  # so publisher and date need no extra recoding
  as_tibble()
articles

output

# A tibble: 52 x 3
   content           publisher  date  
   <chr>             <chr>      <chr> 
 1 ======            NA         NA    
 2 Daily News Dec 27 Daily News Dec 27
 3 Wookies are       NA         NA    
 4 . recreationally  NA         NA    
 5 using drugs at a  NA         NA    
 6 higher rate than  NA         NA    
 7 using drugs at a  NA         NA    
 8 higher rate than  NA         NA    
 9 using drugs at a  NA         NA    
10 higher rate than  NA         NA    
# ... with 42 more rows

I hope this matches your data format. As I read it, these are five articles.

Now let's add a conversion function and a simple mutate step.

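# Turn one article's rows into a single row: row 1 is the opening '======',
# row 2 the title line (with publisher and date), the last row the closing '======'
fConvert = function(data) tibble(
  publisher = data$publisher[2],
  date = data$date[2],
  content = data %>% slice(3:(nrow(.)-1)) %>% 
    pull(content) %>% paste(collapse = " ")
)

articles %>% mutate(
  # Count title lines, then shift up one row so the opening '======' joins its article
  idArticle = ifelse(!is.na(publisher),1, 0) %>% 
    cumsum() %>% lead(default=.[length(.)]) 
) %>% group_by(idArticle) %>% 
  nest() %>% 
  group_modify(~fConvert(.x$data[[1]]))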

output

# A tibble: 5 x 4
# Groups:   idArticle [5]
  idArticle publisher   date   content                                                                                            
      <dbl> <chr>       <chr>  <chr>                                                                                              
1         1 Daily News  Dec 27 Wookies are . recreationally using drugs at a higher rate than using drugs at a higher rate than u~
2         2 Daily News  Dec 28 Wookies are . recreationally using drugs at a higher rate than ever before. ever before. ever befo~
3         3 Daily News  Dec 30 Wookies are . recreationally using drugs at a higher rate than ever before. ever before.           
4         4 Daily Bugle Dec 31 Wookies are . recreationally using drugs at a higher rate than ever before.                        
5         5 Weekly News Dec 31 Wookies are . recreationally higher rate than ever before.     

As you can see, I was able to extract five articles, despite their different lengths, and join each article's lines into a single content string. Hope that's what you meant.
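
The grouping step used above can be looked at in isolation. This is a small sketch on a hypothetical `publisher` vector for two short articles, assuming, as in the sample data, that only the title line has a non-missing publisher.

```r
library(dplyr)

# Hypothetical publisher column for two short articles; NA marks every
# row that is not a title line. Row layout per article:
# opening "======", title, text, closing "======".
publisher <- c(NA, "Daily News", NA, NA, NA, "Daily Bugle", NA, NA)

# Running count of title lines seen so far.
started <- cumsum(!is.na(publisher))

# Shift the count up by one row so each opening "======" (the row just
# before a title) gets the id of the article it opens; the last row
# keeps the final id.
id <- lead(started, default = started[length(started)])
id
```

Without the `lead()` shift, each opening separator would be grouped with the previous article instead of its own.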

Marek Fiołka
  • Thank you so much for the help! I tried this and it worked great! I would upvote your answer, but apparently you need at least 15 reputation to be able to do that. – Joikakake Oct 09 '21 at 22:41
  • Welcome to Stack Overflow! I'm glad I could help. I understand: at the beginning, a lot of things about Stack Overflow itself can be confusing. I started only a few months ago myself and remember well how confused I was then. Reputation points, reply comments, flags, badges, and so on; it is easy to get them wrong. As for the "This answer is useful" vote, that is not up to me: you really do need 15 reputation points. – Marek Fiołka Oct 10 '21 at 20:03
  • However, you can always change your mind and mark a different answer as accepted. You don't need 15 reputation points for this. Of course, I'm not trying to push you into anything. Decide for yourself which answer is clearest and most useful to you. If you want to gain reputation points quickly, try another site, for example [Cross Validated](https://stats.stackexchange.com/) or any other Stack Exchange community. – Marek Fiołka Oct 10 '21 at 20:04