I am new to this (first question). I have a huge dataset of news articles (available on Kaggle: https://www.kaggle.com/snapcrack/all-the-news) with hundreds or even thousands of articles for each day. The number of articles per day is not consistent.

I need to take a sample of news articles (let's say 20) for each and every day in the dataset, to reduce its size and get a consistent number of articles per day. I then want to use the sample for further predictive analysis along with another dataset.

So my first question is: how can I sample/subset the dataset based on date? I know how to sample a dataset in general, but not how to do so consistently so that I have articles from each day. I guess it will need a function, as the dataset has articles spanning three years, so it will have to be run over that whole period.

Secondly, is it possible to show the sample for each day in a single row, i.e. one article per column?

I am currently using RStudio. Given it's my first post, I cannot post pictures.

data,articles

  • For sampling, `your_data %>% group_by(your_date_column) %>% sample_n(20)`. For your second question, see the [FAQ about transforming data from long to wide](https://stackoverflow.com/q/5890584/903061). – Gregor Thomas Dec 01 '20 at 21:53
  • If you need more help than that, please share a minimal reproducible example - e.g., 5 abbreviated articles from each of 3 days, as well as your attempt based on the linked questions. `dput()` is the nicest way to share sample data because it is copy/pasteable and preserves the class and structure info. – Gregor Thomas Dec 01 '20 at 21:58
  • @GregorThomas Thanks a lot for your response. Really appreciate the help. – Asad Khan Dec 01 '20 at 21:59
  • (And the sampling code is with `library(dplyr)`, forgot to mention that.) – Gregor Thomas Dec 01 '20 at 22:04
  • @GregorThomas Thanks, it worked like a charm. – Asad Khan Dec 01 '20 at 22:18
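
For anyone landing here later, the two steps from the comments (sample per day, then reshape long to wide) could look roughly like the sketch below. It is only an illustration under stated assumptions: the toy `articles` tibble, its `date` and `title` columns, and the `n_per_day` and `slot` names are hypothetical stand-ins, not taken from the Kaggle file. It also swaps `sample_n(20)` for `slice_sample()` (dplyr 1.0.0 or later), because `sample_n()` raises an error when a day has fewer rows than requested, whereas `slice_sample()` just returns all rows for that day.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the real articles:
# three days with uneven article counts (5, 2 and 8).
set.seed(42)
articles <- tibble(
  date  = as.Date(rep(c("2017-01-01", "2017-01-02", "2017-01-03"),
                      times = c(5, 2, 8))),
  title = paste("Article", seq_len(15))
)

n_per_day <- 3  # the question asks for 20; kept small for the toy data

# Step 1: draw up to n_per_day random articles within each day.
sampled <- articles %>%
  group_by(date) %>%
  slice_sample(n = n_per_day) %>%  # a day with fewer rows keeps all of them
  ungroup()

# Step 2: number the sampled articles within each day and spread them
# into columns, giving one row per day and one article per column.
wide <- sampled %>%
  group_by(date) %>%
  mutate(slot = paste0("article_", row_number())) %>%
  ungroup() %>%
  pivot_wider(names_from = slot, values_from = title)
```

Days with fewer than `n_per_day` articles end up with `NA` in the unused columns of `wide`. On the real data, only the column you want spread across the row (here `title`) should go into `values_from`; other per-article columns would need to be dropped or also listed there, otherwise they keep the rows from collapsing to one per day.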
