
I have two datasets in R. The first is a dump from Google Analytics with pageviews/sessions/users data, and the second is a CMS metadata export with article IDs, author names, published dates, etc., plus a cost per article.

The first one looks something like this:

Summarize numeric variables:
                      n_obs n_missing n_distinct          mean median   min                max   p25    p75              sd            se
            sessions 10,000         0        151     7,433.648  1.000 0.000     74,116,646.000 1.000  1.000     741,166.356     7,411.664
           pageviews 10,000         0        198    11,409.880  1.000 0.000    113,787,288.000 1.000  2.000   1,137,872.716    11,378.727
               users 10,000         0        179     9,579.513  1.000 1.000     95,541,309.000 1.000  1.000     955,412.937     9,554.129
             bounces 10,000         0         85     4,404.562  0.000 0.000     43,970,642.000 0.000  0.000     439,706.387     4,397.064
           entrances 10,000         0        151     7,418.493  1.000 0.000     73,966,090.000 1.000  1.000     739,660.797     7,396.608
 pageviewsPerSession 10,000         0        357         1.207  1.000 0.000            102.000 1.000  1.000           1.920         0.019
     sessionDuration 10,000         0      1,282 1,052,179.991  8.000 0.000 10,500,469,474.000 1.000 40.000 105,004,691.642 1,050,046.916

Earliest dates:
 date
 <NA>

Final dates:
 date
 <NA>

Summarize character variables (< 20 unique values shown):
pagePath (n_distinct 10000):  (other) / /?/= /?a= /?co= /?fbclid=IwAR0a9JQDUbU4iViMvLBpCsreeox2l1tCW3pO3fVSfaa1Fq3e_5PkQz77yFs 

The second one looks like this:

Summarize numeric variables:
                 n_obs n_missing n_distinct        mean     median        min         max        p25         p75         sd      se
     ArticleID 115,383         0    115,383 104,641.445 91,149.000 31,224.000 190,569.000 60,119.500 160,555.500 51,530.762 151.704
 CommentsCount 115,383         0        441       5.663      0.000      0.000   1,108.000      0.000       1.000     27.952   0.082
          Cost 115,383         0        165                  0.000      0.000                  0.000       0.000   

Earliest dates:
 PublishedDate
          <NA>

Final dates:
 PublishedDate
          <NA>

Summarize character variables (< 20 values shown):
URL : 
Title :
Origin : 
Author : 
Category : 
Tags : 

After cleaning pagePath and normalizing the URLs, I want to merge the two datasets with an inner join so that only data on articles remains. However, I'm struggling to find a correct way to aggregate the cost data so that the Cost values are not added up again for every date and pageview row that a page has.
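To illustrate the problem, here is a minimal sketch of one possible approach: collapse the analytics rows to one row per page first, then join, so that each article's Cost appears exactly once. The toy data frames `ga` and `cms` and their values are invented for illustration. Only the column names come from the summaries above, and the sketch assumes that pagePath and URL have already been normalized to the same form.

```r
# Toy stand-ins for the two datasets; column names follow the summaries
# above, the values are made up for illustration.
ga <- data.frame(
  date      = as.Date(c("2020-03-01", "2020-03-02", "2020-03-01")),
  pagePath  = c("/article-a", "/article-a", "/article-b"),
  pageviews = c(10, 5, 7),
  sessions  = c(8, 4, 6),
  stringsAsFactors = FALSE
)
cms <- data.frame(
  ArticleID = c(101, 102, 103),
  URL       = c("/article-a", "/article-b", "/article-c"),
  Cost      = c(50, 30, 20),
  stringsAsFactors = FALSE
)

# Collapse the analytics rows to one row per page, summing over dates.
ga_per_page <- aggregate(cbind(pageviews, sessions) ~ pagePath,
                         data = ga, FUN = sum)

# Inner join: merge() keeps only keys present in both tables by default,
# so "/article-c" (no analytics rows) is dropped.
articles <- merge(cms, ga_per_page, by.x = "URL", by.y = "pagePath")

# Each article now has a single row, so summing Cost does not double count.
total_cost <- sum(articles$Cost)
```

If the per-date breakdown has to be kept, one option under the same assumptions is to leave Cost in the article-level table and join it to the daily data only when needed, instead of summing a Cost column that has been repeated on every date row.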

The second task is to build something like a relational table for the tags, which are stored as a comma-separated string for each article. In other words, I want each tag to become a separate dimension.
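What I have in mind for the tags is roughly a long table with one row per article and tag pair. A minimal base R sketch follows, again with invented values. It assumes the tags are separated by commas with optional surrounding spaces, which may not match the real export.

```r
# Toy CMS extract; ArticleID and Tags are column names from the summary
# above, the values are made up for illustration.
cms <- data.frame(
  ArticleID = c(101, 102),
  Tags      = c("politics, economy", "sport"),
  stringsAsFactors = FALSE
)

# Split each comma-separated string into a character vector of tags.
tag_list <- strsplit(cms$Tags, ",", fixed = TRUE)

# One row per (article, tag) pair; trimws() removes the spaces around tags.
article_tags <- data.frame(
  ArticleID = rep(cms$ArticleID, lengths(tag_list)),
  Tag       = trimws(unlist(tag_list)),
  stringsAsFactors = FALSE
)
```

The resulting `article_tags` table can be joined back to the article-level data on ArticleID. Note that in this sketch an article with an empty Tags string contributes no rows. `tidyr::separate_rows()` is an alternative for the same reshaping if the tidyverse is already in use.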

halfer
dan
  • Can you make your question [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? – Conor Apr 03 '20 at 13:38

0 Answers