I have two datasets in R. The first is a dump from Google Analytics with pageviews/sessions/users data, and the second is a CMS metadata export with article IDs, author names, published dates, etc., plus a cost per article.
The first one looks something like this:
Summarize numeric variables:
n_obs n_missing n_distinct mean median min max p25 p75 sd se
sessions 10,000 0 151 7,433.648 1.000 0.000 74,116,646.000 1.000 1.000 741,166.356 7,411.664
pageviews 10,000 0 198 11,409.880 1.000 0.000 113,787,288.000 1.000 2.000 1,137,872.716 11,378.727
users 10,000 0 179 9,579.513 1.000 1.000 95,541,309.000 1.000 1.000 955,412.937 9,554.129
bounces 10,000 0 85 4,404.562 0.000 0.000 43,970,642.000 0.000 0.000 439,706.387 4,397.064
entrances 10,000 0 151 7,418.493 1.000 0.000 73,966,090.000 1.000 1.000 739,660.797 7,396.608
pageviewsPerSession 10,000 0 357 1.207 1.000 0.000 102.000 1.000 1.000 1.920 0.019
sessionDuration 10,000 0 1,282 1,052,179.991 8.000 0.000 10,500,469,474.000 1.000 40.000 105,004,691.642 1,050,046.916
Earliest dates:
date
<NA>
Final dates:
date
<NA>
Summarize character variables (< 20 unique values shown):
pagePath (n_distinct 10000): (other) / /?/= /?a= /?co= /?fbclid=IwAR0a9JQDUbU4iViMvLBpCsreeox2l1tCW3pO3fVSfaa1Fq3e_5PkQz77yFs
The second one looks like this:
Summarize numeric variables:
n_obs n_missing n_distinct mean median min max p25 p75 sd se
ArticleID 115,383 0 115,383 104,641.445 91,149.000 31,224.000 190,569.000 60,119.500 160,555.500 51,530.762 151.704
CommentsCount 115,383 0 441 5.663 0.000 0.000 1,108.000 0.000 1.000 27.952 0.082
Cost 115,383 0 165 0.000 0.000 0.000 0.000
Earliest dates:
PublishedDate
<NA>
Final dates:
PublishedDate
<NA>
Summarize character variables (< 20 values shown):
URL :
Title :
Origin :
Author :
Category :
Tags :
After cleaning pagePath and normalizing the URLs, I want to merge the two datasets with an inner join so that only data on articles remains. However, I'm struggling to find a correct way to aggregate the cost data: Cost is a single value per article, so it should not be added up again for every date and pageview row that exists for a given page.
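To make the problem concrete, here is a minimal sketch of one possible approach: collapse the analytics data to one row per page before joining, so each article's Cost appears exactly once in the result. It assumes dplyr is available, that the cleaned pagePath matches the CMS URL column exactly, and that the join key is URL; the toy values are made up purely for illustration.

```r
library(dplyr)

# Toy stand-ins for the two datasets. Column names follow the summaries
# above; the values are invented for illustration only.
ga <- tibble(
  date      = as.Date(c("2020-01-01", "2020-01-02", "2020-01-01")),
  pagePath  = c("/article-a", "/article-a", "/article-b"),
  sessions  = c(10, 5, 3),
  pageviews = c(12, 7, 4)
)

cms <- tibble(
  ArticleID = c(1, 2, 3),
  URL       = c("/article-a", "/article-b", "/article-c"),
  Cost      = c(100, 50, 75)
)

# Step 1: aggregate the analytics rows to one row per page, summing over dates.
# (Additive metrics only; GA "users" is not additive across days, so summing
# it would overcount unique users.)
ga_per_page <- ga %>%
  group_by(pagePath) %>%
  summarise(across(c(sessions, pageviews), sum), .groups = "drop")

# Step 2: inner join. Each article now matches at most one analytics row,
# so Cost is never repeated.
articles <- inner_join(cms, ga_per_page, by = c("URL" = "pagePath"))
```

If the per-date breakdown has to be kept, an alternative would be to leave Cost in the article-level table and only join it in after aggregating, rather than carrying it on every date row.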
The second thing is to build something like a relational table for the tags, which are stored as a comma-separated string for each article; in other words, to make each tag a separate dimension.
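What I have in mind for the tags is a long "bridge" table with one row per article/tag pair. A rough sketch of that shape, assuming dplyr and tidyr are available and that tags are separated by plain commas (the toy values are made up):

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the CMS export; only the columns needed here.
cms <- tibble(
  ArticleID = c(1, 2),
  Tags      = c("politics, economy", "sports,politics")
)

# One row per (ArticleID, Tag) pair.
article_tags <- cms %>%
  select(ArticleID, Tags) %>%
  separate_rows(Tags, sep = ",") %>%   # split the comma-separated string into rows
  mutate(Tag = trimws(Tags)) %>%       # drop stray whitespace around each tag
  select(ArticleID, Tag) %>%
  filter(Tag != "") %>%                # discard empty tags from stray commas
  distinct()

# Example use: number of articles per tag.
tag_counts <- count(article_tags, Tag, name = "n_articles")
```

Joining this table back to article-level metrics would repeat an article's values once per tag, so per-tag totals of Cost or pageviews would be meaningful within a tag but should not be summed across tags.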