
Wikipedia provides all its page views in hourly text files. (See for instance http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-01/)

For a project I need to extract keywords and their associated page views for the year 2014. But each file covers only one hour (so 24*365 files in total) and is ~80 MB, which makes this a hard task to do manually.

My questions: 1. Is there any way to download the files automatically? (The files are named in a structured way, so this could be helpful.)

Brian Tompsett - 汤莱恩

2 Answers


Download? Sure, that's easy:

wget -r -np http://dumps.wikimedia.org/other/pagecounts-raw/

Recursive wget does it. Note, these files are deprecated now; you probably want to use http://dumps.wikimedia.org/other/pagecounts-all-sites/ instead.
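
If you only need 2014 rather than mirroring the whole tree, the structured file names also let you build each URL directly. A minimal Python sketch, assuming the usual pagecounts-YYYYMMDD-HH0000.gz naming (some hours on the server are stamped a few seconds later, so URLs that fail to resolve are simply skipped):

import urllib.request
from datetime import datetime, timedelta

BASE = "http://dumps.wikimedia.org/other/pagecounts-raw"

t = datetime(2014, 1, 1)
while t.year == 2014:
    # Assumed file name pattern; some hours have timestamps off by a few seconds.
    name = t.strftime("pagecounts-%Y%m%d-%H0000.gz")
    url = "{0}/{1}/{1}-{2:02d}/{3}".format(BASE, t.year, t.month, name)
    try:
        urllib.request.urlretrieve(url, name)
    except Exception as exc:
        print("skipped", url, exc)
    t += timedelta(hours=1)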

Nemo

I worked on this project: https://github.com/idio/wikiviews. You just call it like python wikiviews 2 2015 and it will download all the files for February 2015 and join them into a single file.
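
For the whole of 2014 you could wrap that in a small loop; this sketch assumes the python wikiviews <month> <year> invocation works exactly as described above:

import subprocess

# One wikiviews run per month of 2014; merging the twelve outputs is left to a later step.
for month in range(1, 13):
    subprocess.check_call(["python", "wikiviews", str(month), "2014"])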

David Przybilla