
Wikipedia provides all its page views in hourly text files. (See for instance http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-01/)

For a project I need to extract keywords and their associated page views for the year 2014. But each file covers only one hour (so 24*365 files in total) and is ~80 MB, which makes this a hard task to do manually.

My questions: 1. Is there any way to download the files automatically? (The files are named in a structured way, so this could be helpful.)

Brian Tompsett - 汤莱恩

2 Answers


Download? Sure, that's easy:

wget -r -np http://dumps.wikimedia.org/other/pagecounts-raw/

Recursive wget does it. Note, these files are deprecated now; you probably want to use http://dumps.wikimedia.org/other/pagecounts-all-sites/ instead.
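
If you only need 2014 rather than mirroring the whole tree, the structured file names also let you build each URL directly. A minimal Python sketch, assuming the usual pagecounts-YYYYMMDD-HH0000.gz naming (some hours on the server are stamped a few seconds later, so URLs that fail to resolve are simply skipped):

import urllib.request
from datetime import datetime, timedelta

BASE = "http://dumps.wikimedia.org/other/pagecounts-raw"

t = datetime(2014, 1, 1)
while t.year == 2014:
    # Assumed file name pattern; some hours have timestamps off by a few seconds.
    name = t.strftime("pagecounts-%Y%m%d-%H0000.gz")
    url = "{0}/{1}/{1}-{2:02d}/{3}".format(BASE, t.year, t.month, name)
    try:
        urllib.request.urlretrieve(url, name)
    except Exception as exc:
        print("skipped", url, exc)
    t += timedelta(hours=1)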

Nemo

I worked on this project: https://github.com/idio/wikiviews. You just call it like python wikiviews 2 2015 and it will download all the files for February 2015 and join them into a single file.
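
For the whole of 2014 you could wrap that in a small loop; this sketch assumes the python wikiviews <month> <year> invocation works exactly as described above:

import subprocess

# One wikiviews run per month of 2014; merging the twelve outputs is left to a later step.
for month in range(1, 13):
    subprocess.check_call(["python", "wikiviews", str(month), "2014"])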

David Przybilla