
I need to get the number of daily pageviews of the English Wikipedia articles on "Dollar" and "Euro" from 06/2012 to 06/2016.

Raw dumps (*.bz2) are available at: https://dumps.wikimedia.org/other/pagecounts-ez/merged/

For example, https://dumps.wikimedia.org/other/pagecounts-ez/merged/pagecounts-2014-01-views-ge-5-totals.bz2 provides hourly/daily data for January 2014.

Problem: The unzipped files are too big to be opened in any text editor.

Desired solution: A Python script (or similar) that reads each of the .bz2 files, searches only for the English Wikipedia "Dollar" and "Euro" entries, and puts the daily pageviews into a DataFrame.

Hint: Using the Pageviews API (https://wikitech.wikimedia.org/wiki/Pageviews_API) won't help, as I need consistent data from before 2015. stats.grok data (http://stats.grok.se/) is not an option either, as the data it generates is different and incompatible.

osgx
JohnnyDeer
  • FWIW, [vim](http://www.vim.org/) can handle arbitrarily large files without any problem. – Tgr Aug 22 '16 at 10:41
  • There is no need of any of this, you can just `bzgrep` the files since every line is about a single page. A script is needed only if you want to process the data, e.g. summing up the pageviews for redirects. – Nemo Nov 19 '16 at 13:35

1 Answer


Probably the simplest solution would be to write your search script to read line by line from standard input (sys.stdin in Python; of course there's a Stack Overflow question about that too) and then pipe the output of bzcat to it:

$ bzcat pagecounts-2014-01-views-ge-5-totals.bz2 | python my_search.py

Just make sure that your Python code indeed processes the input incrementally, rather than trying to buffer the entire input in memory at once.
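As a rough illustration, a minimal my_search.py along these lines might look like the sketch below. It iterates over sys.stdin one line at a time, so memory use stays flat regardless of file size. The line layout ("project title ..." separated by spaces) and the project code "en.z" for English Wikipedia are assumptions about the dump format, not something confirmed in this thread; check a few lines of real output and adjust as needed. The helper name matching_lines is made up for this example.

```python
import sys

# Assumptions (not confirmed above): every line starts with
# "<project> <page_title> ..." and English Wikipedia is coded as "en.z".
PROJECT = "en.z"
TITLES = {"Dollar", "Euro"}


def matching_lines(lines, project=PROJECT, titles=TITLES):
    """Yield only the lines whose first two fields match project and title."""
    for line in lines:  # one line at a time; nothing is buffered
        fields = line.split(" ", 2)
        if (
            len(fields) >= 2
            and fields[0] == project
            and fields[1].strip() in titles
        ):
            yield line


if __name__ == "__main__":
    # Iterating over sys.stdin reads the piped bzcat output incrementally.
    for line in matching_lines(sys.stdin):
        sys.stdout.write(line)
```

Comparing whole fields rather than using a substring search avoids accidental hits on titles such as "Dollar_sign".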

This way, there's no need to complicate your Python script itself with any bzip2 specific code.

(This may also be faster than trying to do the bzip2 decoding in Python anyway, since the bzcat process can run in parallel with the search script.)
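For comparison, if you would rather keep everything inside Python (for example, to loop over all the monthly files from one script), the standard-library bz2 module can also decompress as a stream. This is only a sketch: the function name search_bz2 is hypothetical, and the same assumptions about the line layout and the "en.z" project code apply.

```python
import bz2


def search_bz2(path, project="en.z", titles=("Dollar", "Euro")):
    """Return matching lines from one .bz2 dump, decompressing as a stream.

    Assumes lines look like "<project> <page_title> ..." and that "en.z"
    denotes English Wikipedia; both are assumptions about the dump format.
    """
    hits = []
    # Text mode lets us iterate line by line without loading the whole file.
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            fields = line.split(" ", 2)
            if (
                len(fields) >= 2
                and fields[0] == project
                and fields[1].strip() in titles
            ):
                hits.append(line.rstrip("\n"))
    return hits
```

The returned lines can then be parsed and collected per month into whatever table structure you prefer.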

Community
Ilmari Karonen