I am currently using pywikibot to obtain the categories of a given Wikipedia page (e.g., support-vector machine) as follows:

import pywikibot as pw

print([i.title() for i in list(pw.Page(pw.Site('en'), 'support-vector machine').categories())])

The result I get is:

[
  'Category:All articles with specifically marked weasel-worded phrases',
  'Category:All articles with unsourced statements',
  'Category:Articles with specifically marked weasel-worded phrases from May 2018',
  'Category:Articles with unsourced statements from June 2013',
  'Category:Articles with unsourced statements from March 2017',
  'Category:Articles with unsourced statements from March 2018',
  'Category:CS1 maint: Uses editors parameter',
  'Category:Classification algorithms',
  'Category:Statistical classification',
  'Category:Support vector machines',
  'Category:Wikipedia articles needing clarification from November 2017',
  'Category:Wikipedia articles with BNF identifiers',
  'Category:Wikipedia articles with GND identifiers',
  'Category:Wikipedia articles with LCCN identifiers'
]

As you can see, the results include a lot of Wikipedia's tracking and maintenance categories, such as:

  • Category:All articles with specifically marked weasel-worded phrases
  • Category:All articles with unsourced statements
  • Category:CS1 maint: Uses editors parameter
  • etc.

However, the only categories I am interested in are:

  • Category:Classification algorithms
  • Category:Statistical classification
  • Category:Support vector machines

Is there a way to get all of Wikipedia's tracking or maintenance categories, so that I can remove them from the results and keep only the informative categories?

Alternatively, please suggest any other way of eliminating them from the results.

I am happy to provide more details if needed.

EmJ
  • Are `tracing` and `maintenance` artifacts of the actual library you are using, or your own terminology? If the library doesn't provide further identification of categories, you could simply filter based on known keywords in a list comprehension. E.g. `[cat for cat in categories if not any(exclude_keyword in cat for exclude_keyword in ['disputed', 'maintenance', ...])]` – jpriebe Feb 05 '19 at 03:18
  • @jpriebe thanks a lot for the comment. Actually, `tracking` and `maintenance` are words I took from Wikipedia (see this link: https://en.wikipedia.org/wiki/Category:CS1_maint:_Uses_editors_parameter). It seems Wikipedia classifies such categories as `tracking/maintenance` – EmJ Feb 05 '19 at 03:27
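
The keyword filter suggested in the first comment could look roughly like the sketch below. The helper name `drop_by_keyword` and the keyword list are assumptions for illustration: the keywords are taken only from the output shown in the question and are not an exhaustive list of Wikipedia's maintenance categories.

```python
# Substrings seen in the maintenance categories of the question's output.
# This list is an assumption for illustration, not a complete inventory.
EXCLUDE_KEYWORDS = ['All articles with', 'Articles with', 'CS1 maint', 'Wikipedia articles']


def drop_by_keyword(titles, keywords=EXCLUDE_KEYWORDS):
    """Keep only the titles that contain none of the given keywords."""
    return [t for t in titles if not any(k in t for k in keywords)]


titles = [
    'Category:All articles with unsourced statements',
    'Category:Articles with unsourced statements from June 2013',
    'Category:CS1 maint: Uses editors parameter',
    'Category:Classification algorithms',
    'Category:Statistical classification',
    'Category:Support vector machines',
    'Category:Wikipedia articles with GND identifiers',
]
filtered = drop_by_keyword(titles)
print(filtered)
```

This approach depends entirely on naming conventions, so any maintenance category whose title does not match a listed keyword will slip through.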

1 Answer

pywikibot does not currently provide a built-in way to filter out hidden categories. You can do it manually by checking for the `hidden` key in each category's `categoryinfo`:

import pywikibot as pw

site = pw.Site('en', 'wikipedia')
print([
    cat.title()
    for cat in pw.Page(site, 'support-vector machine').categories()
    if 'hidden' not in cat.categoryinfo
])

gives:

['Category:Classification algorithms', 
 'Category:Statistical classification', 
 'Category:Support vector machines']

See https://www.mediawiki.org/wiki/Help:Categories#Hidden_categories and https://en.wikipedia.org/wiki/Wikipedia:Categorization#Hiding_categories for more info.
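
If the same filter is needed for several pages, the check can be wrapped in a small helper. This is only a sketch and not part of pywikibot: the name `visible_titles` is hypothetical, and it assumes nothing more than that each category object exposes `title()` and a dict-like `categoryinfo`, as in the snippet above. The `FakeCategory` class and its sample values are made up so that the example runs without network access.

```python
def visible_titles(categories):
    """Return the titles of categories whose categoryinfo has no 'hidden' key."""
    return [cat.title() for cat in categories if 'hidden' not in cat.categoryinfo]


class FakeCategory:
    """Offline stand-in (hypothetical) exposing the two attributes used above."""

    def __init__(self, title, categoryinfo):
        self._title = title
        self.categoryinfo = categoryinfo

    def title(self):
        return self._title


# Illustrative data only; the page counts are invented.
sample = [
    FakeCategory('Category:Classification algorithms', {'pages': 10}),
    FakeCategory('Category:All articles with unsourced statements',
                 {'pages': 10, 'hidden': ''}),
]
result = visible_titles(sample)
print(result)
```

With a real site object, it would be called as `visible_titles(pw.Page(site, 'support-vector machine').categories())`.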

AXO