
I have the complete Project Gutenberg English library as alphabetized CSV files with the columns id, title and text, where id has the form /ebooks/15809. I am also using the wikipedia Python package, which lets me get the full text of a page along with a lot of other details.

This is the first 10 books from Gutenberg -

    ['A Apple Pie',
     'A Apple Pie and Other Nursery Tales',
     'Aaron in the Wildwoods',
     'Aaron Rodd',
     "Aaron's Rod",
     'Aaron the Jew: A Novel',
     'Aaron Trow',
     'Abaft the Funnel',
     'Abandoned',
     'The Abandoned Country; or']

Now when I run pg = wikipedia.page('A Apple Pie'), I get the page for apple pie, the dessert, and not the book. Apparently, when we call wikipedia.page('xxxx'), the package runs wikipedia.search('xxxx'), which returns a list of search results, and then returns the wiki page for the first result. In this case the search results are -

>>> wikipedia.search('A Apple Pie')
['Apple pie', 'Pie', 'Apple Pie ABC', 'American Pie (film)', 'Sam Apple Pie', "Mom's Apple Pie", 'Apple Pie Hill', 'Pie à la Mode', 'Apple crisp', 'Pieing']
>>> 

Thus I actually need the third result in that list. One approach I have come up with is to compare the categories of each Gutenberg entry with the categories of the candidate Wikipedia pages.
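A rough sketch of that idea, scanning every search result instead of taking only the first one, might look like the following. The names `words`, `best_candidate`, `fetch_categories` and `min_overlap` are hypothetical and not part of the wikipedia package, and a plain count of shared words stands in for whatever similarity measure is eventually used. The category lookup is passed in as a function so the example runs offline; with the real package it could be something like `lambda t: wikipedia.page(t, auto_suggest=False).categories`, with error handling for disambiguation and missing pages added.

```python
import re
from typing import Callable, Iterable, List, Optional, Tuple


def words(labels: Iterable[str]) -> set:
    """Collect the lowercase alphabetic words (2+ letters) from a list of labels."""
    found = set()
    for label in labels:
        found.update(re.findall(r"[a-z]{2,}", label.lower()))
    return found


def best_candidate(
    candidates: Iterable[str],
    fetch_categories: Callable[[str], List[str]],
    subjects: List[str],
    min_overlap: int = 1,  # hypothetical cut-off; tune on real data
) -> Optional[Tuple[str, int]]:
    """Return (title, shared_word_count) for the best candidate, or None."""
    subject_words = words(subjects)
    best = None
    for title in candidates:
        overlap = len(subject_words & words(fetch_categories(title)))
        if overlap >= min_overlap and (best is None or overlap > best[1]):
            best = (title, overlap)
    return best


# Offline demo: only the 'Apple Pie ABC' categories are known here, so every
# other title falls back to an empty list (a stand-in, not real data).
known = {
    "Apple Pie ABC": ["Alphabet books", "British picture books",
                      "English nursery rhymes", "English children's songs"],
}
subjects = ["Children's poetry", "Nursery rhymes", "Alphabet rhymes", "Alphabet"]
search_results = ["Apple pie", "Pie", "Apple Pie ABC"]
print(best_candidate(search_results, lambda t: known.get(t, []), subjects))
```

Returning `None` when nothing clears the cut-off keeps a book with no Wikipedia page from being matched to an unrelated article.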

As for the first book in Gutenberg, these are the categories it falls in -

import requests
from bs4 import BeautifulSoup as bs

s = 'https://www.gutenberg.org/ebooks/15809'

page = requests.get(s)
soup = bs(page.content, 'html.parser')
bibrec_tbl = soup.find("table", {"class": "bibrec"})

# Print every metadata cell whose markup carries an itemprop or property attribute.
for td in bibrec_tbl.find_all('td'):
    lowered = str(td).lower()
    for attr in ('itemprop', 'property'):
        if attr in lowered:
            # Skip past attr=" and read the value up to the closing quote.
            a = lowered[lowered.find(attr) + len(attr) + 2 :]
            b = a[: a.find('"')]
            print(attr, '\t', b, '\t', td.text.strip())
            break



itemprop     creator     Greenaway, Kate, 1846-1901
itemprop     headline    A Apple Pie
property     dcterms:subject     Children's poetry
property     dcterms:subject     Nursery rhymes
property     dcterms:subject     Alphabet rhymes
property     dcterms:subject     Alphabet
property     dcterms:type    Text
itemprop     datepublished   May 10, 2005
property     dcterms:rights      Public domain in the USA.
itemprop     interactioncount    188 downloads in the last 30 days.
itemprop     pricecurrency   $0.00

And for the third Wikipedia result -

pg = wikipedia.page('Apple Pie ABC')
print(pg.categories)

['Alphabet books',
 'Articles with short description',
 'British picture books',
 'CS1 maint: discouraged parameter',
 'Commons category link is on Wikidata',
 "English children's songs",
 'English folk songs',
 'English nursery rhymes',
 'Short description matches Wikidata',
 "Traditional children's songs"]

So what I can do is compute the cosine similarity between the Gutenberg subjects and the Wikipedia categories of each candidate page, and accept a candidate as the match for the title if its score clears some threshold.
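As a minimal sketch of what that comparison could look like, the snippet below treats each list of labels as a bag of words and computes the cosine similarity between the two word-count vectors. The function names `tokenize` and `cosine_similarity` are my own, and the tokenization (lowercase words of two or more letters) is an assumption; stemming, stop-word removal, or dropping Wikipedia maintenance categories such as 'Articles with short description' would likely change the scores.

```python
import math
import re
from collections import Counter


def tokenize(labels):
    """Count the lowercase alphabetic words (2+ letters) across all labels."""
    counts = Counter()
    for label in labels:
        counts.update(re.findall(r"[a-z]{2,}", label.lower()))
    return counts


def cosine_similarity(labels_a, labels_b):
    """Cosine similarity between the word-count vectors of two label lists."""
    a, b = tokenize(labels_a), tokenize(labels_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Subjects and categories copied from the outputs shown above.
gutenberg_subjects = ["Children's poetry", "Nursery rhymes",
                      "Alphabet rhymes", "Alphabet"]
wikipedia_categories = [
    "Alphabet books", "Articles with short description",
    "British picture books", "CS1 maint: discouraged parameter",
    "Commons category link is on Wikidata", "English children's songs",
    "English folk songs", "English nursery rhymes",
    "Short description matches Wikidata", "Traditional children's songs",
]

print(cosine_similarity(gutenberg_subjects, wikipedia_categories))
```

The threshold itself is not something this sketch can supply; it would have to be chosen by checking scores against a handful of titles whose correct Wikipedia page is already known.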

Is there a better or more efficient way to do this? Thanks.

daddyodevil
  • Does it help to enter a more specific search term for Wikipedia, e.g. if you include the author's name? – Pranav Hosangadi Jun 28 '21 at 16:58
  • @PranavHosangadi No, it doesn't. `wikipedia.search('A Apple Pie Greenaway, Kate')` returns `['Apple pie', 'Apple Pie ABC', 'Kate Greenaway', 'Carnegie Medal (literary award)', "Guardian Children's Fiction Prize", 'List of cultural icons of England', '1971 in music', "List of children's literature writers", 'Culture of the United Kingdom', 'List of directorial debuts']` – daddyodevil Jun 29 '21 at 06:59
