Project Gutenberg sometimes includes the author and book title in a machine-readable form in its raw text files, but often it doesn't. I have a collection of Project Gutenberg raw text files that I would like to quote from using software (normally Python 3 or shell), and I would like to record the author and title alongside each quote for future reference. Would NLTK be able to do it?
-
The book page has that metadata: https://www.gutenberg.org/ebooks/100 – JonSG Feb 15 '22 at 17:40
-
What do these files look like? – Jan Wilamowski Feb 16 '22 at 05:35
-
The files are plain text with Project Gutenberg's addresses, but fields such as title and author are formatted inconsistently and are hard to parse. I couldn't use JonSG's suggestion for a scraper, as I would get IP banned. However, I found a much better way: simply hyperlink to the file I am using on Gutenberg's site. I wish I had a slicker way, but this works. – Ohiovr Mar 06 '22 at 23:03
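
For the files that do carry the fields, a rough sketch along these lines may recover them without touching the network. It assumes the common (but not universal) layout where `Title:` and `Author:` lines appear in the header before a `*** START OF` marker; the function name `guess_metadata` and the 5000-character fallback window are made up for illustration. Files that use a different layout simply come back with `None`, so the hyperlink approach is still needed as a fallback.

```python
import re

# Assumption: many (not all) Project Gutenberg files have "Title:" and
# "Author:" lines in the header that precedes the "*** START OF" marker.
FIELD_RE = re.compile(r"^(Title|Author):[ \t]*(.+)$", re.MULTILINE)
START_RE = re.compile(r"^\*\*\*\s*START OF", re.MULTILINE)


def guess_metadata(text):
    """Return {'title': ..., 'author': ...}; a value is None if not found."""
    start = START_RE.search(text)
    # Only look at the header; without a marker, fall back to the first
    # 5000 characters (an arbitrary window chosen for this sketch).
    header = text[:start.start()] if start else text[:5000]
    found = {"title": None, "author": None}
    for name, value in FIELD_RE.findall(header):
        key = name.lower()
        if found[key] is None:  # keep the first occurrence only
            found[key] = value.strip()
    return found


if __name__ == "__main__":
    import sys

    # Usage: python3 guess_metadata.py book1.txt book2.txt ...
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as handle:
            print(path, guess_metadata(handle.read()))
```

This only reads the first line of each field, so a title that wraps onto a second line will be truncated.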