
I'm currently using the Wikipedia API to fetch content for my website. At the moment everything I get back is either HTML or wikitext, and both are full of Wikipedia hyperlinks and other junk. Is there a way to get just the plain text without all of this?

I have tried fetching the HTML and converting it to plain text, but the result still contains wiki junk (citation markers, edit links, and so on). Ideally I'd like a universal method that strips all of it, since I want to be able to call many different Wikipedia pages and get plain text for each of them.
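For reference, my HTML-to-text attempt looks roughly like this (a minimal Python sketch; the sample string is made up to resemble what `prop=text` returns, it is not real API output). Stripping the tags works, but inline residue such as citation markers survives in the text:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content of an HTML fragment, discarding all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)

# Made-up snippet resembling the API's prop=text output: the tags go away,
# but the "[1]" reference marker is plain text and survives as junk.
html = ('<p>The third season was commissioned on 10 April 2012'
        '<sup class="reference">[1]</sup>.</p>')
parser = TextExtractor()
parser.feed(html)
print(parser.text())
# prints "The third season was commissioned on 10 April 2012[1]."
```

This is exactly the problem: the junk isn't in the markup, it's in the text itself, so naive tag stripping can't remove it.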

HTML:

https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Game_of_Thrones_(season_3)&prop=text&section=1&disabletoc=1

Wikitext:

https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Game_of_Thrones_(season_3)&prop=wikitext&section=1&disabletoc=1
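Since I want to reuse this for multiple pages, I build those request URLs programmatically rather than hard-coding them. A sketch of that part (the helper name `parse_url` is my own, not part of the API):

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def parse_url(page, prop, section=1):
    """Build an action=parse URL for the given page title and prop
    ("text" or "wikitext"), percent-encoding the title."""
    params = {
        "action": "parse",
        "format": "json",
        "page": page,
        "prop": prop,
        "section": section,
        "disabletoc": 1,
    }
    return API + "?" + urlencode(params)

print(parse_url("Game_of_Thrones_(season_3)", "text"))
print(parse_url("Game_of_Thrones_(season_3)", "wikitext"))
```

Swapping the `page` value is easy; the hard part is cleaning whatever comes back.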

I hope this makes sense, any advice/guidance is greatly appreciated.
