
I'm currently using the Wikipedia API to fetch content for my website. At the moment everything I get back is either HTML or wikitext, and both are full of Wikipedia hyperlinks and other junk. Is there a way to get just the plain text without all of this?

I have tried fetching the HTML and converting it to plain text, but the result still contains wiki junk (citation markers, edit links, and so on). Ideally I'd like a universal method that strips all of it, since I want to be able to call many different Wikipedia pages and get plain text for each of them.
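For reference, my HTML-to-text attempt looks roughly like this (a minimal Python sketch; the sample string is made up to resemble what `prop=text` returns, it is not real API output). Stripping the tags works, but inline residue such as citation markers survives in the text:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content of an HTML fragment, discarding all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)

# Made-up snippet resembling the API's prop=text output: the tags go away,
# but the "[1]" reference marker is plain text and survives as junk.
html = ('<p>The third season was commissioned on 10 April 2012'
        '<sup class="reference">[1]</sup>.</p>')
parser = TextExtractor()
parser.feed(html)
print(parser.text())
# prints "The third season was commissioned on 10 April 2012[1]."
```

This is exactly the problem: the junk isn't in the markup, it's in the text itself, so naive tag stripping can't remove it.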

HTML:

https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Game_of_Thrones_(season_3)&prop=text&section=1&disabletoc=1

Wikitext:

https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Game_of_Thrones_(season_3)&prop=wikitext&section=1&disabletoc=1
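Since I want to reuse this for multiple pages, I build those request URLs programmatically rather than hard-coding them. A sketch of that part (the helper name `parse_url` is my own, not part of the API):

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def parse_url(page, prop, section=1):
    """Build an action=parse URL for the given page title and prop
    ("text" or "wikitext"), percent-encoding the title."""
    params = {
        "action": "parse",
        "format": "json",
        "page": page,
        "prop": prop,
        "section": section,
        "disabletoc": 1,
    }
    return API + "?" + urlencode(params)

print(parse_url("Game_of_Thrones_(season_3)", "text"))
print(parse_url("Game_of_Thrones_(season_3)", "wikitext"))
```

Swapping the `page` value is easy; the hard part is cleaning whatever comes back.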

I hope this makes sense, any advice/guidance is greatly appreciated.
