1

I am trying to scrape Wikitravel for specific data, such as the Climate and Get in sections. I have managed to get the XML for a page using Special:Export.

http://wikitravel.org/en/Special:Export/San_Francisco gives me the data as XML, but the page text inside it is still wiki markup. I searched for a way to extract that text and could not find a suitable solution.

I tried writing a PHP function with regular expressions to convert the markup into HTML, but the output is not uniform, which makes it very difficult to select specific data.

I also tried building a MediaWiki API URL so that I could query the content programmatically: http://wikitravel.org/en/api.php?format=xml&action=query&titles=Main%20Page&prop=revisions&rvprop=content But it does not work.

Can you please help me with this? Has anyone successfully scraped Wikipedia? Is there a tutorial or any other technique I can refer to?

hippietrail
hungry fish

2 Answers

1

There's a similar question here: Where can I find a good MediaWiki Markup parser in PHP?

I also found this: https://github.com/codeholic/w/blob/master/creole.php, which came from: http://www.ivan.fomichev.name/2010/02/php-creole-10-wiki-markup-parser.html
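If only a few named sections are needed, it may be simpler to split the raw wiki markup on its headings than to convert everything to HTML first. The following is a minimal sketch in Python, assuming that the sections of interest are top-level `==Heading==` blocks; the function name `split_sections` and the sample text are made up for illustration, and real pages may need extra handling for templates and nested markup.

```python
import re

# Matches level-2 headings such as "==Climate==" or "== Get in ==".
# Deeper headings ("===Fog===") are deliberately not matched, so they
# stay inside the body of their parent section.
HEADING_RE = re.compile(r"^==[ \t]*([^=].*?)[ \t]*==[ \t]*$", re.MULTILINE)


def split_sections(wikitext):
    """Return a dict mapping each level-2 heading to its raw markup body."""
    sections = {}
    matches = list(HEADING_RE.finditer(wikitext))
    for i, match in enumerate(matches):
        # A section runs until the next level-2 heading or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
        sections[match.group(1)] = wikitext[match.end():end].strip()
    return sections


if __name__ == "__main__":
    # Hypothetical sample markup, not copied from a real page.
    sample = (
        "Intro text.\n\n"
        "==Climate==\nMild year-round.\n\n"
        "===Fog===\nCommon in summer.\n\n"
        "==Get in==\nBy plane or train.\n"
    )
    for name, body in split_sections(sample).items():
        print(name, "->", repr(body))
```

Text before the first heading is ignored here; the section bodies are still wiki markup, so a markup parser would only have to deal with the parts that are actually needed.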

This sounds like a frustrating endeavour, I wish you the best of luck!

Community
jon
0

Wikitravel's MediaWiki API is at http://wikitravel.org/wiki/en/api.php, so try this instead:

http://wikitravel.org/wiki/en/api.php?format=xml&action=query&titles=Main%20Page&prop=revisions&rvprop=content

You will want to use an API client; see http://www.mediawiki.org/wiki/API:Client_code for a selection. Also beware that Wikitravel runs a very old version of MediaWiki (1.11), so many operations in the modern API do not work.
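For a one-off request, the query URL above can also be built and its response read with the Python standard library alone. This is only a sketch: the helper names are hypothetical, and it assumes the XML response nests the page text in a `<rev>` element under `<revisions>`, which is the usual layout for `prop=revisions&rvprop=content` but has not been checked against Wikitravel's old installation.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Endpoint taken from the answer above.
API_URL = "http://wikitravel.org/wiki/en/api.php"


def build_query_url(title, api_url=API_URL):
    """Build the revisions query URL for a single page title."""
    params = {
        "format": "xml",
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvprop": "content",
    }
    # quote_via=quote encodes spaces as %20 rather than "+".
    query = urllib.parse.urlencode(params, quote_via=urllib.parse.quote)
    return api_url + "?" + query


def extract_wikitext(xml_text):
    """Return the text of the first <rev> element, or None if there is none.

    Assumes the api > query > pages > page > revisions > rev layout.
    """
    root = ET.fromstring(xml_text)
    rev = root.find(".//rev")
    return rev.text if rev is not None else None


def fetch_wikitext(title):
    """Fetch one page's wiki markup over the network (not exercised here)."""
    with urllib.request.urlopen(build_query_url(title)) as response:
        return extract_wikitext(response.read())


if __name__ == "__main__":
    print(build_query_url("Main Page"))
```

The returned string is still wiki markup, so it needs a markup parser or some section-level splitting before specific fields can be pulled out.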

lambshaanxy