Wiki markup writes links between [[ and ]]: if I write [[PageTitle]], Wikipedia generates a link to an internal page called PageTitle. However, [[ ]] is also used for other kinds of links, such as Categories, Files, Help, Special, etc.
To exclude these I have come up with the following regex:
\[\[(?!Category|Wikipedia|File|Help|User talk|Special)(.*?)\]\]
This works fine for most scenarios, except references (which I do not want in the first place). References are stored in the ref XML tag (<ref></ref>). For example:
<ref>"The remedy has been found: libertarian communism."
[http://www.theanarchistlibrary.org/HTML/Sebastien_Faure__Libertarian_Communism.html
[[Sébastien Faure]. "Libertarian Communism"]</ref>
Ideally, I would skip references completely, but at the very least, matching only items that do not contain ] inside the link text would probably solve this.
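To make both ideas concrete, here is a rough Python sketch of what I have in mind. It is not tested against the full dump. The names REF_RE, LINK_RE and internal_links are just placeholders I made up, and it assumes the page text has already been XML-unescaped (so the tags appear as <ref> rather than &lt;ref&gt;) and that references are never nested.

```python
import re

# Assumption: references look like <ref ...>...</ref> or <ref ... />.
# \b keeps <references /> from being treated as an opening <ref> tag.
REF_RE = re.compile(
    r"<ref\b[^>]*?/>|<ref\b[^>]*>.*?</ref>",
    re.DOTALL | re.IGNORECASE,
)

# Same namespace exclusions as my regex above, but the link text may not
# contain [ or ], so a stray "[[" cannot run on to a later "]]".
LINK_RE = re.compile(
    r"\[\[(?!Category|Wikipedia|File|Help|User talk|Special)([^\[\]]*?)\]\]"
)

def internal_links(wikitext):
    """Return the text inside every [[...]] link that is not excluded."""
    # Step 1: drop references entirely.
    without_refs = REF_RE.sub("", wikitext)
    # Step 2: collect the remaining internal links.
    return LINK_RE.findall(without_refs)
```

Calling internal_links(page_text) on one page's text returns the link targets, with any |label part still attached. Even without the first step, the character class in LINK_RE alone stops the malformed [[Sébastien Faure] fragment from matching.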
I know that most of you will tell me not to use regex to parse wiki markup. However, I will be parsing all links within Wikipedia (through their XML dump), so the lighter I can make this code, the better.