
Wiki markup stores links between [[ and ]]; that is, if I write [[PageTitle]], Wikipedia generates a link to an internal page called PageTitle. However, [[ ]] can also produce other kinds of links, such as Categories, Files, Help, Special, etc.

To exclude these I have come up with the following regex:

\[\[(?!Category|Wikipedia|File|Help|User talk|Special)(.*?)\]\]

This works fine for most scenarios, except references (which I do not want in the first place). References are stored in a ref XML tag (<ref></ref>). For example:

<ref>"The remedy has been found: libertarian communism."
[http://www.theanarchistlibrary.org/HTML/Sebastien_Faure__Libertarian_Communism.html 
[[Sébastien Faure]. "Libertarian Communism"]</ref>

Ideally, I would skip references completely, but at the very least, matching only items that do not contain ] inside the link text would probably solve this.

I know that most of you will tell me not to use regex to parse wiki markup. However, I will be parsing all links within Wikipedia (through their XML dump), so the lighter I can make this code, the better.

Daryl
  • what language/tools are you using? – Bohemian Mar 17 '14 at 22:58
  • C#... although I would consider alternatives if need be... – Daryl Mar 17 '14 at 23:00
  • First of all, [wiki syntax isn't regular](http://stackoverflow.com/a/1732454/764357), but if you want to [treat it as a regular language you can; you just need to be able to express the *exact* things you need to extract](http://stackoverflow.com/a/1733489/764357). If you can add that to the question, it is much more likely to elicit an accurate answer. –  Mar 17 '14 at 23:14

1 Answer

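Since it looks like you can use lookarounds, you can replace the lazy .*? with a group that refuses to consume [ or ], so a match can never run across a stray bracket such as the one in your reference example.
A test case is included further down.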

 # \[\[(?!Category|Wikipedia|File|Help|User\ talk|Special)((?:(?![\[\]]).)*)\]\]

 \[\[
 (?!
      Category
   |  Wikipedia
   |  File
   |  Help
   |  User\ talk
   |  Special
 )
 (
      (?:
           (?! [\[\]] )
           . 
      )*
 )
 \]\]
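
For readers who are not using Perl, the expanded pattern above can be tried in Python more or less unchanged. This is only a sketch: `WIKILINK` and `extract_links` are names made up for illustration, and it assumes the wikitext is already available as a decoded string. Python's `re.VERBOSE` flag ignores unescaped whitespace, which is why the space in `User\ talk` stays escaped.

```python
import re

# Assumed port of the expanded pattern; WIKILINK and extract_links are
# illustrative names, not part of any library.
WIKILINK = re.compile(
    r"""
    \[\[
    (?! Category | Wikipedia | File | Help | User\ talk | Special )
    (
        (?: (?! [\[\]] ) . )*    # any character that is not a square bracket
    )
    \]\]
    """,
    re.VERBOSE,
)


def extract_links(text):
    """Return the captured text of every link outside the excluded namespaces."""
    # With a single capturing group, findall returns a list of that group.
    return WIKILINK.findall(text)
```

Note that piped links such as `[[Page|label]]` would come back as `Page|label`, so splitting on `|` afterwards is probably still needed.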

Perl test case

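use strict;
use warnings;

# Slurp the whole DATA section into one string.
local $/ = undef;
my $str = <DATA>;

# Print the captured text of every link outside the excluded namespaces.
while ( $str =~ /\[\[(?!Category|Wikipedia|File|Help|User\ talk|Special)((?:(?![\[\]]).)*)\]\]/g )
{
    print "$1\n";
}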


__DATA__

[[Link 1]] and [[Link 2]] 
<ref>"The remedy has been found: libertarian communism."
[http://www.theanarchistlibrary.org/HTML/Sebastien_Faure__Libertarian_Communism.html 
[[Sébastien Faure]. "Libertarian Communism"]</ref>
[[Link 3]] and [[Link 4]] 

Output >>

Link 1
Link 2
Link 3
Link 4
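
If skipping references completely is the real goal, as the question suggests, one possible approach is to remove the ref elements before looking for links at all. The following Python sketch does that under a few assumptions: the page text has already been XML-decoded (so it contains a literal `<ref>` rather than `&lt;ref&gt;`), ref elements are not nested, and the helper names `strip_refs` and `links_outside_refs` are hypothetical.

```python
import re

# Self-closing references such as <ref name="a" />.
REF_SELF_CLOSING = re.compile(r"<ref\b[^>]*/>", re.IGNORECASE)
# Paired references; DOTALL lets the body span several lines.
REF_BLOCK = re.compile(r"<ref\b[^>]*>.*?</ref>", re.DOTALL | re.IGNORECASE)
# Same link pattern as in the answer, in its compact form.
LINK = re.compile(
    r"\[\[(?!Category|Wikipedia|File|Help|User talk|Special)"
    r"((?:(?![\[\]]).)*)\]\]"
)


def strip_refs(text):
    """Drop ref elements; self-closing ones go first so they cannot open a block."""
    text = REF_SELF_CLOSING.sub("", text)
    return REF_BLOCK.sub("", text)


def links_outside_refs(text):
    """Return link targets found outside of any ref element."""
    return LINK.findall(strip_refs(text))
```

This costs an extra pass over each page, so whether it is light enough for a full dump would need measuring.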
  • Just tried this one, it seems to start off from the very first [[ within the text, and ends at the last ]]. So if I have [[Link 1]] and [[Link 2]] it would give me "[[Link 1]] and [[Link 2]]". – Daryl Mar 17 '14 at 23:14
  • Really? Let me put up a test case for you. –  Mar 17 '14 at 23:16
  • My bad!! I had a space somewhere...sorry. It looks like it works fine. Thanks – Daryl Mar 17 '14 at 23:20
  • OK, test case added. –  Mar 17 '14 at 23:24