
I'm currently trying to convert a wikitext table to HTML. (Parsoid isn't an option.)

The tables are written in the format below. I want to use regex for speed, but I need a way to capture the text between common delimiters.

{| class=\"wikitable\"
|-
|'''Ruler'''
|'''Stopwatch'''
|'''Magnifying Glass'''
|-
|[[File:Ruler30cmDiagonal.png|center|200px]]
|[[File:Stopwatch.png|center|200px]]
|[[File:MagnifyingGlass.png|center|200px]]
|-
|A ruler is a piece of '''equipment''' used to measure length.
|A scientist came '''equip''' with a [[stopwatch]].
|A magnifying glass is a useful piece of '''equipment''' for looking at very small things.
|}

From the above, I need to match the text between the '|-' substrings, with the last match finishing at the '|}'.

So the matches will be

|'''Ruler'''
|'''Stopwatch'''
|'''Magnifying Glass'''

and

|A ruler is a piece of '''equipment''' used to measure length.
|A scientist came '''equip''' with a [[stopwatch]].
|A magnifying glass is a useful piece of '''equipment''' for looking at very small things.

and

|[[File:Ruler30cmDiagonal.png|center|200px]]
|[[File:Stopwatch.png|center|200px]]
|[[File:MagnifyingGlass.png|center|200px]]

As you can see, matching on the '|' character alone causes complications, so the matching needs to be done on character pairs. (I will also need to match on '\n|' in a later match/replace call.)

I've spent a good few hours on this. I know I'll need a lookahead and a lookbehind (with an or for |- and |}). The most likely candidate I've come up with is /((?=(\|\-))[.]*)(?!(\|\-|\|\}))/mg, but no joy.

Any advice?

LeosSire
  • I always suggest not using regex directly if you are trying to write a parser. There are useful online tools that guide you in building a simple parser from a basic grammar, like [PEG.js](https://pegjs.org/online) for example. Trying to parse everything with a regex is a huge and thankless job. If you are lucky, since wikitext tables are public domain objects, you may find an existing implementation – DDomen Feb 06 '21 at 01:33
  • Perhaps `(?<=\|-\n).*?(?=\s*\|[-}])`? https://regex101.com/r/uIvAN4/1/ – Nick Feb 06 '21 at 01:42
  • Does this answer your question? [Regular expression to get a string between two strings in Javascript](https://stackoverflow.com/questions/5642315/regular-expression-to-get-a-string-between-two-strings-in-javascript) – Nick Feb 06 '21 at 01:43
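
The lookbehind/lookahead pattern suggested in the comments above can be tried directly in JavaScript. This is only a minimal sketch, assuming every '|-' row marker sits on its own line, line endings are '\n', and no cell begins with '-' or '}'. The names wikitext, rowPattern and rows are made up for illustration, and the sample table is shortened.

```javascript
// Shortened sample table, same format as in the question.
const wikitext = `{| class="wikitable"
|-
|'''Ruler'''
|'''Stopwatch'''
|-
|A ruler is a piece of '''equipment''' used to measure length.
|A scientist came '''equip''' with a [[stopwatch]].
|}`;

// (?<=\|-\n)      lookbehind: a row starts right after a "|-" line
// .*?             lazy body; the "s" flag lets "." match newlines too
// (?=\s*\|[-}])   lookahead: stop before the next "|-" or the closing "|}"
const rowPattern = /(?<=\|-\n).*?(?=\s*\|[-}])/gs;

// matchAll requires the "g" flag; m[0] is the whole matched row block
const rows = [...wikitext.matchAll(rowPattern)].map((m) => m[0]);

console.log(rows.length);
console.log(rows[0]);
```

Each entry in rows is one block of '|'-prefixed cell lines, which can then be split on '\n|' in a later step. Lookbehind assertions need a reasonably recent JavaScript engine.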

1 Answer

I think regex is well suited for this task. As an added benefit, it should be much faster than a Lex and Yacc approach. The following code uses several regexes to render your wikitext as HTML:

let input = `{| class=\"wikitable\"
|-
|'''Ruler'''
|'''Stopwatch'''
|'''Magnifying Glass'''
|-
|[[File:Ruler30cmDiagonal.png|center|200px]]
|[[File:Stopwatch.png|center|200px]]
|[[File:MagnifyingGlass.png|center|200px]]
|-
|A ruler is a piece of '''equipment''' used to measure length.
|A scientist came '''equip''' with a [[stopwatch]].
|A magnifying glass is a useful piece of '''equipment''' for looking at very small things.
|}`;

let classAttr = '';
let html = '<table>\n  ' + input
  .split(/[\r\n]+\|[\-\}]/)
  .filter((row, idx) => {
    if(idx === 0) {
      // class row on first line
      let m = row.match(/class=.?"([a-zA-Z_\- ]+)/);
      if(m) {
        // save the table class attribute for later use
        classAttr = ' class="' + m[1] + '"';
      }
      return false;
    } else if(row.length) {
      return true;
    }
    return false; // remove empty rows
  })
  .map((row) => {
    row = row
      .split(/[\r\n]+\|/)
      .slice(1) // remove first empty item, not a cell
      .map((cell) => {
        cell = '\n    <td> '
          + cell // do additional cell rendering as needed
          + ' </td>';
        return cell;
      })
      .join('');
    return '<tr>' + row + '\n  </tr>';
  })
  .join('\n  ') + '\n</table>';
// insert the table class attribute (if any)
html = html.replace(/(?<=<table)/, classAttr);

console.log(html);

Result:

<table class="wikitable">
  <tr>
    <td> '''Ruler''' </td>
    <td> '''Stopwatch''' </td>
    <td> '''Magnifying Glass''' </td>
  </tr>
  <tr>
    <td> [[File:Ruler30cmDiagonal.png|center|200px]] </td>
    <td> [[File:Stopwatch.png|center|200px]] </td>
    <td> [[File:MagnifyingGlass.png|center|200px]] </td>
  </tr>
  <tr>
    <td> A ruler is a piece of '''equipment''' used to measure length. </td>
    <td> A scientist came '''equip''' with a [[stopwatch]]. </td>
    <td> A magnifying glass is a useful piece of '''equipment''' for looking at very small things. </td>
  </tr>
</table>

See the comment "// do additional cell rendering as needed" in the code: that is where you can add further rendering, such as bold text and links.
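
As a rough illustration of that extra rendering step, here is a minimal sketch of a cell renderer. The function name renderCell is hypothetical and not part of the code above, and the src and href values are placeholders: a real wiki would map file names and link targets to actual URLs, which is an assumption left open here. It only covers bold text, [[File:...]] images and [[links]].

```javascript
// Hypothetical helper: renders a small subset of wikitext inline markup
// found inside a single table cell.
function renderCell(cell) {
  return cell
    // escape HTML special characters first so cell text cannot inject markup
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    // [[File:Name.png|center|200px]] -> <img>; options after the name are ignored here
    .replace(/\[\[File:([^\]|]+)(?:\|[^\]]*)?\]\]/g, '<img src="$1" alt="$1">')
    // [[target]] or [[target|label]] -> <a>; the href is a placeholder, not a real wiki URL
    .replace(/\[\[([^\]|]+)(?:\|([^\]]+))?\]\]/g, (match, target, label) =>
      `<a href="${encodeURI(target)}">${label || target}</a>`)
    // '''bold''' -> <b>bold</b>
    .replace(/'''(.+?)'''/g, '<b>$1</b>');
}

console.log(renderCell("A scientist came '''equip''' with a [[stopwatch]]."));
console.log(renderCell('[[File:Stopwatch.png|center|200px]]'));
```

To use it, the cell value in the inner .map() would be passed through renderCell(cell) before being wrapped in the td tags. The File rule has to run before the generic link rule, otherwise image markup would be turned into a link.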

Peter Thoeny