The Problem
I'm trying to write a simple intermediary step in a Pandoc workflow. I have an original document in .docx which I'm converting to .md using the --track-changes switch (see Pandoc reader options for more information). This produces a markdown file in which MS Word insertions, deletions, and comments are wrapped in span tags, e.g.
[Insertion text]{.insertion id="1" author="Jamie Bowman" date="2019-04-01T11:05:00Z"}
[Deletion text]{.deletion id="1" author="Jamie Bowman" date="2019-04-01T11:05:00Z"}
[Comment body]{.comment-start id="1" author="Jamie Bowman" date="2019-04-01T11:05:00Z"}[]{.comment-end id="1"}
I want to run a regexp find and replace on the markdown file which effectively 'accepts' insertions and deletions but leaves the comment spans. (This is so that when I convert back to .docx, I have a clean .docx file with comments only.)
What I've tried
I have been able to accept all insertion spans and delete all deletion spans, but only when the body text does not run over more than one line. My attempt at matching across newlines matches too much, and I can't work out how to match only the exact text.
The following regexp matches almost all deletions which I can then replace with nothing:
Find: \[(.*?)\]{.deletion(.|\n)*?}
Replace:
The same goes for insertions, where I use a backreference to retain the text but remove the span:
Find: \[(.*?)\]{.insertion(.|\n)*?}
Replace: $1
The patterns are matching too much, though, as you can see here.
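One likely cause of the over-matching is that the lazy `.*?` is still allowed to cross a `]`, so a match can begin at an earlier `[` on the same line and run on until it finally reaches `]{.deletion`. A possible tightening, sketched here with Python's `re` module (an assumption, since the tool actually running the find and replace isn't stated), is to use negated character classes so that neither the body nor the attribute list can leave its own span. The names `DELETION`, `INSERTION`, `accept_changes`, and `sample` are hypothetical, and the sketch assumes span bodies contain no nested square brackets and attribute values contain no `}`.

```python
import re

# Negated character classes keep each match inside a single span:
# [^\[\]]* cannot cross a bracket, and [^}]* cannot cross the closing brace.
# Both classes also match newlines, so wrapped bodies or attributes are covered.
DELETION = re.compile(r"\[([^\[\]]*)\]\{\.deletion[^}]*\}")
INSERTION = re.compile(r"\[([^\[\]]*)\]\{\.insertion[^}]*\}")


def accept_changes(markdown: str) -> str:
    """Drop deletion spans, unwrap insertion spans, leave comment spans alone."""
    without_deletions = DELETION.sub("", markdown)
    return INSERTION.sub(r"\1", without_deletions)


# Hypothetical sample: a multi-line insertion, a deletion, and a comment.
sample = (
    'Keep [added\ntext]{.insertion id="1" author="Jamie Bowman" '
    'date="2019-04-01T11:05:00Z"} and [removed]{.deletion id="2" '
    'author="Jamie Bowman" date="2019-04-01T11:05:00Z"} here. '
    '[Comment body]{.comment-start id="3" author="Jamie Bowman" '
    'date="2019-04-01T11:05:00Z"}[]{.comment-end id="3"}'
)
print(accept_changes(sample))
```

In an editor whose regex flavor supports negated character classes, the same Find patterns should work with an empty Replace for deletions and `$1` for insertions, though that depends on the editor.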
Please let me know if anything is unclear. I've been working on this quite a bit today and it's difficult to explain the problem plainly! Thanks in advance.