Regex removing bold markdown from inside codeblock only

Question

I'm bulk-editing some Markdown files to make them compliant with MkDocs syntax (Material theme).

My previous documentation software accepted bold inside code blocks, but I have now discovered that this is far from standard.

This documentation has more than 10k code blocks spread over more than 300 md files in nested directories, and most of them use ** to bold some words.

To be precise, I need to turn every code block from this:

this is a **code block** with some commands

```my_lexer
enable
configure **terminal**
interface **g0/0**
```

to this:

this is a **code block** with some commands

```my_lexer
enable
configure terminal
interface g0/0
```

The fun parts:

- there are bold words in the rest of the document (outside code blocks) that I would like to keep
- not every line of a code block has bold in it
- not every code block necessarily has bold in it

Right now I'm using Visual Studio Code with replace in files, and most of the easy regexes I wrote for the port are working. But its regex syntax is not quite standard (for example, groups are referenced with $1 instead of \1, and there may be other differences I don't know about).
I'm open to other software (regex flavors) too, if it is more regex compliant and supports 'replace in all files and subdirectories' (like Notepad++, Atom, etc.).

Sadly, I don't even know how to start on something this complicated.
The most advanced attempt I have is this: https://regex101.com/r/vRnkop/1 (it also contains the text I'm using to test it)

(^```.*\n)(.*?\*\*(.*?)\*\*.*$\n)*

I doubt this is a good start!

Thanks

Can `**` appear in code blocks when it is actually code (e.g. C pointers)? Can `**` span lines? – jhnc, Aug 21 '22 at 17:49

score 2 · Answer 1 · answered Aug 21 '22 at 17:49


Visual Studio Code is not my forte, but I did read that you should be able to use PCRE2 regex syntax. Therefore, try substituting the following pattern with an empty string:

\*\*(?=(((?!^```).)*^```)(?:(?1){2})*(?2)$)

See an online demo. The pattern seems a bit rocky, and maybe someone else knows a much simpler one. However, I wanted to make sure it would both leave italics alone and turn bold+italic into italic. Note that . matches newlines here.
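For a regex flavor without subroutine calls such as (?1), the same goal can be approached differently. The following is a minimal Python sketch, not the pattern above: it matches each fenced block as a whole and deletes ** inside it through a replacement callback. It assumes that fences always start at the beginning of a line and always come in pairs; the names FENCE, BLOCK and strip_bold_in_blocks are made up for this illustration.

```python
import re

# Three backticks, built this way so the example contains no literal fence.
FENCE = "`" * 3

# One whole fenced block: a line starting with a fence, then (lazily)
# everything up to the next line that starts with a fence.
BLOCK = re.compile(
    r"^{f}.*?^{f}".format(f=re.escape(FENCE)),
    flags=re.MULTILINE | re.DOTALL,
)


def strip_bold_in_blocks(text):
    """Delete every ** found inside fenced blocks; other text is untouched."""
    return BLOCK.sub(lambda m: m.group(0).replace("**", ""), text)


sample = "\n".join([
    "this is a **code block** with some commands",
    FENCE + "my_lexer",
    "enable",
    "configure **terminal**",
    "interface **g0/0**",
    FENCE,
    "",
])
print(strip_bold_in_blocks(sample))
```

Because only the literal two-character sequence is deleted, single-asterisk italics are left alone and `***word***` inside a block becomes `*word*`.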

answered Aug 21 '22 at 17:49

JvdV


This returns an error: Invalid regular expression, invalid group. – Raikoug Aug 21 '22 at 18:56
If we select .NET in the demo (the VS Code regex flavor), it returns the same error. This is a very nice regex indeed, thanks a lot for helping! – Raikoug Aug 21 '22 at 19:12

jhnc · Accepted Answer · 2022-08-21T19:41:29.100

1

If you have Unix tools like sed, it is quite easy:

sed '/^```my_lexer/,/^```/ s/\*\*//g' orig.md >new.md

/regex1/,/regex2/ cmd looks for a group of lines where the first line matches the first regex and the final line matches the second regex, and then runs cmd on each of them. This limits the replacements to the relevant sections of the file.
s/\*\*//g does search and replace (I have assumed any instance of ** should be deleted).
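To make the two-address behaviour concrete, here is a rough Python sketch of the same idea. It is only an emulation written for illustration, not how sed is implemented, and sub_in_range is a made-up helper name. As in sed, the end regex is only tested on lines after the one that opened the range.

```python
import re

# Three backticks, built this way so the example contains no literal fence.
FENCE = "`" * 3


def sub_in_range(lines, start, end, pattern, repl):
    """Apply re.sub(pattern, repl, line) only to lines inside start..end ranges."""
    out = []
    in_range = False
    for line in lines:
        if in_range:
            selected = True
            # The range closes on the first later line matching `end`.
            if re.search(end, line):
                in_range = False
        elif re.search(start, line):
            # The line that opens the range is itself part of the range.
            selected = True
            in_range = True
        else:
            selected = False
        out.append(re.sub(pattern, repl, line) if selected else line)
    return out


doc = [
    "some **bold** text",
    FENCE + "my_lexer",
    "configure **terminal** and **more**",
    FENCE,
    "more **bold** text",
]
result = sub_in_range(doc, "^" + FENCE + "my_lexer", "^" + FENCE, r"\*\*", "")
print("\n".join(result))
```

Only the lines between the opening and closing fence are rewritten, and every ** on those lines is removed because re.sub replaces all matches, which corresponds to the g flag.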

Some versions of sed allow "in-place" editing with -i. For example, to edit file.md and keep the original version as file.md.orig:

sed -i.orig '...' file.md

and you can edit multiple files with something like:

find . -name '*.md' -exec sed -i.orig '...' {} +
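Where sed is not available, the same bulk edit could be sketched in Python. This is an illustrative sketch under several assumptions: files are UTF-8, every fence starts at the beginning of a line, and fences are always paired. The helper names strip_bold_in_fences and process_tree are hypothetical. Unlike sed -i.orig, it writes a .orig backup only for files it actually changes.

```python
from pathlib import Path
import shutil

# Three backticks, built this way so the example contains no literal fence.
FENCE = "`" * 3


def strip_bold_in_fences(text):
    """Remove ** on every line from an opening fence through its closing fence."""
    inside = False
    out = []
    for line in text.splitlines(keepends=True):
        is_fence = line.startswith(FENCE)
        if inside or is_fence:
            line = line.replace("**", "")
        if is_fence:
            inside = not inside  # each fence line toggles the state
        out.append(line)
    return "".join(out)


def process_tree(root):
    """Rewrite every *.md file under root in place; return the changed paths."""
    changed = []
    for path in sorted(Path(root).rglob("*.md")):
        original = path.read_text(encoding="utf-8")
        updated = strip_bold_in_fences(original)
        if updated != original:
            # Keep a copy of the original next to the file, e.g. doc.md.orig.
            shutil.copy2(path, path.with_name(path.name + ".orig"))
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed

# Hypothetical usage: process_tree("docs")
```

Running it on a copy of the documentation first is a sensible precaution, since an unpaired fence would flip the inside/outside state for the rest of that file.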

edited Aug 21 '22 at 19:41

answered Aug 21 '22 at 18:40

jhnc


I'm using this, but it only removes the first instance per row; running it twice does what I need! This is awesome, MANY thanks! But I can't understand how sed works like this: normally I search globally, but here it seems you set "limits" on its search, am I wrong? – Raikoug Aug 21 '22 at 19:10
I should have added the `g` flag to `s///` to delete all instances – jhnc Aug 21 '22 at 19:41
Yes, sed commands take zero, one or two ["addresses"](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html#tag_20_116_13_01). `s///` can take up to two: zero = every line, one = specific line, two = range of lines. – jhnc Aug 21 '22 at 19:43
I have just one problem: I noticed that not ALL my code blocks have a language, like in this pastebin https://pastebin.com/FajNQHmk, and the double stars in "normal text" get removed too (because I need to remove the "my_lexer" keyword from the first sed address). Now my options are either to put a language placeholder on the first triple-backtick (and I don't know how to distinguish between the first and the second!) or to parse everything with Python and do it there XD – Raikoug Aug 21 '22 at 20:12
`/^```/,/^```/ s/\*\*//g` should work, as long as they are always paired (It doesn't change the normal text in your pastebin example) – jhnc Aug 21 '22 at 20:17
Hi, I now had to add the language to all the code blocks, that is, to the first triple-backtick of each pair. I actually managed to do it with regex, Python, and bash, but I'm sure there is a cleverer way. This is the answer to my own question; if you have time, your opinion could help me improve my regex skills. Thanks a lot for everything: https://stackoverflow.com/questions/73442927/adding-language-to-markdown-codeblock-in-bulk – Raikoug Aug 22 '22 at 09:34
