Goal

Using PowerShell, find a string in a file, run a simple transformation script on the string, and replace the original string with the new string in the same file.

Details

  • The file is a Markdown file with one or more HTML blocks inside.
  • The goal is to make the entire file Markdown with no HTML.
  • Pandoc is a command-line HTML-to-Markdown transformation tool that easily transforms HTML to Markdown.
  • The transformation script is a Pandoc script.
  • Pandoc alone cannot transform a Markdown file that includes HTML to Markdown.
  • Each HTML block is one long string with no line breaks (see example below).
  • The HTML is a little rough and sometimes not valid; despite this, Pandoc handles much of the transformation successfully. This may not be relevant.
  • I cannot change the fact that the file is generated originally as part Markdown/part HTML, that the HTML is sometimes invalid, or that each HTML block is all on one line.
  • PowerShell is required because that's the scripting language my team supports.

Example file of mixed Markdown/HTML code; most HTML is invalid

# Heading 1
Text

# Heading 2
<h3>Heading 3</h3><p>I am all on one line</h><span><div>I am not always valid HTML</div></span><br><h4>Heading 4<h4><ul><li>Item<br></li><li>Item</li><ul><span></span><img src="url" style="width:85px;">

# Heading 3
Text

# Heading 4
<h2>Heading 1</h2><div>Text</div><h2>Heading 2</h2><div>Text</div>

# Heading 5
<div><ul><li>Item</li><li>Item</li><li>Item</li></ul></div><code><pre><code><div>Code line 1</div><div>Code line 2</div><div>Code line 3</div></code></pre></code>

Text

Code for transformation script

pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers

Attempts

I surrounded each HTML block with <start> and <end> tags, with the goal of extracting the text between those tags with a regex, running the Pandoc script on it, and replacing the original text. My plan was to run a foreach loop to iterate through the blocks one by one.

This attempt transforms the HTML to Markdown, but does not return the original Markdown with it:

$file = 'file.md'
$regex = '<start>.*?<end>'
$a = Get-Content $file -Raw
$a | Select-String $regex -AllMatches | ForEach-Object {$_.Matches.Value} | pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers

This poor attempt seeks to perform the replace, but only returns the original file with no changes:

$file = 'file.md'
$regex = '<start>.*?<end>'
$content = Get-Content $file -Raw

$a = $content | Select-String $regex -AllMatches
$b = $a | ForEach-Object {$_.Matches } | Foreach-Object {$_.Value} | Select-Object | pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers

$content | ForEach-Object {
    $_ -replace $a,$b }

I am struggling to move beyond these attempts. I am new at PowerShell. If this approach is wrong entirely I would be grateful to know. Thank you for any advice.

hcdocs

1 Answer

Given the line-oriented nature of your input, you can process your input file line by line and decide for each line whether it needs transformation or not:

$file = 'file.md'
(Get-Content $file | ForEach-Object {
  if ($_ -match '^<') { # Is this an HTML line? - you could make this regex stricter
    $_ | pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers
  } else { # A non-HTML line, pass through as-is
    $_
  }
}) | Set-Content -Encoding Utf8 $file # be sure to choose the desired encoding

Note the (...) around the pipeline before Set-Content: the parentheses ensure that $file is read into memory in full before anything is written, which is what allows writing back to the same file. This convenient approach carries a slight risk of data loss if the command is interrupted before writing completes, so always create a backup of the input file first.
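
The backup advice can be folded into the same line-by-line logic. The following is only a sketch under stated assumptions: the function name Convert-HtmlLines, the -Converter parameter, and the ".bak" suffix are all hypothetical and not part of the code above. Passing the conversion as a script block makes it possible to try the logic with a stand-in converter before involving Pandoc.

```powershell
# Hypothetical helper (illustrative only): backs up the file, then converts
# every line that starts with "<" and passes all other lines through as-is.
function Convert-HtmlLines {
  param(
    [Parameter(Mandatory)] [string] $Path,
    # Script block that receives one HTML line and returns the Markdown line(s).
    # The default is the Pandoc call from the question; recent Pandoc releases
    # use --markdown-headings=atx in place of --atx-headers.
    [scriptblock] $Converter = {
      param($html)
      $html | pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers
    }
  )

  # Keep a copy of the input before overwriting it (the suffix is an arbitrary choice).
  Copy-Item -LiteralPath $Path -Destination "$Path.bak" -Force

  # Collecting the results in a variable reads the whole file before any writing starts.
  $result = foreach ($line in Get-Content -LiteralPath $Path) {
    if ($line -match '^<') { & $Converter $line } else { $line }
  }

  # Be sure to choose the desired encoding.
  Set-Content -LiteralPath $Path -Value $result -Encoding Utf8
}
```

With Pandoc on the PATH, calling Convert-HtmlLines -Path 'file.md' would rewrite file.md in place and leave the original content in file.md.bak. If the conversion is interrupted or produces unwanted output, the backup can simply be copied back.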

mklement0