I have a folder structure containing thousands of HTML files that I'd like to clean up and convert to Markdown using pandoc, while keeping the existing folder structure (or mirroring it elsewhere).
So far I've managed to locate all the HTML files using find and pass their contents with cat to pup, which parses the HTML, selects the <article> tag, and writes the result to a single new file called article-content.txt.
I was thinking of processing the content in two stages.
- Extract the article tag from each file and save it as a new file (or overwrite the existing file).
- Then convert the same structure with pandoc.
My understanding of bash is limited. I understand I probably need to loop through the file list and use each path and filename as a variable to construct the corresponding output file, but I'm not sure where to go next.
cat $(find . -type f -name "*.html") | pup 'article' > article-content.txt
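For illustration, the two stages could be combined into one loop along the following lines. This is only a sketch under some assumptions: bash is the shell, pup and pandoc are both on the PATH, and the output should be .md files in a separate mirrored tree. The function name convert_tree and its two directory arguments are made up for this example, and it pipes pup straight into pandoc instead of writing an intermediate file.

```shell
#!/usr/bin/env bash
# Sketch: mirror SRC into DEST, converting the <article> of each .html file to Markdown.
# convert_tree, src and dest are hypothetical names used only for illustration.

convert_tree() {
  local src=${1%/} dest=${2%/} file rel out   # strip a trailing slash, if any
  # -print0 with read -d '' keeps filenames containing spaces or newlines intact.
  while IFS= read -r -d '' file; do
    rel=${file#"$src"/}               # path relative to the source root
    out=$dest/${rel%.html}.md         # same relative path, with a .md extension
    mkdir -p "$(dirname "$out")"      # recreate the subfolder in the mirror
    # Stage 1: pup extracts <article>; stage 2: pandoc converts that HTML to Markdown.
    pup 'article' < "$file" | pandoc -f html -t markdown -o "$out"
  done < <(find "$src" -type f -name '*.html' -print0)
}
```

It would be called as, for example, convert_tree . ../markdown, which leaves the original HTML files untouched and writes one .md file per HTML file under ../markdown.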