Find all HTML files in a set of folders, extract specific HTML content and save content to new files

Question

I have a folder structure containing thousands of HTML files that I'd like to clean up and convert to Markdown using pandoc, while keeping the existing structure (or mirroring it).

So far I've located all the HTML files using find and passed their contents via cat to pup, which extracts the <article> element from each one and writes everything to a single new file called article-content.txt.

I was thinking of processing the content in two stages.

1. Extract the article tag from each file and save it as a new file (or overwrite the existing files).
2. Then convert the same structure with pandoc.

My understanding of bash is limited. I understand I probably need to loop through the file list and pass the paths and file names as variables to construct each new file, but I'm not sure where to go next.

cat $(find . -type f -name "*.html") | pup 'article' > article-content.txt

This question is too broad as it is. Try and boil it down to a single, concrete problem and let us know — oguz ismail, Oct 08 '19 at 15:57
I'm not sure I understand your goal. Do you, for each HTML file, want to extract the contents of the `<article>` tag, then convert that to markdown and store it in a new file? What would your flow look like for a single input file? – Benjamin W., Oct 08 '19 at 16:44
Thanks Benjamin. You’ve understood me correctly. Extract the article tag and contents and save a new file based on its original file name with the md extension. — Richard Saunders, Oct 08 '19 at 17:37

Jeff Y · Accepted Answer · 2019-10-08T19:09:37.873

If you want to perform a similar action on each file individually, find has the -exec and -execdir options built in for just that purpose (see man find):

find . -type f -name "*.html" -execdir bash -c "pup 'article' < {} > {}.md" \;
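
A more defensive variant of the same idea is sketched below. It assumes bash and a find that supports -execdir; `extract_articles` is a hypothetical helper name, not part of the original answer. Instead of pasting `{}` into the command string, the file name is handed to the inner shell as a positional parameter, so names containing spaces or quotes are not re-interpreted by the shell.

```shell
#!/usr/bin/env bash
# Sketch only: extract_articles is a hypothetical name chosen for illustration.
# Usage: extract_articles [ROOT]   (ROOT defaults to the current directory)
extract_articles() {
  local root="${1:-.}"
  # "_" fills $0 of the inner shell; find substitutes {} as ./name, which
  # arrives as "$1" and is quoted wherever it is used.
  find "$root" -type f -name '*.html' -execdir bash -c '
    pup article < "$1" > "$1.md"
  ' _ {} \;
}
```

Calling `extract_articles .` writes `page.html.md` next to each `page.html`, matching the naming of the command above.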

edited Oct 08 '19 at 19:09

answered Oct 08 '19 at 16:56

I got a parse error using this: `parse error near ")"`. I think it was missing quotes around the round brackets. When they were added it seemed to run, but it failed with an error like this for every file: `find: (pup 'article' < character-count.aspx.html > character-count.aspx.html.md): No such file or directory`. – Richard Saunders Oct 08 '19 at 18:14
Yeah, sorry about that. Per https://stackoverflow.com/questions/15030563/redirecting-stdout-with-find-exec-and-without-creating-new-shell it needs to be "wrapped" with `bash -c`. I have edited the answer. – Jeff Y Oct 08 '19 at 19:11

Thanks @jeff-y that worked. Just need to do some cleanup on the file name and I have got the second process working converting the HTML to Markdown using pandoc. – Richard Saunders Oct 09 '19 at 07:35
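
The file-name cleanup and pandoc step mentioned in this last comment were not posted. One way the two stages could be combined is sketched below, under several assumptions: `convert_tree` is a hypothetical helper name, the output goes to a separate mirror directory, and "cleanup" is taken to mean replacing the trailing `.html` with `.md`.

```shell
#!/usr/bin/env bash
# Sketch only: convert_tree, SRC and DEST are names chosen for illustration.
# Usage: convert_tree SRC DEST
convert_tree() {
  local src="${1%/}" dest="${2%/}" file rel out
  # -print0 with read -d '' keeps names containing spaces or newlines intact.
  find "$src" -type f -name '*.html' -print0 |
    while IFS= read -r -d '' file; do
      rel="${file#"$src"/}"          # path relative to SRC
      out="$dest/${rel%.html}.md"    # page.html -> page.md
      mkdir -p "$(dirname "$out")"   # recreate the folder structure under DEST
      # Stage 1: keep only <article>; stage 2: convert HTML to Markdown.
      pup article < "$file" | pandoc -f html -t markdown -o "$out"
    done
}
```

For example, `convert_tree . ../markdown` would leave the HTML files untouched and write `../markdown/sub/page.md` for each `./sub/page.html`.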
