I am in the process of moving a static HTML site to WordPress.

I am trying to figure out a way to pull specific HTML content from the files (title tags, description tags, <h1> tags, etc.). I have around 120 local files, and doing it all by hand would be a long process.

If I could get this data into a CSV, I could move the site quickly.

Does anyone have any advice or experience with this type of process? Any help would be greatly appreciated.

Billal Begueradj
Enigma
  • Load each HTML file into a browser (or similar) and then simply pull its elements and their content using DOM methods ... and I also voted to close this question as "searching for a tool", which is off topic here at SO – Asons Oct 13 '16 at 18:12
  • Thanks for trying to close my question when I'm just asking for direction, not for someone to do it for me. – Enigma Oct 13 '16 at 19:06
  • I gave a suggestion to help out, ... and this question is off topic here at SO, read our help center and you'll find it there – Asons Oct 13 '16 at 19:15
  • Write a script to parse the files? With HTML this can be a tricky thing. See also [Why not to parse HTML using RegEx](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) or [this](http://stackoverflow.com/a/590789/6590026). – Seth Oct 14 '16 at 08:57
  • I ended up having to do this by hand. I was able to do all the pages in one pass in sublime text using regex. Had to pull some tricks but in general, I've gotten it down decently well. – Enigma Oct 17 '16 at 18:55

1 Answer

The question is about extracting certain HTML elements from a set of HTML files. There are several ways to do this; here are some of them.

1) Use a script with an HTML parsing library. For Java, use jsoup.

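import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

String br = "<html><source>foo bar bar</source></html>";
// Parse with the XML parser so the <source> tag is kept exactly as written
Document doc = Jsoup.parse(br, "", Parser.xmlParser());

// Print the text of every <source> element
for (Element sentence : doc.getElementsByTag("source")) {
    System.out.println(sentence.text());
}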

This prints the text of every element with the tag source. Swap in the tag you need (title, h1, and so on). You can do the same in other languages, for example Python (with BeautifulSoup) or Node.js.

2) You can write a script that reads the HTML files as plain text and searches that text.

Move all your HTML files into one folder, and write a small program that loads each file and searches for the specific tags. Then save the results to a CSV or any other output you prefer.
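A minimal sketch of this second approach, using only the Java standard library. The class name HtmlToCsv, the folder name "html", the output file "pages.csv", and the three patterns are all assumptions made for illustration. The patterns are deliberately naive: they assume each tag appears once, and that the meta tag is written with name= directly before content=. Real pages may need adjustments, or a proper parser as in approach 1.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlToCsv {
    static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    // Naive patterns (assumptions): one tag of each kind, name= directly before content=.
    static final Pattern TITLE = Pattern.compile("<title[^>]*>(?<v>.*?)</title>", FLAGS);
    static final Pattern DESCRIPTION = Pattern.compile(
            "<meta\\s+name=[\"']description[\"']\\s+content=([\"'])(?<v>.*?)\\1", FLAGS);
    static final Pattern H1 = Pattern.compile("<h1[^>]*>(?<v>.*?)</h1>", FLAGS);

    // Returns the first match with inner tags stripped and whitespace collapsed,
    // or an empty string if the pattern does not match.
    static String first(Pattern p, String html) {
        Matcher m = p.matcher(html);
        if (!m.find()) {
            return "";
        }
        return m.group("v").replaceAll("<[^>]+>", "").replaceAll("\\s+", " ").trim();
    }

    // Quotes a value for CSV, doubling any embedded double quotes.
    static String csv(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Builds one CSV row: file name, title, description, first h1.
    static String row(String fileName, String html) {
        return String.join(",", csv(fileName), csv(first(TITLE, html)),
                csv(first(DESCRIPTION, html)), csv(first(H1, html)));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : "html");      // hypothetical input folder
        Path out = Paths.get(args.length > 1 ? args[1] : "pages.csv"); // hypothetical output file
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.html");
             PrintWriter writer = new PrintWriter(
                     Files.newBufferedWriter(out, StandardCharsets.UTF_8))) {
            writer.println("file,title,description,h1");
            for (Path file : files) {
                String html = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                writer.println(row(file.getFileName().toString(), html));
            }
        }
    }
}
```

Run it with the folder and output path as arguments, for example `java HtmlToCsv html pages.csv`, and spot-check a few rows against the source files before importing anything.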

3) You can do the same with grep.

Simply run a search over the files and redirect the results into a CSV file.

There are several other ways to do it. Since you mentioned that the manual workload is too high, try writing a small script to get the job done. I would start with the first approach, as it is faster and easier.

Keet Sugathadasa