1

Hey, can anyone tell me which terminal command to use to extract text from an HTML file based on tags like <li>, <strong>, <b>, <title>, <td>, etc., as well as variable assignments like $var="strings" and JavaScript functions that use msgstring?

-> I am thinking of putting these tags in a text file.

-> Then I want to match the tags using a terminal command.

-> Then I have to put the matches into a (text) dump file.

I need this because I want to change the text according to language preference.

I tried an awk script and egrep too, but I got poor results.

Priya

4 Answers

4

This is exactly what pandoc is for.

pandoc filename.html -f html -t plain -o filename.txt

As a bonus, the resulting plain text is beautifully formatted.

See Pandoc Manual.

2

Doing this with awk and egrep would probably mean using regular expressions to parse HTML. This is a bad idea. See this famous answer.

Rather, use an HTML parser. See other answers in the link above for links to HTML parsers.
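As a rough illustration of the parser approach, here is a minimal sketch using Python's standard-library html.parser. The tag list, the class name TagTextExtractor and the helper extract_text are hypothetical names chosen for this example; in practice the tag list could be read from the text file mentioned in the question. It assumes the wanted tags are properly closed, so an unclosed <li> would swallow the text that follows it.

```python
from html.parser import HTMLParser

# Hypothetical tag list; the asker planned to keep these in a text file.
WANTED_TAGS = {"li", "strong", "b", "title", "td"}


class TagTextExtractor(HTMLParser):
    """Collect the text found inside any of the wanted tags."""

    def __init__(self, wanted):
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted)
        self.depth = 0      # number of wanted tags currently open
        self.chunks = []    # text pieces of the current outermost wanted element
        self.results = []   # one entry per outermost wanted element

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Join the pieces and collapse runs of whitespace.
                text = " ".join("".join(self.chunks).split())
                if text:
                    self.results.append(text)
                self.chunks = []

    def handle_data(self, data):
        # Only keep text that sits inside a wanted element.
        if self.depth > 0:
            self.chunks.append(data)


def extract_text(html, wanted=WANTED_TAGS):
    """Return the text of every outermost wanted element, in document order."""
    parser = TagTextExtractor(wanted)
    parser.feed(html)
    parser.close()
    return parser.results
```

Calling extract_text on the contents of an HTML file and writing one result per line would give the kind of dump file the question describes. Nested wanted tags (for example a <b> inside an <li>) are folded into the text of the outer element rather than reported twice.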

As to parsing PHP source code:

As it is structurally similar to HTML, you might be able to use a (tolerant) HTML parser. Otherwise, use a PHP parser. See e.g. this answer.

sleske
1

Use a regex like this:

perl -nle 'print $1 while /<strong>(.*?)<\/strong>/g' file

Of course, your regex will be more complex, I guess.

sw0x2A
Doing it like this will take a lot of time: I would have to enter each tag one at a time and then parse it. It would be better to put all the tags in a text file and match that file against the file I need to parse. – Priya Dec 23 '10 at 11:54
-1

Here is my answer.

egrep -i -r -f myfile.txt [path] > dumpdata.txt

It's working, but I still had to do more parsing to clean out the JavaScript functions and the PHP variable values that contain strings.
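For the remaining step of pulling out variable values that contain strings, a rough sketch in Python might look like the following. It is a regex-based approximation, not a real PHP parser: the pattern and the helper name php_string_assignments are assumptions made for this example, and it only handles simple assignments of the form $var="strings" or $var='strings', so string concatenation, heredocs and assignments inside comments are not handled correctly.

```python
import re

# Rough pattern (an assumption for this sketch): $name = "text" or $name = 'text'.
# Group 1 is the variable name, group 2 the quote character, group 3 the string body.
# Backslash-escaped characters inside the string are allowed.
PHP_STRING_ASSIGNMENT = re.compile(
    r"""\$(\w+)\s*=\s*(["'])((?:\\.|(?!\2).)*)\2""",
    re.DOTALL,
)


def php_string_assignments(source):
    """Return (variable_name, string_value) pairs found in the source text."""
    return [
        (match.group(1), match.group(3))
        for match in PHP_STRING_ASSIGNMENT.finditer(source)
    ]
```

Each pair could then be written to the dump file next to the text extracted from the tags. For anything beyond simple assignments, a real PHP parser, as suggested in another answer, is likely to be more reliable.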

Priya