What is the terminal command to extract text from a file?

Question

Can anyone tell me the terminal command to extract text from an HTML file using tags like <li>, <strong>, <b>, <title>, <td>, etc., as well as variable assignments like $var="strings" and JavaScript functions using msgstring?

-> I am thinking of putting these tags in a text file.

-> Then I want to match the tags against the file with a terminal command.

-> Then I have to write the matches to a dump file (text).

I need this because I want to change the text according to language preference.

I tried an awk script and egrep too, but got poor results.

http://stackoverflow.com/editing-help – deceze Dec 23 '10 at 08:51

score 4 · Answer 1 · answered Dec 31 '20 at 17:23

This is exactly what pandoc is for.

pandoc filename.html -f html -t plain -o filename.txt

As a bonus, resulting plain text is beautifully formatted.

See Pandoc Manual.

answered Dec 31 '20 at 17:23

Marek Kowalczyk


score 2 · Answer 2 · edited May 23 '17 at 09:58

Doing this with awk and egrep would probably mean using regular expressions to parse HTML. This is a bad idea. See this famous answer.

Rather, use an HTML parser. See other answers in the link above for links to HTML parsers.

As to parsing PHP source code:

As it is structurally similar to HTML, you might be able to use a (tolerant) HTML parser. Otherwise, use a PHP parser. See e.g. this answer.

edited May 23 '17 at 09:58

Community


answered Dec 23 '10 at 08:56

sleske


I need to parse not only HTML but PHP too. I think egrep will work, but I can't work out the exact pattern. – Priya Dec 23 '10 at 08:58
@codaddict: Did you try scrolling the page? :-) – sleske Dec 23 '10 at 09:00

score 1 · Answer 3 · answered Dec 23 '10 at 09:09

Use a regex like this:

perl -ne 'print "$1\n" while /<strong>(.*?)<\/strong>/g' file

Of course, your regex will be more complex, I guess.
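
To cover several tags at once instead of writing one regex per tag, the tag names could be read from a file and joined into a single alternation. This is only a minimal sketch: `tags.txt` (one tag name per line), `page.html`, and the scratch directory are hypothetical names made up for the illustration, and like any regex approach it only catches an element that opens and closes on the same line.

```shell
# Hypothetical sample files in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' li strong b title td > "$workdir/tags.txt"
cat > "$workdir/page.html" <<'EOF'
<title>Welcome</title>
<li>First item</li><li>Second item</li>
<body><td align="left">Cell</td></body>
<p>Skipped</p>
EOF

# Join the tag names into one alternation: li|strong|b|title|td
tags=$(paste -sd'|' "$workdir/tags.txt")

# BEGIN takes the alternation off the argument list; the remaining argument
# is the file to read. \b stops <b from matching <body, and \1 requires the
# closing tag to have the same name as the opening one.
result=$(perl -ne 'BEGIN { $t = shift } print "$2\n" while /<($t)\b[^>]*>(.*?)<\/\1>/gi' "$tags" "$workdir/page.html")
echo "$result"
```

Nested elements (for example a `<b>` inside an `<li>`) come out as part of the outer element's text with the inner tags still in place, which is one reason a real HTML parser is the safer choice.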

answered Dec 23 '10 at 09:09

sw0x2A


Done like this, it will take a lot of time: I would have to enter each tag one at a time and then parse for it. It would be better to put all the tags in a text file and match that file against the file I need to parse. – Priya Dec 23 '10 at 11:54

Priya · Answer 4 · 2021-01-01T13:43:30.753

-1

Here is my answer.

egrep -i -r -f myfile.txt [path] > dumpdata.txt

It's working, but I still had to do more parsing to clean out all the JavaScript functions and the PHP variable values containing strings.
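
For illustration, here is a sketch of what such a pattern file might look like and how the command behaves on a small tree. The patterns, the sample files, and the scratch directory are all assumptions made up for this example, not the files from the original run. `grep -E` is used because it is equivalent to `egrep`, and `-h` is added only to keep file names out of the dump.

```shell
# Hypothetical sample tree in a scratch directory.
workdir=$(mktemp -d)
mkdir "$workdir/site"
cat > "$workdir/site/index.html" <<'EOF'
<html><head><title>Welcome</title></head>
<body>
<ul>
<li>First item</li>
</ul>
<p>Not matched</p>
<b>Bold text</b>
</body></html>
EOF
cat > "$workdir/site/msg.php" <<'EOF'
<?php $greeting = "Hello"; ?>
EOF

# One extended regular expression per line; a line is kept if it matches any.
# Avoid blank lines in this file: an empty pattern matches every line.
cat > "$workdir/myfile.txt" <<'EOF'
<(li|strong|b|title|td)[ >]
\$[a-z_][a-z0-9_]*[[:space:]]*=[[:space:]]*"
EOF

# -E extended regex, -i ignore case, -r recurse into the directory,
# -h omit file names, -f read the patterns from a file.
grep -E -i -r -h -f "$workdir/myfile.txt" "$workdir/site" > "$workdir/dumpdata.txt"
cat "$workdir/dumpdata.txt"
```

The dump holds whole matching lines with the markup still in them, so a second pass is still needed to strip the tags and isolate the strings.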

edited Jan 01 '21 at 13:43

answered Dec 23 '10 at 12:02

Priya

