
I have a file, input.txt, that contains a lot of weird characters, HTML tags, and useful material. I want to write the 35 characters that follow the word description to a new file, output.txt, excluding weird characters like $$#$#@$#@***$# and excluding HTML tags. Thanks in advance.

My final goal is to find the word description and print the 35 characters after it, not counting HTML tags or weird characters. Is that possible? For example, given:

<description>&lt;p&gt;&lt;img class="float_right"
 src="http://static3.businessinsider.com/image/502ab0036bb3f7147b00000f-400-300/dnu.jpg"
 border="0" alt="dnu" width="400" height="300" /&gt;&lt;/p&gt;&lt;p&gt;The lawn
 was filled with &lt;a class="hidden_link"
 href="http://www.businessinsider.com/blackboard/goldman-sachs"&gt;Goldman
 Sachs&lt;/a&gt; Group Inc. partners dressed in pink looking out on a pink sunset.

I want to start from "The lawn was filled with", skip the tags again, continue with "Group Inc. partners" until 35 characters are reached, then stop and search for the next description.

Sicco
Joyfulgrind
  • Could you give a (large) example of your data? That lets us easily see whether, for example, `description` can occur multiple times on one line. Also, do you want the first 35 characters with the 'weird' characters removed (thus potentially returning fewer than 35 characters), or the first 35 non-weird characters? – Sicco Aug 15 '12 at 09:40
  • @Sicco: Here's a sample data: title>Here's How A Hamptons Fundraiser Workshttp://feedproxy.google.com/~r/businessinsider/~3/Sn8nE0mWfNo/how-a-hamptons-fundraiser-works-2012-8<p><img class="float_right" src="http://static3.businessinsider.com/image/502ab0036bb3f7147b00000f-400-300/dnu.jpg" border="0" alt="dnu" width="400" height="300" /></p> – Joyfulgrind Aug 15 '12 at 09:51
  • OK; so your question is quite different now. You have an HTML file and want to retrieve some data from the HTML. Which attribute of the `description` tag do you want to extract? Ideally you would do this using XPath or XQuery, which is the best way to retrieve data from structured HTML/XML documents. – Sicco Aug 15 '12 at 10:01
  • Yes, it's HTML source; the file extension is .txt. I don't care which attributes description carries, but the extracted characters shouldn't include those tags or weird characters: 1) search for the word description, 2) extract the 35 characters after description, excluding the weird characters and HTML tags, 3) redirect the output to output.txt. Please help me! – Joyfulgrind Aug 15 '12 at 10:07
  • Those 'weird' characters are [HTML entities](http://www.w3schools.com/html/html_entities.asp). Once interpreted, it reads: `

    – Sicco Aug 15 '12 at 10:13
  • I have updated the question again. Please have a look at it. – Joyfulgrind Aug 15 '12 at 10:45

1 Answer

You can select all the text within an HTML node using XPath. In your case this should work:

xpath -q -e '//description//text()' input.txt

The query //description//text() works as follows:

  • //description: select every node named description, at any depth in the document
  • //text(): within each of those nodes, select all text nodes, at any depth

Given your data this outputs:

The lawn was filled with 
Goldman Sachs
 Group Inc. partners dressed in pink looking out on a pink sunset.
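
If the `xpath` wrapper is not at hand, a similar query can probably be run with `xmllint` (shipped in Debian's libxml2-utils package). The sketch below is only an illustration: the helper name `description_text` is made up, and it assumes the file is well-formed XML, which a feed full of stray characters may not be. It uses `string(//description)`, which returns the concatenated text of the first description node only.

```sh
#!/bin/sh
# description_text FILE
# Hypothetical helper: print the string value of the first <description>
# element in FILE, i.e. all of its text nodes concatenated.
# Assumes xmllint is installed and FILE is well-formed XML.
description_text() {
    xmllint --xpath 'string(//description)' "$1"
}
```

Usage would be `description_text input.txt > output.txt`. Note that when the description holds entity-escaped markup (`&lt;p&gt;...`), as in the question's sample, the XML parser decodes the entities, so the result is the markup itself as plain text and the tags still have to be stripped afterwards.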
Sicco
  • I have edited the question and added the sample. And finally, to redirect the output to output.txt, `>> output.txt` would work, right? Thank you! – Joyfulgrind Aug 15 '12 at 09:54
  • Which Linux distribution are you using? – Sicco Aug 15 '12 at 14:45
  • Debian, and I can't seem to install it! – Joyfulgrind Aug 16 '12 at 02:06
  • I would like to get the result using the sed and grep commands. You mentioned such a command before, but then you edited the answer to use xpath. Can you please give me yesterday's command? Thank you! – Joyfulgrind Aug 16 '12 at 04:51
  • I still strongly recommend using XPath as this is both the easiest and optimal way to parse HTML (your question to use regex in the form of `grep`/`sed` is [not the first](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) on SO). You can [easily install the XPath command](http://ftp.de.debian.org/debian/pool/main/libx/libxml-xpath-perl/libxml-xpath-perl_1.13-7_all.deb) (which in itself is a simple `Perl` wrapper) and then use the above code. – Sicco Aug 16 '12 at 09:03
  • Thank you! Also, can you give me yesterday's sed and grep command? Please, I will use it as a reference in other scripts. – Joyfulgrind Aug 16 '12 at 09:12
  • Stack Overflow is like a Wiki, so all edits are saved. Click on the `edited – Sicco Aug 16 '12 at 09:20
  • Also, if you think my answer solved your question, you can accept it by ticking the checkmark to the left of my answer. – Sicco Aug 16 '12 at 09:20
  • Hey, I installed xpath and ran the command, but the output is the same as my script file. I also ran the previous command, and the output is again the same as my script file. – Joyfulgrind Aug 16 '12 at 09:33
  • Could you put your whole HTML into a code block in your post or on pastebin.com, so I can test it on all your data. – Sicco Aug 16 '12 at 09:41
  • Finally this command worked!! `sed -r 's/.*description>//g; s/<[^&]*>//g; s/(.{35}).*/\1/ ' input.txt` Thank you for your effort. – Joyfulgrind Aug 16 '12 at 10:28
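
For reference, the three substitutions in that last `sed` command read as: drop everything up to `description>`, delete tags, and keep the first 35 characters of what remains. It works line by line and only removes literal `<...>` tags. The following is a rough sketch of the same idea that also copes with entity-encoded tags (`&lt;p&gt;`) and with descriptions wrapped over several lines. The function name `extract_descriptions` is hypothetical, and the sketch assumes the opening tag is written exactly as `<description>` with no attributes; it is not a substitute for a real parser.

```sh
#!/bin/sh
# extract_descriptions FILE
# Hypothetical helper: print up to 35 visible characters from each
# <description> element in FILE, one result per line.
extract_descriptions() {
    # 1. Join wrapped lines so each element can be handled as one string.
    tr '\n' ' ' < "$1" |
    # 2. Start a new line in front of every opening <description> tag.
    awk '{ gsub(/<description>/, "\n&"); print }' |
    # 3. Keep only the lines that begin with a description.
    grep '^<description>' |
    # 4. In order: cut at the closing tag, drop the opening tag, decode
    #    &lt; and &gt;, strip tags, decode &quot; and &amp;, squeeze
    #    spaces, and finally keep at most the first 35 characters.
    sed -E \
        -e 's/<\/description>.*//' \
        -e 's/^<description>//' \
        -e 's/&lt;/</g; s/&gt;/>/g' \
        -e 's/<[^>]*>//g' \
        -e 's/&quot;/"/g; s/&amp;/\&/g' \
        -e 's/ +/ /g; s/^ //' \
        -e 's/^(.{35}).*/\1/'
}
```

It could be called as `extract_descriptions input.txt > output.txt`. Use `>` to overwrite output.txt on each run, or `>>` to append to it. Descriptions shorter than 35 characters are printed in full.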