Can I use grep and regex to find certain content in a file and write the content into a new file?

Question

I want to grep some content from an HTML file using a regex and write that content into a new HTML file. The example HTML is below:

<html>
<script src='.....'>
</script>
<style>
...
</style>
<div class='header-outer'>
<div class='header-title'>
<div class='post-content'>
<noscript>
<p>content we want</p>
</noscript>
</div>
</div></div>
<div class='footer'>
</div>
</html>

Can I use grep to select the content between <div class='post-content'> and </div> and write it into a new HTML file? The new file would look like this:

<div class='post-content'>
<noscript>
<p>content we want</p>
</noscript>
</div>

I did some research on Stack Overflow and found some code that might be helpful for my issue, like

grep -L -Z -r "<div class='post-content'>.*?<\/noscript><\/dive>" .| xargs -0 -I{} mv {} DIR?

Is it correct? If it is, what does the xargs part mean? Thank you and I'm looking forward to your reply!

Hi Cyrus, I tried yours, but somehow it didn't work for me. Thanks though! — Penny, Feb 09 '16 at 19:53

score 1 · Accepted Answer · answered Feb 09 '16 at 19:06

You can use GNU sed:

sed -n "/<div class='post-content'>/,/<\/div>/p" file.html > output.html

-n suppresses sed's default printing of every line
p prints only the lines in the matched range
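
A small sketch of how the range address behaves, assuming GNU sed and a hypothetical file name sample.html that mirrors the question's HTML. The range starts at the first line matching the opening tag and ends at the first following line that contains </div>:

```shell
# Build a sample file (sample.html is a hypothetical name) similar to the question's HTML.
cat > sample.html <<'EOF'
<html>
<div class='header-outer'>
<div class='post-content'>
<noscript>
<p>content we want</p>
</noscript>
</div>
</div></div>
<div class='footer'>
</div>
</html>
EOF

# -n: print nothing by default; p: print only the lines inside the address range.
sed -n "/<div class='post-content'>/,/<\/div>/p" sample.html > output.html

# Show the extracted block.
cat output.html
```

Because sed works line by line, the range closes at the first line containing </div>, so nested div elements inside the block would cut the extract short.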

answered Feb 09 '16 at 19:06

josifoski


Thank you very much! It works perfectly! Now I'm thinking about how to adapt this piece of code to run through a folder that has, for example, 10 html files, to extract the needed content and generate new htmls...What other code could I add to this? Thanks! :) – Penny Feb 09 '16 at 19:44
The easy way is to use wildcards for the input files, for example *.html instead of file.html, and to append all the output to one file. To create separate output files you will need a somewhat more complex bash script, or Python, Ruby or a similar language – josifoski Feb 09 '16 at 19:49
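
The "separate output files" case mentioned above might be handled with a short shell loop. This is only a sketch: the directory names pages and extracted and the two sample files are hypothetical, and it reuses the sed range from the answer:

```shell
# Hypothetical layout: inputs in ./pages, one extracted file per input in ./extracted.
mkdir -p pages extracted

# Two tiny sample inputs so the loop has something to process.
printf '%s\n' "<html>" "<div class='post-content'>" "<p>first</p>" "</div>" "</html>" > pages/a.html
printf '%s\n' "<html>" "<div class='post-content'>" "<p>second</p>" "</div>" "</html>" > pages/b.html

for f in pages/*.html; do
    [ -e "$f" ] || continue                 # skip the literal pattern if nothing matched
    out="extracted/$(basename "$f")"        # keep the original file name
    sed -n "/<div class='post-content'>/,/<\/div>/p" "$f" > "$out"
done

ls extracted
```

Writing to a separate directory leaves the originals untouched, so no backup step is needed.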
If you decide to append everything to one output file, the command could be sed -n "/<div class='post-content'>/,/<\/div>/p" *.html >> output.html – josifoski Feb 09 '16 at 19:50
Or this may be the easiest idea: create a backup folder of the files, then use in-place editing with -i, so that a one-liner rewrites all the files and their names are preserved: sed -i -n "/<div class='post-content'>/,/<\/div>/p" *.html (after the backup folder is created!) – josifoski Feb 09 '16 at 19:53
Hi Josifoski, thanks for your great answers! I found that by using your code, the original file gets amended too. So is it possible to just make changes to all the original htmls without writing to a new one...That way I don't need to worry about the multiple output.htmls...I tried `sed -n "/<div class='post-content'>/,/<\/div>/p" file.html` but it doesn't seem to work...Where did I go wrong? – Penny Feb 09 '16 at 19:58
Yes, it's possible. That is the easiest idea I meant. Back up first, then add -i to the sed command, like sed -i -n "/<div class='post-content'>/,/<\/div>/p" *.html – josifoski Feb 09 '16 at 20:00
Well Penny it seems you've omitted *.html – josifoski Feb 09 '16 at 20:03
Hey Josifoski, I tried the code `sed -i -n "/<div class='post-content'>/,/<\/div>/p" *.html`, but it looks like it doesn't change the original html as expected. Instead, a new html-n file was created, which looks the same as the original html without changes...Do you know why? – Penny Feb 09 '16 at 20:09
That command will not create new htmls but will edit the original ones in place. Probably you have the "show hidden files" option ticked. Also, it is best to use the GNU version of sed. You can check it with sed --version – josifoski Feb 09 '16 at 20:11
I don't know how non-GNU seds behave in this case; maybe you are using some other sed. Are you on Linux, Mac or something else? – josifoski Feb 09 '16 at 20:13
:) That was the problem. The best recommendation is to install GNU sed; as a second choice, you can take a look here http://stackoverflow.com/questions/5694228/sed-in-place-flag-that-works-both-on-mac-bsd-and-linux – josifoski Feb 09 '16 at 20:22
Try sed -i.bak -n "/<div class='post-content'>/,/<\/div>/p" *.html without installing GNU sed. I'm not sure it will work for many files, but maybe – josifoski Feb 09 '16 at 20:25
Hi Josifoski, yep it doesn't seem to work. Thanks for your answer. I'm wondering if I could install GNU sed on my osx? – Penny Feb 09 '16 at 20:28
Yes! i think easiest way will be: brew install gnu-sed --with-default-names "as stated here" http://stackoverflow.com/a/27834828/2397101 – josifoski Feb 09 '16 at 20:30
Thanks! Sorry for keeping bothering you...I installed the gnu version, but I seem to continue using the old version. I tried to check the version of my sed (sed --version), but the terminal keeps telling me that I used illegal --. I also tried other ways to check the version, but in vain... – Penny Feb 09 '16 at 20:44
:) no problem. here http://stackoverflow.com/questions/30003570/how-to-use-gnu-sed-on-mac-os-x – josifoski Feb 09 '16 at 20:48
use gsed instead of sed now – josifoski Feb 09 '16 at 20:49
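
A hedged sketch of how a script might choose between sed and gsed automatically. It relies only on what the thread reports: the stock macOS sed rejects --version, and the Homebrew install provides GNU sed as gsed. The variable name SED is hypothetical:

```shell
# Pick a sed that accepts --version (GNU sed does); SED is a hypothetical variable name.
if sed --version >/dev/null 2>&1; then
    SED=sed        # the default sed already understands --version
elif command -v gsed >/dev/null 2>&1; then
    SED=gsed       # GNU sed installed under the name gsed
else
    SED=""
    echo "No GNU sed found; install it or adapt the -i syntax" >&2
fi

echo "sed command: ${SED:-none}"
```

Later commands can then be written as "$SED" -i -n ... so the same script runs on both systems.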
Thank you! I moved the /usr/local/bin path ahead, so now it works! Hope this is my last question: can I write a regex with the sed command like `sed -i -n "/<div class='post-content'>/,/<\/noscript><\/div>/p" *.html` or `sed -i -n "/<div class='post-content'>/,/<\/noscript.*?\/div>/p" *.html`? It doesn't seem to work for me haha – Penny Feb 09 '16 at 21:12
There is no need to add the previous line to make sure the right one is found, since sed ends the range at the first </div>. sed works line by line. If you still want the complication, the command can be modified to include it, but why? – josifoski Feb 09 '16 at 21:17
Because in my case, I don't want sed to stop searching at the first </div>; I need it to stop at the </div> after the </noscript> tag. The example I wrote in this post is just a sample. The real file that I'm dealing with is more complicated :) – Penny Feb 09 '16 at 21:25
sed is not the right tool for that requirement -> Python or similar – josifoski Feb 09 '16 at 21:38
I'll do more research about it. But for now, I'll mark down the one that you wrote. It is very helpful thanks Josifoski! – Penny Feb 09 '16 at 21:46
wait, we can do something... give me 2-3min :) – josifoski Feb 09 '16 at 21:50
It's difficult with sed. There is another tool called awk. I think that if you repost this question with the additional requirements (a bunch of files, and the range ending on the two closing-tag lines) and tag it with awk, you will receive a complete answer in less than 10 minutes – josifoski Feb 09 '16 at 22:04
Thank you! Will try it out! – Penny Feb 09 '16 at 22:10
here: sed -i -e ':a; N; $! ba' -e "s/^.*\(<div class='post-content'>.*<\/noscript> *\n *<\/div>\).*$/\1/g" *.html – josifoski Feb 09 '16 at 22:21
note that </noscript> and </div> must be on separate lines; \n is for the newline – josifoski Feb 09 '16 at 22:28
Thank you so much!! It works great! Although I can't understand what's in there. but it works great!! Is it possible to explain it a little bit about what it means? – Penny Feb 09 '16 at 23:11
Glad to help. ':a; N; $! ba' reads the whole file into the pattern space so that \n can be matched. In other words, we treat the file as one string (sed's default behaviour is to process one line at a time). In <\/noscript> *\n *, each " *" matches zero or more spaces, to cope with lines that are not neatly trimmed. \1 is the replacement: everything matched between \( and \). If you'd like to understand sed better: http://www.tutorialspoint.com/sed/index.htm – josifoski Feb 10 '16 at 05:51
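
A minimal sketch of the whole-file approach explained above, assuming GNU sed, and assuming that the parts of the command lost in the thread were the opening tag <div class='post-content'> and the group markers \( and \). The file names slurp.html and slurp_out.html are hypothetical, and the output goes to a new file instead of using -i:

```shell
# Sample input with a nested div, so a plain line range would stop too early.
cat > slurp.html <<'EOF'
<html>
<div class='post-content'>
<div class='inner'>
</div>
<noscript>
<p>content we want</p>
</noscript>
</div>
<div class='footer'>
</div>
</html>
EOF

# First -e: loop (label a) appending every line to the pattern space until the last line.
# Second -e: keep only the group from the opening tag up to "</noscript>" followed,
# on the next line, by "</div>"; everything before and after the group is dropped.
sed -e ':a; N; $! ba' \
    -e "s/^.*\(<div class='post-content'>.*<\/noscript> *\n *<\/div>\).*$/\1/g" \
    slurp.html > slurp_out.html

cat slurp_out.html
```

Here the extract runs past the nested </div> and ends at the </div> that follows </noscript>, which is the behaviour Penny asked for.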
