I am trying to index the content of my site. Since there is some JavaScript inside the <body></body>, the JavaScript gets stored along with the content.

It actually gets everything in-between the <body></body>, but I use PHP's strip_tags to remove the HTML tags.

It removes the <script> tags, as they are HTML tags, but the javascript syntax remains.

How can I remove the javascript syntax?

Here is an example of the content with the JavaScript syntax in it:

"Watch Later Added to Private videos will be skipped if viewers don't have access, but playlist notes are publicly visible. Back to list Added to playlist: Private videos will be skipped if viewers don't have access, but playlist notes are publicly visible. Add an optional note150 Add note Saving note... Note added to: Error adding note: Click to add a new note if (window.ytcsi) {ytcsi.tick("js_head");} yt.pubsub.subscribe('init', yt.www.brandedpage.channels4init.overviewTabInit); yt.pubsub.subscribe('dispose', yt.www.brandedpage.channels4init.overviewTabDispose); yt.setAjaxToken('c4_shelves_ajax', "0qjmgZRNi5AAlV5LrkVIKyY1_VZ8MTM2ODkyNTgzM0AxMzY4ODM5NDMz");"

How can I get it so that it is just:

"Watch Later Added to Private videos will be skipped if viewers don't have access, but playlist notes are publicly visible. Back to list Added to playlist: Private videos will be skipped if viewers don't have access, but playlist notes are publicly visible. Add an optional note150 Add note Saving note... Note added to: Error adding note: Click to add a new note"

PitaJ
IMUXIxD
  • Use a HTML parser for that. It also has textContent. See http://stackoverflow.com/questions/12380919/php-dom-textcontent-vs-nodevalue – hakre May 19 '13 at 14:35
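
A minimal sketch of the parser-based approach suggested in the comment above, assuming PHP's DOM extension is available. The function name `html_to_text()` is hypothetical and not part of the original discussion; the idea is to drop the `<script>` (and `<style>`) elements from the parsed tree and then read `textContent`.

```php
<?php
// Hypothetical helper: returns the visible text of an HTML string.
function html_to_text(string $html): string
{
    if ($html === '') {
        return '';
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Remove <script> and <style> elements together with their contents.
    // The node list is live, so copy it to an array before removing nodes.
    foreach (['script', 'style'] as $tag) {
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    $text = $body !== null ? $body->textContent : $doc->textContent;

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$html = '<html><body><div>Watch Later <script>if (window.ytcsi) {ytcsi.tick("js_head");}</script>Add note</div></body></html>';
$clean = html_to_text($html);
```

One caveat with this sketch: `loadHTML()` may misread UTF-8 input unless the markup declares its character set, so non-ASCII pages can need extra handling.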

1 Answer

You can first remove the script tags together with their content, and then run strip_tags on the result.

Removing the script tags can be done in many ways; one of them is a regular expression:

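// The U flag is important here: it makes .* ungreedy, so each match ends at the first </script>.
// The s flag lets . match newlines, so scripts spanning several lines are removed too.
$pattern = '/<script\b.*<\/script>/isU';
$text = preg_replace($pattern, '', $text);
$text = strip_tags($text);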
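
As a hedged illustration, the same idea can be wrapped in a small function. This is a sketch rather than the answer's exact code: the name `strip_scripts_and_tags()` is hypothetical, and the pattern uses a lazy quantifier in place of the U flag, plus the s modifier so that scripts spanning several lines are matched.

```php
<?php
// Hypothetical helper: removes <script> blocks, then the remaining tags.
function strip_scripts_and_tags(string $html): string
{
    // i   : case-insensitive, so <SCRIPT> also matches
    // s   : "." matches newlines, so multi-line scripts are covered
    // .*? : lazy, so each match stops at the first closing tag
    $pattern = '/<script\b[^>]*>.*?<\/script\s*>/is';
    $withoutScripts = preg_replace($pattern, '', $html);

    // preg_replace() returns null on failure; fall back to the input.
    if ($withoutScripts === null) {
        $withoutScripts = $html;
    }

    return strip_tags($withoutScripts);
}

$html = "Click to add a new note <script>\nif (window.ytcsi) {ytcsi.tick(\"js_head\");}\n</script>";
$clean = strip_scripts_and_tags($html);
```

Note that a pattern like this only looks at the text; it does not understand HTML, so unusual markup can still slip through.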

Another way, without regex but less elegant:

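// Cut out each <script ...>...</script> block, then strip the remaining tags.
$close = "</script>";
while (($pos = stripos($text, "<script")) !== false) {
    // Search for the closing tag only after the opening one.
    $end_pos = stripos($text, $close, $pos);
    if ($end_pos === false) {
        // Unclosed <script>: drop everything from the opening tag onward.
        $text = substr($text, 0, $pos);
        break;
    }
    $text = substr($text, 0, $pos) . substr($text, $end_pos + strlen($close));
}
$text = strip_tags($text);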
Yaron U.
  • Is there a way to do it without regex? It is just that I am not using regex at all and would prefer not to. – IMUXIxD May 18 '13 at 01:35
  • yes, but it is way less elegant. you can achieve the same result by finding all the occurrences of ` – Yaron U. May 18 '13 at 01:39
  • That seems very ugly. Okay, thank you. – IMUXIxD May 18 '13 at 01:43
  • One more question. Why can't one just do `$text = preg_replace('/\<script.*\<\/script\>/iU', '', $text);` ? – IMUXIxD May 18 '13 at 01:49
  • I've never said it can't be done - I just separated the pattern for more clarity in the code; you may do it if you wish. You can also do `$text = strip_tags(preg_replace('/\<script.*\<\/script\>/iU', '', $text));` and save the second line as well – Yaron U. May 18 '13 at 01:50
  • Ah, okay. Also, I am trying the regex in http://http://www.debuggex.com/ and it doesn't seem to be working. I have `/\<script.*\<\/script\>/iU` as the regex and `Hello how are you I am doing good. This is random content` as the string. – IMUXIxD May 18 '13 at 01:52
  • @IMUXIxD I've tried it both in http://regex.larsolavtorvik.com/ and in my code, and the result was `Hello how are you I am doing good. This is random content`, check again if you have any mistakes (the site you provided doesn't seem to be valid - make sure it runs the test on PHP PCRE regex engine - other engine may perform differently) – Yaron U. May 18 '13 at 01:54
  • @Yaron the problem with your regexp is input like this: `text1";text2`. It will give `text1";text2` instead of `text1text2`. This probably is what OP's input looks like. You _can not_ do this with a regexp. – Halcyon May 18 '13 at 12:43
  • @FritsvanCampen I ended up using the non regex solution, even though it is ugly, it works for me. – IMUXIxD May 18 '13 at 20:57