I'm running the following steps in an attempt to clean up a string obtained with $query_text_lower = file_get_contents(websiteURL).

What I want is to return just the words: no JavaScript, no random numbers, no CSS, and no other kinds of scripts.

//remove javascript
$query_text_lower = preg_replace("/<script[^>]*>.*?< *script[^>]*>/i", "", $new_text); 

//remove html tags
$query_text_lower2 = strip_tags($query_text_lower);

//removes any text containing links (may not be best, as some sites link useful words within the text; it does tend to remove a lot of ads though)
$query_text_lower3 = preg_replace('/<a\s.*?>.*?<\/a>/s', '', $query_text_lower2);

//removes linebreaks
$query_text_lower4 = trim(preg_replace('/\s+/', ' ', $query_text_lower3));

echo $query_text_lower4;
die();

Here is an example of what I am outputting at the moment:

developing a cafe: 13 steps - wikihow /**/ /**/ messages log in log in via log in remember me forgot? create an account explore community dashboardrandom articleabout uscategoriesrecent changes help us communication an articlerequest a new articleanswer a requestmore ideas... edit edit this article home » categories » investment and business » business » buying & forming a business » hospitality businesses articleeditdiscuss wh.mergelang({ 'navlist_collapse': '- collapse','navlist_expand': '+ expand','usernameoremail': 'username or email','password': 'password' }); edit articlehow to start a cafe edited by harri, maluniu, annie, afc8871 and 1 other if you have always dreamt of management a business, then learning how to start a cafe may be the answer. with the right planning beforehand, opening a cafe can become highly profitable. your cafe can easily become a place where staff come to relax, enjoy schedule with friends or family, grab a quick bite to eat, or to work on their latest project. start a cafe business by following the steps below. ad google_ad_customer = "ca-pub-9543332082073187"; /* iframe unit - intro */ if(abtype == 2) google_ad_slot = '6354743772'; else //a or normal google_ad_slot = '8579663774'; if(abtype == 2 || abtype == 3 || abtype == 4 || abtype == 5 || abtype == 6) { google_ad_width = 671; google_ad_height = 120; google_max_num_ads = 2; } else if(abtype >= 7) { google_ad_width = 645; google_ad_height = 60; google_max_num_ads = 1; } else { google_ad_width = 671; google_ad_height = 60; google_max_num_ads = 1; } google_ad_results = 'html'; google_override_format = true; google_ad_channel = "0206790666+7733764704+1640266093+6709519645+8052511407+6822404019+7122150828" + gchans + xchannels; if( fromsearch ) { document.communication(''); } //--> edit steps 1communication your business and marketing plans. these are very important aspects of any business, as they will show your course of action for both management and marketing the business. 
refer to these documents often to make sure you stay on track. without these documents, you may not be able to secure funding. ad google_ad_customer = "ca-pub-9543332082073187"; /* iframe unit - first step */ if(abtype == 2) google_ad_slot = '4878010577'; else //a or normal google_ad_slot = '5205564977'; if(abtype == 2 || abtype == 3 || abtype == 4 || abtype == 5 || abtype == 6) { google_ad_width = 629; google_ad_height = 120; google_max_num_ads = 2; } else if(abtype >= 7) { google_ad_width = 600; google_ad_height = 60; google_max_num_ads = 1; } else { google_ad_width = 629; google_ad_height = 60; google_max_num_ads = 1; } google_ad_results = 'html'; google_override_format = true; google_ad_channel = "2748203808+7733764704+1640266093+6709519645+8052511407+2490795108+6822404019+7122150828" + gchans + xchannels; document.communication(''); //--> 2follow all legalities for starting a cafe business of this nature in your area. make sure you get all the necessary licenses, permits, and insurance required on federal, state, and local levels. 3secure funding for your business. in your business plan, you determined how much funding you need to start a cafe business. contact investors, apply for loans, and use whatever capital you have on hand to start the business. 
riverrunner
  • Consider [PHP Simple HTML DOM](http://simplehtmldom.sourceforge.net) or [DOMDocument](http://php.net/domdocument) to walk the DOM and get the inner text you want. – bishop Dec 15 '13 at 03:18
  • possible duplicate of [Converting HTML to plain text in PHP for e-mail](http://stackoverflow.com/questions/1884550/converting-html-to-plain-text-in-php-for-e-mail) – Ilmari Karonen Dec 15 '13 at 03:43

2 Answers

Your JavaScript regex is off.

You have:

$query_text_lower = preg_replace("/<script[^>]*>.*?< *script[^>]*>/i", "", $new_text); 

The closing part of your pattern, < *script[^>]*>, has no slash, so it does not match </script> in the returned document. The regex therefore isn't removing the JavaScript code from the page. When you then call strip_tags, only the tags themselves are removed, so the tags don't appear in your final output but the code between them does. However, I can't see the site you're pulling this from, so I can't be 100% sure on that one.

Let me know if that makes sense. Basically, the way it looks to me is that your first regex isn't actually matching anything.
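
For illustration, here is a minimal sketch of a pattern that does match the closing tag. The helper name strip_script_blocks, the extra style alternative, and the s modifier (so that . also matches newlines in multi-line scripts) are assumptions of this sketch and are not taken from the question. It is still a regex, so unusual markup can defeat it.

```php
<?php
// Hypothetical helper (not from the question): removes <script>...</script>
// and <style>...</style> blocks, including their contents, so that a later
// strip_tags() call does not leave the code behind as plain text.
function strip_script_blocks(string $html): string
{
    // <(script|style)\b[^>]*>  opening tag with optional attributes
    // .*?                      lazily consume the block's contents
    // </\1\s*>                 closing tag of the same name (note the slash)
    // Modifiers: i = case-insensitive, s = "." also matches newlines.
    $pattern = '#<(script|style)\b[^>]*>.*?</\1\s*>#is';

    $result = preg_replace($pattern, '', $html);

    // preg_replace() returns null on failure; fall back to the input then.
    return $result ?? $html;
}

// Usage: remove the blocks first, then strip the remaining tags.
$html = "<p>Hello</p>\n<script type=\"text/javascript\">\nvar a = 1;\n</script>\n<p>World</p>";
echo trim(preg_replace('/\s+/', ' ', strip_tags(strip_script_blocks($html)))), "\n";
```

The order matters: once strip_tags has run, there are no tags left for a tag-based pattern to find.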

Zarathuztra
  • I think you pinpointed the problem. Changed the regex to: $query_text_lower = preg_replace('/ – riverrunner Dec 15 '13 at 04:06

You can't parse that stuff out using regular expressions. The suggestion to parse the HTML using the existing DOM tools is the right way to go.
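
As a rough sketch of what the DOM approach could look like with PHP's built-in DOMDocument and DOMXPath classes: the helper name html_to_words and the choice to drop script, style and noscript elements are assumptions made here for illustration. DOMDocument::loadHTML also guesses the encoding when the page does not declare one, so real pages may need extra handling.

```php
<?php
// Hypothetical helper (name and element list are assumptions): returns the
// text of an HTML string with <script>, <style> and <noscript> removed.
function html_to_words(string $html): string
{
    if (trim($html) === '') {
        return ''; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();

    // Real-world pages are rarely valid HTML; collect parser warnings
    // internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the matches into an array before removing them from the tree.
    $xpath = new DOMXPath($doc);
    $unwanted = iterator_to_array($xpath->query('//script | //style | //noscript'));
    foreach ($unwanted as $node) {
        $node->parentNode->removeChild($node);
    }

    $root = $doc->documentElement;
    if ($root === null) {
        return '';
    }

    // textContent concatenates the remaining text nodes; collapse whitespace.
    return trim((string) preg_replace('/\s+/', ' ', $root->textContent));
}

$page = '<html><head><title>Cafe</title><style>p{color:red}</style></head>'
      . '<body><p>Hello world</p><script>var a = 1;</script></body></html>';
echo html_to_words($page), "\n";
```

Text from adjacent elements is joined without a separator, which is the same effect seen in the sample output above ("dashboardrandom articleabout us"). Walking the text nodes individually and joining them with spaces would avoid that.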

Peter Wooster