2

Hi, I am feeding text to Zend_Search_Lucene. It can find words that appear before any special characters, but nothing after them is searchable.

for example:

    very well to the other job boards � one of the main things that has impressed is the variety of the applications, especially with regards to the background of the candidates" manoj � Head 

If I search for 'boards' I get a match, but if I search for 'one' or any string after the unreadable characters, I get nothing.

How can I remove these characters and get plain text?

I got these characters when converting .docx/PDF files to text.

Alternatively, let me know how to feed only plain text to Zend_Search_Lucene.

Please help.

Manojkumar

2 Answers

2

You can use the following preg_replace() call to remove all non-ASCII (so-called special) characters from your string:

    $replaced = preg_replace('/[^\x00-\x7F]+/', '', $str);
    // produces this converted text:
    //    "very well to the other job boards  one of the main things that has impressed
    // is the variety of the applications, especially with regards to the background of the
    // candidates" manoj  Head"

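If a special character sits directly between two words, deleting it can glue them into one token. A minimal sketch of a variant that substitutes a space and then collapses whitespace is shown here; `to_plain_ascii()` is a hypothetical helper name, not part of PHP or Zend_Search_Lucene, and the sample string is made up for illustration.

```php
<?php
// Hypothetical helper (an assumption, not a Zend_Search_Lucene API):
// reduce a string to plain ASCII before handing it to the indexer.
function to_plain_ascii(string $text): string
{
    // Replace each run of non-ASCII bytes with one space so that
    // neighbouring words are not merged into a single token.
    $ascii = preg_replace('/[^\x00-\x7F]+/', ' ', $text);

    // Collapse repeated whitespace left behind by the replacement.
    return trim(preg_replace('/\s+/', ' ', $ascii));
}

// Made-up sample: "\xE2\x80\x93" is the UTF-8 byte sequence of an en dash.
$sample = "job boards \xE2\x80\x93 one of the main things";
echo to_plain_ascii($sample), "\n"; // job boards one of the main things
```

Note that this also drops legitimate accented letters, so it only suits content that is expected to be English-only.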
anubhava
  • What if the characters include punctuation or otherwise readable characters? – nageeb May 30 '12 at 13:34
  • @nageeb: `\x00-\x7F` range includes punctuation characters (ASCII) as well. – anubhava May 30 '12 at 13:35
  • I understand, but I assumed the user would want to preserve punctuation. – nageeb May 30 '12 at 13:38
  • Hi anubhav, I just used the utf8_encode() function and got plain text. Please let me know if there are any drawbacks to using utf8_encode(). Thank you, nageeb. – Manojkumar May 30 '12 at 13:41
  • @NaanuManu: I didn't see any mention of utf8 in OP's question? I believe you got these special characters as a result of some pdf to txt conversion. – anubhava May 30 '12 at 13:44
  • @anubhava .. yes you are right.. I got this while converting from pdf to text... I used xpdf for this.. – Manojkumar May 30 '12 at 14:12
  • @anubhav: my solution went wrong, dude. utf8_encode() just hides those special characters so they appear to be gone, but I was still facing the same problem. After applying your method I got it right. I think you understand Hindi: aap ka ANUBHAV achha hai ("your experience is good"). Thank you very much. – Manojkumar May 30 '12 at 16:28
  • @NaanuManu: You're most welcome :) Yes I understand Hindi very well. – anubhava May 30 '12 at 16:37
  • @anubhava: Hi, I have another question for you. How do I convert BLOB data to text? I have many CVs stored as BLOBs in a database and I need to get their plain text. I have also posted this question at http://stackoverflow.com/questions/10683885/how-to-read-text-from-the-blob-format. Please help. – Manojkumar Jun 01 '12 at 13:09
1

You might need to convert the character set of the string being treated to match the character set of the current HTML document.

For example, if your HTML document uses UTF-8, you could run your string through utf8_encode(), which assumes the input is ISO-8859-1. If you're not sure which character set the source text uses, try mb_convert_encoding() with some of the more common charsets.
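As a rough sketch of the mb_convert_encoding() route, the snippet below leaves text alone when it is already valid UTF-8 and otherwise converts it from an assumed source charset. `to_utf8()` is a hypothetical helper name, and ISO-8859-1 as the default source is only a guess; the real charset depends on the tool that produced the text. The mbstring extension must be enabled.

```php
<?php
// Hypothetical helper (an assumption): normalise text to UTF-8.
// $assumedSource is a guess at the original charset, not a detected value.
function to_utf8(string $text, string $assumedSource = 'ISO-8859-1'): string
{
    // Already valid UTF-8: return unchanged to avoid double-encoding.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // Otherwise reinterpret the bytes as the assumed source charset.
    return mb_convert_encoding($text, 'UTF-8', $assumedSource);
}

// "\xE9" is "é" in ISO-8859-1; in UTF-8 it becomes "\xC3\xA9".
var_dump(to_utf8("caf\xE9") === "caf\xC3\xA9");
```

Checking validity first matters because calling a converter on text that is already UTF-8 produces double-encoded output, which looks like a fix on screen but still breaks tokenisation.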

nageeb