I am working in a CF 9 environment with Solr collections. I'm working with seven of them, and all of them contain only PDFs. When I search with CFSEARCH, some documents that should match are missing from the results.

To give a specific example, the client has ten PDFs that contain the string 1386 somewhere in the body of the document. But when 1386 is entered in the search form, only four of them appear. The client is concerned that not every PDF containing 1386 is being displayed in the search results.

I have been following (with great interest) David Faber's posts espousing the CFHTTP method of querying a Solr collection, but I'm running into snags trying to implement it.

One of the issues is that with CFSEARCH I'm using all four CUSTOM fields, and I also get the CONTEXT column, which highlights the keyword. With the CFHTTP method, I'm not getting CONTEXT with highlighted keywords.

Also, I'm trying to deserialize JSON and convert that to a query object. But I keep getting the common error message about

attempting to reference a scalar variable array as a structure with members
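
For reference, here is a minimal sketch of the response shape involved, written in TypeScript purely as an illustration; the same traversal would apply in CFML after deserializeJSON(). In Solr's JSON output, `response.docs` is an array of documents rather than a struct, and highlighted snippets arrive in a separate `highlighting` object keyed by each document's unique key, and only when the request includes `hl=true` (usually together with `hl.fl`). The field names `uid`, `title`, and `contents` below are assumptions, so check the collection's schema for the real ones.

```typescript
// Sketch only: "uid", "title" and "contents" are assumed field names,
// not confirmed against any particular ColdFusion Solr schema.
interface SolrDoc {
  [field: string]: unknown;
}

interface SolrResponse {
  response: { numFound: number; start: number; docs: SolrDoc[] };
  // Present only when the request asked for highlighting (hl=true).
  highlighting?: { [docId: string]: { [field: string]: string[] } };
}

interface ResultRow {
  id: string;
  title: string;
  context: string;
}

// Turn the raw JSON text returned by Solr into flat rows, one per document,
// attaching any highlighted snippets as a CONTEXT-like column.
function toRows(raw: string, idField = "uid", hlField = "contents"): ResultRow[] {
  const data = JSON.parse(raw) as SolrResponse;
  const highlighting = data.highlighting ?? {};
  return data.response.docs.map((doc) => {
    const id = String(doc[idField]);
    const snippets = highlighting[id]?.[hlField] ?? [];
    return {
      id,
      title: String(doc["title"] ?? ""),
      context: snippets.join(" ... "),
    };
  });
}
```

Because the documents sit in an array, they have to be walked by index (a loop over `docs`); addressing `docs` as if it were a single struct is one plausible way to run into an error like the one quoted above.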

Advice/suggestions greatly appreciated.

  • For the JSON issue, I recommend you create some arrays, structures, regular variables in ColdFusion and then use serializeJSON() to observe the behavior when they are converted to JSON. It's pretty simple once you get the hang of it. – J.T. Mar 07 '13 at 15:08
  • As far as not finding all the results in your PDFs for '1386', I think it's because the OCR used to determine the contents of the document may not see that string in the PDF as '1386'. Perhaps it is being stored as 'I386' or '13B6' or something like that. I've got the same issue but assumed it was just a bad OCR read and left it at that. A 60% failure rate seems a bit high though. Depends on the document's quality I suppose. I'm interested to see if you find something else is the problem. – genericHCU Mar 07 '13 at 15:47
  • Could also be related to this if you have larger documents. http://www.raymondcamden.com/index.cfm/2011/8/22/Indexing-PDFs-with-Solr-Read-this-tip – genericHCU Mar 07 '13 at 15:53
  • @Travis: A possibility, but as I understand it, all the PDFs were created from Word documents using Adobe's toPDF plugin, not by scanning hard-copy documents. Does OCR still come into play for that? That link you provided might contain the answer. I'll change that on my dev system and see. If that's the case, then I'll have to petition the webadmin to do the same on production. Thanks! –  Mar 07 '13 at 16:44
  • Honestly I don't know if Solr uses OCR at all or how it reads PDFs. I only have 3 collections that I've recently inherited. I moved them to Solr but the Verity collection wasn't returning all the results either. I haven't looked hard into the Solr or Verity engines. Sorry I can't be more help. – genericHCU Mar 07 '13 at 16:47
  • I'm glad that Verity is no longer supported in CF. Bulky and bloated, not at all what Solr is. I don't think collections rely on OCR (at least, I don't think I've ever seen a PDF with text in an image come up in results if any of that text were part of the search.) –  Mar 07 '13 at 17:04
  • @J.T.: Thank you! I had to tweak over and over, but finally got the JSON to convert to a query (using CFLOOP, sadly) and it's even working with the pagination (I had to add a variable for incrementation.) –  Mar 07 '13 at 17:07
  • Not that it matters now, but when I was getting that reference scalar variable array error message, it was because I was using deserializeJSON but making my ajax calls through jQuery while I had the secureJSON flag set to true in application.cfc. The secureJSON flag prepends a JSON return with two slashes, and jQuery doesn't expect to see them. I added $.ajaxSetup and a dataFilter to chop off those first two slashes. When you return your ajax calls with CF tags, it does that for you automatically. – K_Cruz Mar 07 '13 at 17:43
  • *serializeJSON, not deserializeJSON – K_Cruz Mar 07 '13 at 17:54
  • When I deserializeJSON the data, I set the flag to false. That seemed to help. –  Mar 07 '13 at 18:05
  • MS Word can inject characters into strings that cannot be shown. The HTML equivalent would be something like `13B6`. The end user sees nothing, but a find on that string would never match. – James A Mohler Sep 17 '13 at 22:31
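
The secureJSON workaround described in the comments can be sketched as a small helper. This is an illustration in TypeScript under stated assumptions, not the commenter's actual code: the prefix is taken to be the two slashes mentioned above, and the function names `stripSecureJsonPrefix` and `parseSecureJson` are hypothetical.

```typescript
// Remove the secureJSON prefix (assumed here to be "//") if it is present,
// and leave the payload untouched otherwise.
function stripSecureJsonPrefix(raw: string, prefix = "//"): string {
  return raw.slice(0, prefix.length) === prefix ? raw.slice(prefix.length) : raw;
}

// Strip the prefix, then parse what remains as ordinary JSON.
function parseSecureJson(raw: string, prefix = "//"): unknown {
  return JSON.parse(stripSecureJsonPrefix(raw, prefix));
}
```

With jQuery, a function like `stripSecureJsonPrefix` could be supplied as the `dataFilter` option in `$.ajaxSetup`, which receives the raw response text before jQuery parses it.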
