
Dear Stack Overflow developers, I need your help. I am stuck using Apache Lucene in a Java Swing application, and the problem is complex enough that I am not sure how to ask it, so please bear with me. The requirement is simple: I have to ship HTML files that the client can read in the Swing application, and I decided to use Apache Lucene indexing for the search facility. Search works, but now I want to display the content of the HTML files that match the search criteria. I am using Swing, and JEditorPane is the control in which I have to display the HTML content. Please suggest how I should index the HTML files and how I can get their content back from the Lucene index. The HTML files contain not only text but also links, images, etc.

Thanks in advance for your help. Regards,

adesh kumar

1 Answer


In one of our projects where we used Lucene for full-text indexing and search, we handled HTML files as follows:

  • Stored the HTML document as is on disk (you can store it in the DB as well).
  • Using the Jericho HTML Parser's HTML-to-text converter, we extracted the text, links, etc. from the HTML documents.
  • Each Lucene document had fields that stored metadata about the HTML file, in addition to the HTML's text content in tokenized form.
  • Used StandardAnalyzer to keep certain tokens, such as email addresses and website links, intact during tokenization before indexing.
  • Upon searching the index, the hits returned contained the metadata of the HTML files that matched the criteria, so we were able to identify the HTML content to display for a given search result.
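
To illustrate the text-extraction step, here is a minimal sketch. It is not the code from our project: instead of Jericho it uses the HTML parser that ships with the JDK (`javax.swing.text.html.parser.ParserDelegator`), so it runs without extra libraries. The class names `HtmlTextExtractor` and `Extracted` are made up for this example.

```java
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: pulls plain text and link targets out of an HTML string.
// Uses the JDK's built-in parser as a stand-in for Jericho.
public class HtmlTextExtractor {

    /** Plain text and href values collected from one HTML document. */
    public static final class Extracted {
        public final String text;
        public final List<String> links;

        Extracted(String text, List<String> links) {
            this.text = text;
            this.links = links;
        }
    }

    public static Extracted extract(String html) {
        StringBuilder text = new StringBuilder();
        List<String> links = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate consecutive text chunks with a single space.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Record the target of every <a href="..."> element.
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declared in the HTML.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new Extracted(text.toString(), links);
    }
}
```

The extracted text would go into the tokenized content field, and the links into metadata fields. The file's path or ID is kept alongside them so that a hit can be mapped back to the HTML to load into the JEditorPane.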

HTH.

Vikdor
  • So can you help me with how to store HTML files in a database and how to read them back from the database? – adesh kumar Oct 04 '12 at 05:32
  • You would just create a table of (ID, FILENAME, FILECONTENT, DATECREATED, DATEUPDATED) and store each HTML file in a record. The indexer process would pick the relevant records from the table and index them. During search, get the ID from the document object returned in Hits, retrieve the corresponding content, and display it in the JEditorPane. – Vikdor Oct 04 '12 at 05:35
  • Can I store them in MS Access too? Suppose I want to store all files from a folder directly into a table in an MS Access database; how is that possible? – adesh kumar Oct 04 '12 at 09:01
  • You can use JDBC to insert records into the table, where one of the columns is the actual content of the file, read from the directory where it resides. I would actually leave it on disk and just store a reference to it in the database, to keep the database less bulky. – Vikdor Oct 04 '12 at 09:17
  • Yes, that's fine, but we can't give out the data in its actual format. We just want to deliver the HTML data in the form of a Lucene index so that in future it could be updated over the internet. – adesh kumar Oct 04 '12 at 10:55
  • Understood your use case now. – Vikdor Oct 04 '12 at 11:11
  • Is there any solution to my problem? – adesh kumar Oct 05 '12 at 06:30