
I want to index Wikipedia XML files into Solr.

However, indexing fails with an error. Solr expects XML in its own specific format, so I changed the schema.xml and data-config.xml files to match the tags in the Wikipedia files.

It is still unable to index the files. My actual goal is to index Wikipedia, which is a single XML file of 30 GB.

How would I go about indexing all Wikipedia files into Solr?

Eric Leschinski
Satyam Roy
  • I solved the same issue in this link http://stackoverflow.com/questions/20473798/indexing-wikipedia-with-solr. I hope it helps. – Marcelo Jan 06 '14 at 08:40

1 Answer


There's an example section in the DataImportHandler documentation for exactly this: indexing Wikipedia.

Basically, you use the DataImportHandler and some XPath to pull the metadata you care about out of the Wikipedia XML, and put it in flat Solr field listings.
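To make the "XPath into flat fields" idea concrete, here is a minimal Python sketch of the same flattening step done outside Solr. It is not the DataImportHandler itself, only an illustration of what such a mapping produces. The sample XML, its namespace version, and the field names (`id`, `title`, `timestamp`, `text`) are assumptions chosen for illustration; adjust them to your dump and to your schema.xml.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Drop any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def find_text(elem, path):
    """Return the text at a slash-separated path of local tag names, or None."""
    node = elem
    for name in path.split("/"):
        node = next((child for child in node if local(child.tag) == name), None)
        if node is None:
            return None
    return node.text


def iter_pages(source):
    """Stream the dump and yield one flat dict per <page> element."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        yield {
            "id": find_text(elem, "id"),
            "title": find_text(elem, "title"),
            "timestamp": find_text(elem, "revision/timestamp"),
            "text": find_text(elem, "revision/text"),
        }
        elem.clear()  # drop the finished page's text and children


# Tiny stand-in for a Wikipedia dump (assumed structure, not real data).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Apache Solr</title>
    <id>42</id>
    <revision>
      <id>7</id>
      <timestamp>2012-04-03T20:30:00Z</timestamp>
      <text>Solr is a search platform.</text>
    </revision>
  </page>
</mediawiki>"""

docs = list(iter_pages(io.BytesIO(SAMPLE)))
print(docs[0]["title"])
```

Parsing incrementally rather than loading the whole tree matters for a 30 GB file; the DataImportHandler's XPath processor offers a streaming mode for the same reason.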

Dan Fitch
  • I tried it, but it is not working. The file gets committed, but when I search for it, it doesn't find any file. – Satyam Roy Apr 03 '12 at 20:30
  • Are you sure the file is in there? What happens when you search for `*:*`? – Dan Fitch Apr 03 '12 at 20:35
  • When I search, it doesn't show any results, even for `*:*`, as nothing is getting indexed. – Satyam Roy Apr 04 '12 at 08:54
  • Okay, one more sanity check for you. Are you doing a `<commit/>` operation after adding the file? [See this page for how to do that](http://wiki.apache.org/solr/UpdateXmlMessages#A.22commit.22_and_.22optimize.22) - most Solr libraries and wrappers will make this very easy. – Dan Fitch Apr 04 '12 at 20:54
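
For reference, the commit mentioned in the last comment can be sent as a plain XML update message. The sketch below uses only the Python standard library. The URL is an assumption (a default local install with a single core); multi-core setups typically use a path like `/solr/<core>/update`, and the helper names are hypothetical.

```python
import urllib.request

# Assumption: Solr runs locally on the default port with a single core.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_commit_request(update_url=SOLR_UPDATE_URL):
    """Build a POST request whose body is the XML <commit/> update message."""
    return urllib.request.Request(
        update_url,
        data=b"<commit/>",
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def commit(update_url=SOLR_UPDATE_URL):
    """Send the commit and return the HTTP status code (needs a running Solr)."""
    with urllib.request.urlopen(build_commit_request(update_url), timeout=30) as resp:
        return resp.status
```

Calling `commit()` after an import and then searching for `*:*` again is a quick way to tell "nothing was indexed" apart from "documents were added but never committed".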