13

Task

I want to use Python for doing full text searches of XML data.

Example data

<elements>
  <elem id="1">some element</elem>
  <elem id="2">some other element</elem>
  <elem id="3">some element
    <nested id="1">
    other nested element
    </nested>
  </elem>
</elements>

Basic functionality

The most basic functionality I want is that a search for "other" in an XPath ("/elements/elem") returns at least the value of the ID attribute for the matching element (elem 2) and nested element (elem 3, nested 1) or the matching XPaths.

Ideal functionality

The solution should be flexible and scalable. I am looking for possible combinations of these features:

  • search nested elements (infinite depth)
  • search attributes
  • search for sentences and paragraphs
  • search using wildcards
  • search using fuzzy matching
  • return precise matching info
  • good search speed for large XML files

Question

I don't expect a solution with all of the ideal functionality, I'll have to combine different existing functionalities and code the rest myself. But first I would like to know more about what there is out there, which libraries and approaches you would usually use for this, what their pros and cons are.

EDIT: Thanks for the answers so far, I added detail and started a bounty.

lecodesportif
  • 10,737
  • 9
  • 38
  • 58

6 Answers6

6

Not sure if that will be enough for your needs, but lxml has support for regular expressions in xpath (meaning: you can use xpath 1.0 plus the EXSLT extension functions for regular expressions)

Compared to the feature list that was added later:

  • search nested elements (infinite depth): yes
  • search attributes: yes
  • search for sentences and paragraphs: no. Assuming that "paragraphs" are actual xml elements, then yes. But "sentences" as such, no.
  • search using wildcards: yes (regular expressions)
  • search using fuzzy matching: no (assuming stemming, synonyms and so on...)
  • return precise matching info: yes
  • good search speed for large XML files: yes, except when your files are so extremely large that you would actually need a fulltext index to get good speed anyway.

The only way to satisfy all your request that I see, would be to load your files into a native xml database that supports "real" fulltext search (via XQuery Fulltext probably) and use that. (can't help you much further with that, maybe Sedna, which seems to have a python API and seems to supports fulltext search?)

Steven
  • 28,002
  • 5
  • 61
  • 51
2

I think you would be best served using a full text search engine like Solr: http://lucene.apache.org/solr/

What you can do is store a "document" in solr for each <elem /> in your xml. You can store any data you like in the document. Then you can search against the index and grab the id field stored in the matching documents. This will be very fast for a large set of documents.

Max
  • 6,901
  • 7
  • 46
  • 61
  • What's the overhead of Solr? Is there any Python integration? – lecodesportif Apr 27 '11 at 12:27
  • Solr is an indexing and search server. As for overhead, it depends on what you mean by that. If you want to search a large amount of these documents, using a full text search engine is really your best option. If you just have a few, it might not be worth it. With Solr you have to run a server, specify the format of the documents, and send them to it to be indexed. There is python integration, but it's also a simple api that responds with JSON so you can use it from anywhere. I highly recommend it! – Max Apr 27 '11 at 14:32
  • Thanks Max, this is something I'll keep in mind. But I need a more simple, easier to deploy, search solution for single, large XML files accessed by a Python script. – lecodesportif Apr 27 '11 at 14:38
  • [Solr](http://lucene.apache.org/solr/) is an extension of Lucence which is Java. Huge footprint and complexity for something as simple as these requirements. –  Apr 28 '11 at 19:47
  • I don't agree. His ideal requirements seem better suited for a full text search engine than a regex. – Max Apr 28 '11 at 19:57
  • Solr is the most feature-rich full text search engine out there. It has a ton of great libraries in python, for this use-case I'd recommend https://github.com/tow/sunburnt this library. It makes even complicated indexing/querying a breeze. – Zach Kelling May 03 '11 at 14:52
1
select="/elements/elem//[contains(.,'other')]"

see also xpath: find a node that has a given attribute whose value contains a string

Community
  • 1
  • 1
vartec
  • 131,205
  • 36
  • 218
  • 244
1

I'd recommend the following two:

Use XPath 2.0. It supports regular expressions.

Or,

Use XQuery and XPath (2.0) Full Text, which has even more powerful features.

Dimitre Novatchev
  • 240,661
  • 26
  • 293
  • 431
1

So, recently I had to create a XML to JSON converter. It doesn't conform exactly to the JSON standard, but it comes pretty close. The xml2json function returns a dictionary representation of the xml object. All element attributes are included in a dictionary with a key of attributes and element text are included in the text key.

For example, your xml object would look like this after its conversion:

json = {'elements': 
    {'elem': [
        {'attributes': {'id', '1'}, 'text': 'some element'},
        {'attributes': {'id', '2'}, 'text': 'some other element'},
        {'attributes': {'id', '3'}, 'text': 'some element', 'nested': {
            'attributes': {'id', '1'}, 'text': 'other nested element'}},
    ]}

Here is the xml2json function.

def xml2json(x):
    def get_attributes(atts):
        atts = dict(atts)
        d = {}
        for k, v in atts.items():
            d[k] = v.value
        return d

    def get_children(n, d):
        tmp = {}
        d.setdefault(n.nodeName, {})
        if n.attributes:
            tmp['attributes'] = get_attributes(n.attributes)
        if n.hasChildNodes():
            for c in n.childNodes:
                if c.nodeType == c.TEXT_NODE or c.nodeName == '#cdata-section':
                    tmp['text'] = c.data
                else:
                    children = get_children(c, {})
                    for ck, cv in children.items():
                        if ck in d[n.nodeName]:
                            if not isinstance(d[n.nodeName][ck], list):
                                tmpv = d[n.nodeName][ck]
                                d[n.nodeName][ck] = []
                                d[n.nodeName][ck].append(tmpv)
                            d[n.nodeName][ck].append(cv)
                        else:
                            d[n.nodeName][ck] = cv

        for tk, tv in tmp.items():
            d[n.nodeName][tk] = tv

        return d

    return get_children(x.firstChild, {})

Here is the searchjson function.

def searchjson(sobj, reg):
    import re
    results = []
    if isinstance(sobj, basestring):
        # search the string and return the output
        if re.search(reg, sobj):
            results.append(sobj)
    else:
        # dictionary
        for k, v in sobj.items():
            newv = v
            if not isinstance(newv, list):
                newv = [newv]

            for elem in newv:
                has_attributes = False
                if isinstance(elem, dict):
                    has_attributes = bool(elem.get('attributes', False))
                res = searchjson(elem, reg)
                res = [] if not res else res
                for r in res:
                    r_is_dict = isinstance(r, dict)
                    r_no_attributes = r_is_dict and 'attributes' not in r.keys()
                    if has_attributes and r_no_attributes :
                        r.update({'attributes': elem.get('attributes', {})})

                    results.append({k: r})

    return results

The search function I created after reading your question. It hasn't been 100% tested and probably has a few bugs, but I think it would be a good start for you. As for what you're looking for, it searches nested elements, attributes, using wildcards. It also returns the id of the elements.

You can use the function like so, where xml is the xml object to search and reg is a regex pattern string to search for, ex: 'other', 'oth.*', '.the.' will all find the elements with "other" in them.

json = xml2json(xml)
results = searchjson(json, reg='other')

results will be a list of dictionaries.

Hope it helps.

Bryce Siedschlaw
  • 4,136
  • 1
  • 24
  • 36
0

For single large files accessed by a Python script, you should look at Xapian it is a full featured full text indexing and search engine, that is mature and robust and has wonderful first class Python bindings. It works with Python like it was written in Python, no external servers to run or any silliness like that.

If you don't need to persist the indexes, and can use the in memory database it will be extremely fast. It is faster than Lucene based solutions and uses a tiny fraction of the resources.