
UPDATE: I opened an issue on GitHub at Ivan Mainetti's suggestion. If you want to weigh in there, it is here: https://github.com/orientechnologies/orientdb/issues/6757

I am working on a database built on OrientDB and using a Python interface for it. I've had pretty good luck with it, but I've run into a problem that seems to come from the driver (pyorient) mishandling certain unicode characters.

The data structure I'm uploading to the database looks like this:

new_Node = {
    '@Nodes': {
        'Abs_Address': Ono.absolute_address,
        'Content': Ono.content,
        'Heading': Ono.heading,
        'Type': Ono.type,
        'Value': Ono.value
    }
}

I have created literally hundreds of records flawlessly on OrientDB / pyorient. I don't think this is necessarily a pyorient-specific question, however: I think it is failing on one particular record because the Ono.absolute_address element contains a unicode character that pyorient is somehow choking on.

The record I want to create has an Abs_Address of /u/c/2/a1–2, but the node I get when I pass the value to my data structure above is this:

{'@Nodes': {'Content': '', 'Abs_Address': u'/u/c/2/a1\u20132', 'Type': 'section', 'Heading': ' Transferred', 'Value': u'1\u20132'}}

I think my problem is that Python is somehow mixing unicode and ASCII strings / chars? I'm a bit new to Python and to not declaring types, so I'm hoping this isn't an issue with pyorient per se, given that the new_Node object doesn't output the properly formatted string...? Or is this an instance of pyorient not liking unicode? I'm tearing my hair out on this one. Any help is appreciated.
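Note: the escaped `u'...\u2013...'` form shown above is likely just how Python 2 displays a unicode string that contains an EN DASH, not a sign that the value was altered. A minimal sketch, written for Python 3, where `ascii()` gives the same escaped view that Python 2's `repr()` shows for unicode objects:

```python
# The address from the record above, with an EN DASH (U+2013).
address = "/u/c/2/a1\u20132"

# ascii() escapes non-ASCII characters for display only;
# the string itself still holds a single EN DASH character.
escaped_view = ascii(address)

print(escaped_view)   # '/u/c/2/a1\u20132'
print(len(address))   # 11
```

The escaped view and the real string are the same 11-character value; only the display differs.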

In case the error is coming from pyorient and not some kind of text encoding, here's the pyorient-related info. I am creating the record using this code:

rec_position = self.pyo_client.record_create(14, new_Node)

And this is the error I'm getting:

com.orientechnologies.orient.core.storage.ORecordDuplicatedException - Cannot index record Nodes{Content:,Abs_Address:null,Type:section,Heading: Transferred,Value:null}: found duplicated key 'null' in index 'Nodes.Abs_Address' previously assigned to the record #14:558

The error is odd, as it suggests that the backend database is getting a null object for the address. Apparently it did create an entry for this "address," but that is not what I want it to do. I don't know why address strings with unicode are coming up null in the database... I can create the record through OrientDB Studio using the exact string I fed into the new_Node data structure... but I can't use Python to do the same thing.

Someone help?

EDIT:

Thanks to Laurent, I've narrowed the problem down to something to do with unicode objects and pyorient. Whenever the variable I am passing is of type unicode, the pyorient adapter sends a null value to the OrientDB database. I determined that the value causing the problem is an en dash symbol, and Laurent helped me replace it with a minus sign using this code:

.replace(u"\u2013",u"-")

When I do that, however, pyorient gets unicode objects, which it then passes as null values... This is not good. I can fix this in the short term by casting the string with str(...), which appears to solve my immediate problem:

str(Ono.absolute_address.replace(u"\u2013",u"-"))

The problem is that I know I will have symbols and other unusual characters in my DB data. I know the database supports unicode strings because I can add them manually or use SQL syntax to do what I cannot do via pyorient and Python... I am assuming this is a dicey casting issue somewhere, but I'm not really sure where. This seems very similar to this problem: http://stackoverflow.duapp.com/questions/34757352/how-do-i-create-a-linked-record-in-orientdb-using-pyorient-library
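One likely reason the str(...) cast only works after the replacement: in Python 2, str() on a unicode object encodes it with the ASCII codec, which cannot represent U+2013. A minimal sketch of the difference between ASCII and UTF-8 encoding, written for Python 3 with explicit `.encode()` calls. Whether pyorient handles UTF-8 byte strings correctly is not something this example establishes.

```python
text = "/u/c/2/a1\u20132"

# Encoding to ASCII fails because the EN DASH is outside the ASCII range.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can represent it; the EN DASH becomes the bytes E2 80 93.
utf8_bytes = text.encode("utf-8")

# After the replacement the text is pure ASCII, so the cast succeeds.
replaced = text.replace("\u2013", "-").encode("ascii")

print(ascii_ok)    # False
print(utf8_bytes)  # b'/u/c/2/a1\xe2\x80\x932'
print(replaced)    # b'/u/c/2/a1-2'
```

The leading 0xe2 byte is the same one named in the "ascii codec can't decode byte 0xe2" error.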

Any pyorient people out there? Python gods? Lucky s0bs? =)

JSv4
  • Can you show us your database schema (if any)? How is the `Nodes.Abs_Address` index declared? – Laurent LAPORTE Sep 29 '16 at 05:14
  • Sure, I created it using the OrientDB studio gui. It's a String type. If you're looking for something more specific, let me know how to get it from the DB and I'll be happy to grab it for you. I'm new to OrientDB. Thanks! – JSv4 Sep 29 '16 at 12:40
  • There is a similar question about String type/encoding, see: http://stackoverflow.com/questions/7381718 – Laurent LAPORTE Sep 29 '16 at 12:42
  • Thanks, Laurent, but I don't think that quite answers the question. I wish I knew if the problem is in python or OrientDB. The string I have cannot fit into ascii encoding because it unfortunately uses the \u2013 character. I tried to use a quick and dirty find and replace of that character, which is essentially a dash or minus sign (.replace("–", "-")), as done here: http://stackoverflow.com/questions/20329896/python-2-7-character-u2013 but that failed too... giving me the error from python that the ascii codec can't decode byte 0xe2 in position 0: ordinal not in range(128). – JSv4 Sep 29 '16 at 13:46
  • To replace the EN DASH (U+2013): `u'/u/c/2/a1\u20132'.replace(u"\u2013", u"-")`=> `u'/u/c/2/a1-2'`. It should work. You should replace the EN DASH in the 'Value': u'1\u20132' too. – Laurent LAPORTE Sep 29 '16 at 13:49
  • Interesting... you're right, that did work... but now the values in my new_Node are all unicode objects u"...". I can see from the error message I'm getting (not a syntax error but a duplicate element in DB error) that the pyorient adapter for the DB is somehow sending null values to the DB whenever one of the arguments is a unicode string... Possibly a bug in the program? Seems similar to this question (http://stackoverflow.duapp.com/questions/34757352/how-do-i-create-a-linked-record-in-orientdb-using-pyorient-library). In the meantime, do you know if I can replace and then re-encode as str? – JSv4 Sep 29 '16 at 14:04
  • Hi, this definitely sounds like a bug! Could you open an issue on GitHub? https://github.com/mogui/pyorient/issues or, even better, here: https://github.com/orientechnologies/orientdb/issues – Ivan Mainetti Sep 29 '16 at 16:31
  • Will do. Adding it to the orientechnologies GitHub now. – JSv4 Sep 29 '16 at 18:29

1 Answer


I have tried your example on Python 3 with the development branch of pyorient and OrientDB 2.2.11, the latest version. If I pass the values without escaping them, your example works for me and I get the right values back.

So this test works:

def test_test1(self):
    new_Node = {'@Nodes': {'Content': '',
                           'Abs_Address': '/u/c/2/a1–2',
                           'Type': 'section',
                           'Heading': ' Transferred',
                           'Value': u'1\u20132'}
                }

    self.client.record_create(14, new_Node)

    result = self.client.query('SELECT * FROM V where Abs_Address="/u/c/2/a1–2"')
    assert result[0].Abs_Address == '/u/c/2/a1–2'

I think you may be saving the unicode value as an escaped value and that's where things get tricky.

I don't trust replacing values myself so I usually escape the unicode values I send to orientdb with the following code:

import json
def _escape(string):
    return json.dumps(string)[1:-1]
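
To see what this helper produces, here is a small sketch (Python 3) using the address from the question. `json.dumps` escapes non-ASCII characters by default (`ensure_ascii=True`), and the slice strips the surrounding double quotes.

```python
import json

def _escape(string):
    # json.dumps returns a quoted JSON string; [1:-1] drops the quotes.
    return json.dumps(string)[1:-1]

escaped = _escape("/u/c/2/a1\u20132")
print(escaped)  # /u/c/2/a1\u20132  (a literal backslash followed by u2013)

# Decoding the escape sequences restores the original text.
restored = escaped.encode("UTF-8").decode("unicode_escape")
print(restored == "/u/c/2/a1\u20132")  # True
```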

The following test would fail because the value in the query won't match the escaped value stored in the DB, so no record will be returned:

def test_test2(self):
    new_Node = {'@Nodes': {'Content': '',
                           'Abs_Address': _escape('/u/c/2/a1–2'),
                           'Type': 'section',
                           'Heading': ' Transferred',
                           'Value': u'1\u20132'}
                }

    self.client.record_create(14, new_Node)

    result = self.client.query('SELECT * FROM V where Abs_Address="%s"' % _escape('/u/c/2/a1–2'))
    assert result[0].Abs_Address.encode('UTF-8').decode('unicode_escape') == '/u/c/2/a1–2'

In order to fix this, you have to escape the value twice:

def test_test3(self):
    new_Node = {'@Nodes': {'Content': '',
                           'Abs_Address': _escape('/u/c/2/a1–2'),
                           'Type': 'section',
                           'Heading': ' Transferred',
                           'Value': u'1\u20132'}
                }

    self.client.record_create(14, new_Node)

    result = self.client.query('SELECT * FROM V where Abs_Address="%s"' % _escape(_escape('/u/c/2/a1–2')))
    assert result[0].Abs_Address.encode('UTF-8').decode('unicode_escape') == '/u/c/2/a1–2'

This test will succeed because you will now be asking for the escaped value in the DB.
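
A sketch of what the double escape changes (Python 3). The second pass escapes the backslash itself, so the query text carries two backslashes before `u2013`. The assumption here, which I have not confirmed against the OrientDB parser, is that the SQL string literal consumes one level of backslash escaping before the comparison.

```python
import json

def _escape(string):
    return json.dumps(string)[1:-1]

original = "/u/c/2/a1\u20132"
once = _escape(original)   # one backslash before u2013: the form stored in the DB
twice = _escape(once)      # two backslashes before u2013: the form placed in the query

query = 'SELECT * FROM V where Abs_Address="%s"' % twice

print(once)
print(twice)
print(query)
```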

anber
  • Anber, the bigger issue I'm having is that I have tons of text that I didn't generate that has the occasional and unpredictable unicode character in it. I'm just pulling text out of a huge xml file, storing it in a variable, passing it to the data structure I showed above, and then trying to add it via pyorient. I like your idea of using JSON encoding... but I can't find and replace specific chars as I don't know what they will be ahead of time. Do you think your method will work regardless of what's in the string? – JSv4 Oct 12 '16 at 14:32
  • It seems like you have already been saving the unicode-escaped values into your DB, so you should probably keep doing it or else you will have a mess on your hands. Because you don't know which values will give you issues, you should escape all your values. When you construct the queries, make sure you escape non-escaped values twice, and decode them back to the original values when you need to display them. I'm sure that would work, but you will have to write your own unit tests to be sure. – anber Oct 12 '16 at 15:03
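
The escape-everything approach described in the last comment could be sketched as follows (Python 3). The helper names `escape_node` and `unescape` are hypothetical, and `unescape` uses `json.loads` as the inverse of `json.dumps` instead of the `unicode_escape` codec used in the answer. This is a sketch to adapt and test, not a confirmed pyorient recipe.

```python
import json

def _escape(string):
    return json.dumps(string)[1:-1]

def unescape(string):
    # Hypothetical inverse of _escape: re-add the quotes and parse as JSON.
    return json.loads('"%s"' % string)

def escape_node(node):
    """Return a copy of a {'@Class': {field: value}} dict with every string value escaped."""
    return {
        cls: {k: _escape(v) if isinstance(v, str) else v for k, v in fields.items()}
        for cls, fields in node.items()
    }

new_node = {'@Nodes': {'Abs_Address': '/u/c/2/a1\u20132',
                       'Heading': ' Transferred',
                       'Value': '1\u20132'}}
escaped_node = escape_node(new_node)

# Values read back from the DB would be decoded before display.
shown = unescape(escaped_node['@Nodes']['Abs_Address'])
```

Because every string value goes through the same helper, no character needs to be known in advance; pure-ASCII values such as the heading pass through unchanged.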