Represent characters in their actual form instead of Unicode escapes

Question

I have a file generated by this command: fab -f vocab/fabfile build_vocab:<lang>,<corpus_files_root>. The command is part of a spaCy guide and was obtained from here. Because the command runs through Fabric, which in turn runs under Python 2, the output contains many Persian strings written as Unicode escape sequences rather than as the actual characters. In other words, I have the following:

2   1   u'\u0641\u0632\u0646\u062f\u0627\u0646'
1   1   u'\u200c\u0645\u0648\u0647\u0627\u06cc'
2   1   u'\u0627\u0641\u0646\u0647'
.
.
.

instead of this:

2   1   u'فزندان'
1   1   u'موهای'
2   1   u'افنه'
.
.
.

The next part of the process, run by the above-mentioned fabric ... command, reads this file and compares each entry with the word in its actual form. So I think I need to convert the escaped strings into the actual characters. Is there any way to do so?

This is the `repr()` form. It is always ASCII-only in Python 2 AFAIK. You can upgrade to Python 3. Or use `unicode(...)` instead of `repr(...)`, but this won't give you the quotes around the strings (in case you actually need those). — lenz, Jan 08 '18 at 13:01
This is no problem for Python as it treats them the same – but how did you generate that file? It should be possible to change it. — Jongware, Jan 08 '18 at 13:01
It is a part of spaCy model vocabulary training. Unfortunately, that script works under python2. Is there any way to convert the texts above? — Gmosy Gnaq, Jan 08 '18 at 13:10
@GmosyGnaq, you need to provide more information. How do you process this output? Does it work in the second format, but not the first? If so, in what way does it fail? Please edit the question to clarify these points. — lenz, Jan 09 '18 at 07:45
Thank you @hovercraft-full-of-eels. It looks so. Should I do anything special? — Gmosy Gnaq, Jan 09 '18 at 13:10
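
For reference, the conversion asked about above can be sketched as a post-processing step. This is only a minimal sketch, run under Python 3, and it rests on assumptions not confirmed in the question: that every data line is two whitespace-separated columns followed by a `repr()` literal, and that the file is read as UTF-8. The names `decode_repr_line`, `convert_file`, and `keep_wrapper` are hypothetical helpers, not part of spaCy or Fabric. Whether the later step of the build accepts the rewritten file is not something the question settles.

```python
import ast
import re

# Assumed line layout: two columns, then a Python 2 repr() literal,
# for example:  2   1   u'\u0641\u0632\u0646\u062f\u0627\u0646'
LINE_RE = re.compile(r"^(\S+\s+\S+\s+)(u?(?:'.*'|\".*\"))\s*$")


def decode_repr_line(line, keep_wrapper=False):
    """Replace the escaped literal at the end of a line with real characters."""
    match = LINE_RE.match(line)
    if match is None:
        # Not in the assumed layout: return the line unchanged (minus newline).
        return line.rstrip("\n")
    prefix, literal = match.groups()
    # ast.literal_eval parses the literal without executing code;
    # Python 3.3+ accepts the legacy u'' prefix.
    text = ast.literal_eval(literal)
    if keep_wrapper:
        # Keep the u'...' wrapper shown in the question's desired output.
        return prefix + "u'" + text + "'"
    return prefix + text


def convert_file(src_path, dst_path, keep_wrapper=False):
    """Rewrite src_path into dst_path line by line (paths are placeholders)."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(decode_repr_line(line, keep_wrapper) + "\n")
```

Passing `keep_wrapper=True` keeps the `u'...'` quoting from the desired output in the question; the default drops it, which matches the `unicode(...)` behaviour described in the first comment.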
