
I am working on a web application and want to prevent users from entering certain invalid characters that are causing problems.

One such character is the diamond bullet from MS Word. To reject it, I need to know its Unicode code point so that I can include it in the Python regular expression of invalid characters, as suggested here.

import re

textString = u'some value'         # value to be checked for invalid characters
pattern = re.compile(u'[\u2022]')  # placeholder: character class of invalid characters
if pattern.search(textString):
    print('Invalid characters found')
else:
    print('Valid string')

I found a similar solution here but this is not working for bullets.

Please help me resolve this issue.

Deepak Uniyal
  • Have you tried searching for "unicode bullet"? – mvw Oct 01 '15 at 12:04
  • @mvw, I tried! I found U+2022 – Plasma Oct 01 '15 at 12:05
  • @Plasma I also found this but do I need to search for Unicode value of that particular bullet? – Deepak Uniyal Oct 01 '15 at 12:08
  • Guys can you please give me hint if you know rather than down voting – Deepak Uniyal Oct 01 '15 at 12:09
  • If you know which characters you don't want to allow (or conversely, the only ones you want to allow), one approach could be to use a regex on unicode ranges. See [this](http://stackoverflow.com/questions/3748855/how-do-i-specify-a-range-of-unicode-characters-in-a-regular-expression-in-python) answer on how to work with unicode ranges. Edit: Also, [this](http://stackoverflow.com/questions/5698267/efficient-way-to-search-for-invalid-characters-in-python?lq=1) is a discussion on a general approach. – Plasma Oct 01 '15 at 12:13
  • @Plasma Yes, of course you are correct in pointing it out but my problem is knowing Unicode of a particular character i.e. diamond bullet in MS Word. Can you help me with that? – Deepak Uniyal Oct 01 '15 at 12:16
  • You should look up tables of Unicode characters. [This one of geometric shapes](http://www.fileformat.info/info/unicode/block/geometric_shapes/list.htm) contains some diamonds, at least – Plasma Oct 01 '15 at 12:18
  • Also, fileformat.info has a unicode character [search](http://www.fileformat.info/info/unicode/char/search.htm) to identify unicode values. – Plasma Oct 01 '15 at 12:26

1 Answer


Create a Word document with your invalid characters. (Don't use the bullet maker icon, use the Insert->symbol->symbol browser and pick it from the map).

Unzip it.

unzip myDoc.docx

and open the word/document.xml file in an editor capable of displaying Unicode characters. Here I am using xmllint and more as a quick and dirty example. I don't know which bullet you mean, but the one I tried shows up as U+F075:

xmllint --format word/document.xml | more

<w:r w:rsidR="00A50B17" w:rsidRPr="00E62AD7">
    <w:rPr>
      <w:rFonts w:ascii="Wingdings" w:hAnsi="Wingdings"/>
      <w:color w:val="000000"/>
    </w:rPr>
    <w:t><U+F075></w:t>
  </w:r>
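
The same inspection can also be done from Python instead of unzip and xmllint. This is only a minimal sketch, assuming Python 3 and assuming the symbol is stored as literal UTF-8 text in word/document.xml, as in the output above. `non_ascii_code_points` is a hypothetical helper name, not part of any library:

```python
import re
import zipfile


def non_ascii_code_points(docx_path):
    """Return sorted 'U+XXXX' labels for each distinct non-ASCII character in the body."""
    # A .docx file is a zip archive; the body text lives in word/document.xml.
    with zipfile.ZipFile(docx_path) as archive:
        xml_text = archive.read("word/document.xml").decode("utf-8")
    # Drop the markup so that only the text between tags is inspected.
    text_only = re.sub(r"<[^>]+>", "", xml_text)
    found = {ch for ch in text_only if ord(ch) > 127}
    return ["U+%04X" % ord(ch) for ch in sorted(found)]
```

Calling `non_ascii_code_points("myDoc.docx")` returns a list of code point labels that you can then look up or copy into your pattern. Characters written as XML character references rather than literal text would not be caught by this sketch.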

Then add each Unicode code point you find this way to the invalid-character pattern in your script.
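
Putting the code points into the script could look like the sketch below. It assumes Python 3 and uses the two code points mentioned on this page, U+2022 and U+F075, purely as examples; `INVALID_CHARS` and `find_invalid` are hypothetical names. Keep in mind that U+F075 lies in the Private Use Area, so what it looks like depends on the font (Wingdings in the XML above).

```python
import re

# Example code points only: U+2022 (bullet) and U+F075 (the private-use
# character from the XML above). Extend the class with your own findings.
INVALID_CHARS = re.compile("[\u2022\uf075]")


def find_invalid(text):
    """Return the 'U+XXXX' label of each invalid character found in text."""
    return ["U+%04X" % ord(m.group()) for m in INVALID_CHARS.finditer(text)]


if __name__ == "__main__":
    sample = "First item \uf075 second item"
    hits = find_invalid(sample)
    if hits:
        print("Invalid characters found: %s" % ", ".join(hits))
    else:
        print("Valid string")
```

Returning the code points rather than a plain yes/no makes it easier to tell the user which character was rejected.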

ergonaut