
I'm trying to run a regular-expression match on data read from Excel into a Python list using openpyxl, but the data arrives as unicode and the match always returns None. The data is in Hebrew, and I want to convert the strings from Excel into strings that can be matched with a regex. What can be done?

import re
from openpyxl import load_workbook

file_name = 'excel.xlsx'
wb = load_workbook(file_name)
ws = wb[u'beta']
li = []
li2 = []
# read the cells from Excel into a list
for i in range(1,1500):
    li2.append(ws["A"+str(i)].value)

for i in li2:
    if i is not None:
        li.append(i)
# delete the unneeded list to free memory
del li2

r = re.match("א",li[1])
r == None
>>> True

The wanted result is r.string = "something..." rather than r == None.

Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
 >>> ================================ RESTART ================================
 >>> 
 >>> li[1]
 u"\u05d0\u05d1\u05d5 \u05d2'\u05d5\u05d5\u05d9\u05d9\u05e2\u05d3 (\u05e9\u05d1\u05d8)"
 >>> print li[1]
 אבו ג'ווייעד (שבט)
 >>> r = re.match(u'א',li[1])
 >>> r ==None
 True
 >>> r = re.match(ur'א',li[1])
 >>> r = re.match(u'',li[1])
 >>> r.string
 u"\u05d0\u05d1\u05d5 \u05d2'\u05d5\u05d5\u05d9\u05d9\u05e2\u05d3 (\u05e9\u05d1\u05d8)"
 >>> unicode('א')

 Traceback (most recent call last):
   File "<pyshell#7>", line 1, in <module>
   unicode('א')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
 >>> u'א'
 u'\xe0'
 >>> u'א'.encode("utf8")
 '\xc3\xa0'
 >>> u"א"
 u'\xe0' 
 >>> 
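
The session above hints at the cause: the letter typed at the prompt evaluates to u'\xe0', while the cell text starts with u'\u05d0', so the pattern and the data are different characters. Below is a minimal sketch of that difference, not a definitive diagnosis. The names typed_pattern, alef_pattern and recovered are made up for illustration, and the last step assumes the console used the Windows Hebrew code page cp1255, which the session does not state.

```python
import re

# Cell value copied from the session above, written with \u escapes so it
# does not depend on the console or source-file encoding.
text = u"\u05d0\u05d1\u05d5 \u05d2'\u05d5\u05d5\u05d9\u05d9\u05e2\u05d3 (\u05e9\u05d1\u05d8)"

typed_pattern = u"\xe0"    # what the shell produced for the typed letter
alef_pattern = u"\u05d0"   # Hebrew letter alef as an explicit code point

no_match = re.match(typed_pattern, text)
match = re.match(alef_pattern, text)

print(no_match is None)   # True
print(match.span())       # (0, 1)

# Assumption: the console encoding was cp1255 (Windows Hebrew). Under that
# assumption the stray u'\xe0' is alef's single byte read as Latin-1.
recovered = typed_pattern.encode("latin-1").decode("cp1255")
print(recovered == alef_pattern)  # True
```

Writing the pattern with a \u escape, or saving the script as UTF-8 with a coding declaration on the first line, avoids depending on how the shell decodes typed input.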
Charlie Clark
usher
  • Post your code and, if possible, the errors you get – Nobi Nov 24 '15 at 08:11
  • Possible duplicate of [python and regular expression with unicode](http://stackoverflow.com/questions/393843/python-and-regular-expression-with-unicode) –  Nov 24 '15 at 08:29

2 Answers

I put the Hebrew letter specified in your code into several cells, and then ran this code:

# -*- coding: utf-8 -*-
import re
from openpyxl import load_workbook

file_name = 'worksheet.xlsx'
wb = load_workbook(file_name)
ws = wb[u'beta']
li = []
li2 = []
# read the cells from Excel into a list
for i in range(1,1500):
    li2.append(ws["A"+str(i)].value)

for i in li2:
    if i is not None:
        li.append(i)
# delete the unneeded list to free memory
del li2

print "Non-empty cells: "
print li

r = re.search(u"א", li[1])

print "Match in: " 
print r.string.encode('utf-8')
print "Position: " 
print r.span()

Output:

Non-empty cells:
[u'Hebrew letter test 1 \u05d0', u'Hebrew letter test 2 \u05d0', u'Hebrew letter test 3 \u05d0', u'Hebrew letter test 4 \u05d0']
Match in:
Hebrew letter test 2 ÎÉ
Position:
(21, 22)

Please let me know if that's what you needed.
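
Note that the code above calls re.search where the question used re.match. re.match only tries the pattern at the start of the string, while re.search scans the whole string. A minimal sketch of the difference, using one of the test strings from the output above (the variable names are only illustrative):

```python
import re

# One of the test cell values from the output above; alef is the last character.
text = u"Hebrew letter test 2 \u05d0"
alef = u"\u05d0"

at_start = re.match(alef, text)    # only tries position 0
anywhere = re.search(alef, text)   # scans the whole string

print(at_start)          # None
print(anywhere.span())   # (21, 22)
```

The span agrees with the Position line in the output above.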

zephi

And the answer is:

import re
from openpyxl import load_workbook
file_name = "excel.xlsx"
wb = load_workbook(file_name) 
ws = wb[wb.get_sheet_names()[0]]

#regex

match = re.search(r"\d",ws["A2"].value )
print match.group(0)

:)
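
One caveat with passing a cell value straight to re.search: an empty cell comes back as None (the loops earlier on this page filter those out), and re.search raises a TypeError for anything that is not a string. Here is a hedged sketch of a guard. The helper first_hebrew_word and the sample cells list are hypothetical stand-ins for real worksheet values, and the character class only covers the basic Hebrew letters U+05D0 to U+05EA.

```python
import re

# Hebrew letters alef..tav occupy U+05D0..U+05EA.
HEBREW_WORD = re.compile(u"[\u05d0-\u05ea]+")


def first_hebrew_word(value):
    """Return the first run of Hebrew letters in value, or None.

    Hypothetical helper: anything that is not text (empty cells, numbers)
    is skipped instead of being passed to the regex.
    """
    if not isinstance(value, type(u"")):
        return None
    found = HEBREW_WORD.search(value)
    return found.group(0) if found else None


# Stand-in for cell values: an empty cell, the string from the question,
# a numeric cell and a text cell without Hebrew letters.
cells = [
    None,
    u"\u05d0\u05d1\u05d5 \u05d2'\u05d5\u05d5\u05d9\u05d9\u05e2\u05d3 (\u05e9\u05d1\u05d8)",
    42,
    u"no Hebrew here",
]
results = [first_hebrew_word(value) for value in cells]
print(results[1] == u"\u05d0\u05d1\u05d5")  # True
```

With a real workbook, the values in cells would come from ws["A" + str(i)].value as in the code above.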

usher