0

I have a string:

05-01-2015 12:27 - KH - (KH) Igangværende - Opringning - 13-11 00:00 Fangede RLI på hans mobil. Ring igen kl. 15 19-11-2014 11:17 - KH - (KH) Igangværende - Opringning - 13-11 00:00 Gik på svarer igen og lagt besked til RLI at ringe tilbage. 12-11-2014 09:38 - KH - (KH) Igangværende - Opringning - 13-11 00:00 12-11-2014 09:32 - KH - (KH) Igangværende - Opringning - 15-10 00:00 Forsøgt RLI igen og lagt besked om han vil ringe. 14-10-2014 13:14 - KH - (KH) Igangværende - Opringning - 15-10 00:00 14-10-2014 13:10 - KH - (KH) Igangværende - Opringning - 14-10 00:00 Lagt besked til RLI at ringe 14-10-2014 13:06 - KH - (KH) Igangværende - Opringning - 14-10 00:00 test

I parse this string into pieces so that each piece starts with dates. For this purpose, as solved in my other post about my task I benefit from regex like :

match = re.search(r' (?=\d{2}-\d{2}-\d{4})', text)

When i write above string directly to variable text in code, there is no problem. But if i obtain this text from a cell in excel file with xlrd or others, i cant get values properly. I tried cell values with encode/decode things also. But i only get the text as a whole in match[0]. There are no splitted match1, match[2] or others. Here is how i try to get text from excel file :

# -*- coding: utf-8 -*-
import re
import xlrd

book = xlrd.open_workbook("liste1.xlsx")

# get the first worksheet
first_sheet = book.sheet_by_index(0)

# read a cell
cell = first_sheet.cell(1,5)

text=cell.value
match = re.split(r' (?=\d{2}-\d{2}-\d{4})', text)

print match[0]

Could you help me with this please?

Thanks in advance.

Community
  • 1
  • 1
Şansal Birbaş
  • 443
  • 1
  • 6
  • 14

2 Answers2

0

Have you tried something like repr(text)?

  • What it does exactly? – Şansal Birbaş Jun 24 '15 at 16:38
  • "Return a string containing a printable representation of an object." https://docs.python.org/3.4/library/functions.html#repr –  Jun 24 '15 at 16:40
  • No unfortunately. Let me tell you. There is an interestin situation. When i copy that text from this post i wrote in stackoverflow it does successfully. But if i copy same content from excel cell in libreoffice it doesnt work. – Şansal Birbaş Jun 24 '15 at 16:45
  • Yes, that is probably, because there are control characters in that cell that you can't see on screen. So, when printed on screen the control characters are automatically removed, but when you read from that cell, you have to remove those manually. –  Jun 24 '15 at 16:51
  • Have you tried `re.split(r' (?=\d{2}-\d{2}-\d{4})', repr(textfromcell))`? –  Jun 24 '15 at 16:52
  • 1
    Sorry, I have never worked with LibreOffice, but that is my guess. There are control characters that need be removed before using the regex or you need to change the pattern slightly. –  Jun 24 '15 at 17:04
0

I solved this issue by adding a line that removes non-printing characters from cell as already stated in this post Unwanted Character in Excel cell :

# -*- coding: utf-8 -*-
import re
import xlrd

book = xlrd.open_workbook("liste1.xlsx")

# get the first worksheet
first_sheet = book.sheet_by_index(0)

# read a cell
cell = first_sheet.cell(1,5)

text= re.sub(r"[\r\n\t\x07\x0b]", "", cell.value)
match = re.split(r' (?=\d{2}-\d{2}-\d{4})', text)

print match[0]
print match[1]
print match[2]
Community
  • 1
  • 1
Şansal Birbaş
  • 443
  • 1
  • 6
  • 14