re.split doesn't work properly with a string coming from an Excel cell

Question

I have a string:

05-01-2015 12:27 - KH - (KH) Igangværende - Opringning - 13-11 00:00 Fangede RLI på hans mobil. Ring igen kl. 15 19-11-2014 11:17 - KH - (KH) Igangværende - Opringning - 13-11 00:00 Gik på svarer igen og lagt besked til RLI at ringe tilbage. 12-11-2014 09:38 - KH - (KH) Igangværende - Opringning - 13-11 00:00 12-11-2014 09:32 - KH - (KH) Igangværende - Opringning - 15-10 00:00 Forsøgt RLI igen og lagt besked om han vil ringe. 14-10-2014 13:14 - KH - (KH) Igangværende - Opringning - 15-10 00:00 14-10-2014 13:10 - KH - (KH) Igangværende - Opringning - 14-10 00:00 Lagt besked til RLI at ringe 14-10-2014 13:06 - KH - (KH) Igangværende - Opringning - 14-10 00:00 test

I want to split this string into pieces so that each piece starts with a date. As solved in my other post about this task, I use a regex like:

match = re.search(r' (?=\d{2}-\d{2}-\d{4})', text)

When I assign the string above directly to the variable text in code, there is no problem. But if I read this text from a cell in an Excel file with xlrd or other libraries, I can't get the values properly. I also tried encoding and decoding the cell value. I only get the whole text in match[0]; there are no split pieces in match[1], match[2], or beyond. Here is how I read the text from the Excel file:

# -*- coding: utf-8 -*-
import re
import xlrd

book = xlrd.open_workbook("liste1.xlsx")

# get the first worksheet
first_sheet = book.sheet_by_index(0)

# read a cell
cell = first_sheet.cell(1,5)

text=cell.value
match = re.split(r' (?=\d{2}-\d{2}-\d{4})', text)

print match[0]

Could you help me with this please?

Thanks in advance.

Read the [documentation on `re.search`](https://docs.python.org/3.4/library/re.html#re.search) - it only finds the first match. You're probably looking for something like `re.findall`. — TigerhawkT3, Jun 23 '15 at 22:48
possible duplicate of [How can I find all matches to a regular expression in Python?](http://stackoverflow.com/questions/4697882/how-can-i-find-all-matches-to-a-regular-expression-in-python) — TigerhawkT3, Jun 23 '15 at 22:49
But it works when text="05-01-2015 12:27 - KH - (KH) Igangværende - Opringning - 13-11 00:00 Fangede RLI....." — Şansal Birbaş, Jun 23 '15 at 22:50
What do you mean, it's `re.split` and not `re.search`? Are you talking about the expression you posted above? — Iron Fist, Jun 23 '15 at 23:12
Yes. If you look at the other post that I linked above, I have a task and it's already solved with re.split. The problem here is that exactly the same text comes from a cell value in an Excel file, and this time it doesn't work the way I need. — Şansal Birbaş, Jun 23 '15 at 23:14
Then you might want to mention in your question that you want to split, not search. It's different. — Iron Fist, Jun 23 '15 at 23:18
But I never talk about searching in my post, and in the other post it's already solved with splitting. I think what I need to fix here is clear. — Şansal Birbaş, Jun 23 '15 at 23:21
Also, you might want to post the code that reads the data from the Excel cell. It's relevant here. — Iron Fist, Jun 23 '15 at 23:22
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/81379/discussion-between-khalil-ammour---and-sansal-birbas). — Iron Fist, Jun 24 '15 at 09:18
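The difference between `re.search` and `re.split` discussed in the comments above can be sketched as follows. This is a minimal illustration on a shortened, hypothetical sample string, not the actual cell content:

```python
import re

# Shortened, hypothetical sample; entries are separated by a plain space.
text = u"05-01-2015 12:27 - KH - first 19-11-2014 11:17 - KH - second"
pattern = r' (?=\d{2}-\d{2}-\d{4})'

# re.search returns a match object for the first separator only.
first = re.search(pattern, text)

# re.split returns a list with every piece between separators.
pieces = re.split(pattern, text)

print(repr(first.group(0)))
print(pieces)
```

With `re.search`, indexing the result gives the matched separator, not a piece of the text, so `re.split` is the call that fits the task described in the question.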

score 0 · Answer 1 · answered Jun 24 '15 at 16:37

Have you tried something like repr(text)?

What does it do exactly? – Şansal Birbaş Jun 24 '15 at 16:38
"Return a string containing a printable representation of an object." https://docs.python.org/3.4/library/functions.html#repr – Jun 24 '15 at 16:40
No, unfortunately. Let me explain, there is an interesting situation. When I copy that text from this post on Stack Overflow, the split succeeds. But if I copy the same content from the Excel cell in LibreOffice, it doesn't work. – Şansal Birbaş Jun 24 '15 at 16:45
Yes, that is probably because there are control characters in that cell that you can't see on screen. When the text is printed on screen the control characters are automatically removed, but when you read from that cell you have to remove them manually. – Jun 24 '15 at 16:51
Have you tried `re.split(r' (?=\d{2}-\d{2}-\d{4})', repr(textfromcell))`? – Jun 24 '15 at 16:52
Sorry, I have never worked with LibreOffice, but that is my guess. There are control characters that need to be removed before using the regex, or you need to change the pattern slightly. – Jun 24 '15 at 17:04
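The hidden-character guess in the comments above can be checked with a small sketch. It assumes, purely for illustration, that the cell separates entries with a newline instead of a space; the sample string is hypothetical and the real cell may contain other characters:

```python
import re

# Hypothetical sample: a newline, not a space, sits before the second date.
cell_text = u"05-01-2015 12:27 - KH - first note\n19-11-2014 11:17 - KH - second note"

# print() renders the newline as a line break, so it is easy to overlook.
print(cell_text)

# repr() shows the escape sequence explicitly.
print(repr(cell_text))

# The pattern from the question requires a literal space before each date,
# so the newline-separated entries are not split.
parts = re.split(r' (?=\d{2}-\d{2}-\d{4})', cell_text)
print(len(parts))
```

Here `repr` is used only to inspect the value. Passing `repr(text)` to `re.split` would not remove the characters, because it only changes how they are written.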

score 0 · Accepted Answer · edited May 23 '17 at 12:21

I solved this issue by adding a line that removes non-printing characters from the cell value, as already described in the post "Unwanted Character in Excel cell":

# -*- coding: utf-8 -*-
import re
import xlrd

book = xlrd.open_workbook("liste1.xlsx")

# get the first worksheet
first_sheet = book.sheet_by_index(0)

# read a cell
cell = first_sheet.cell(1,5)

# strip non-printing characters (CR, LF, tab, bell, vertical tab) before splitting
text = re.sub(r"[\r\n\t\x07\x0b]", "", cell.value)
match = re.split(r' (?=\d{2}-\d{2}-\d{4})', text)

print match[0]
print match[1]
print match[2]
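As an alternative that is not from the answer above, the pattern itself could accept any run of whitespace before a date, so newlines and tabs act as separators instead of being deleted first. This is a minimal sketch on a hypothetical sample with mixed separators; the actual cell content may differ:

```python
import re

# Hypothetical sample with mixed separators (CR+LF and a tab).
sample = (u"05-01-2015 12:27 - KH - first note\r\n"
          u"19-11-2014 11:17 - KH - second note\t"
          u"12-11-2014 09:38 - KH - third note")

# \s+ consumes any run of whitespace that is followed by a dd-mm-yyyy date.
pieces = re.split(r'\s+(?=\d{2}-\d{2}-\d{4})', sample)

for piece in pieces:
    print(repr(piece))
```

A bell character (`\x07`) is not whitespace, so if the cell contains one it would still have to be stripped as in the answer above.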
