Given the sentence below, five annotators, A1 to A5, each split it according to their own knowledge. For example, A1, A2 and A4 split the sentence in two, while A3 and A5 do not split it.

As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 inhibited by 30% DNA fragmentation |A1:1M| |A2:1S| |A4:1S| induced by 1 muM DNR and MXT. |A1:2S| |A2:2S| |A3:1M| |A4:2S| |A5:1M|

The objective is to divide the sentence into two sub-sentences: "As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 inhibited by 30% DNA fragmentation" and "induced by 1 muM DNR and MXT." In addition, each sub-sentence has five labels, one from each of the five people. For example, the first sub-sentence should have the five labels 1M, 1S, 1M, 1S, 1M and the second sub-sentence should have the five labels 2S, 2S, 1M, 2S, 1M.

I am using Python for this. First I call rawinput.split('|') and store the pieces in a list, then I delete all the strings such as A1:1M, and then I read the labels again and attach them to the list. This is quite complicated; is there an easier way to do it, for example with the re package? Thank you very much.

flyingmouse
  • "`A1` separate[s] the sentence" - Which sentence? How? Can you come up with a general "separation rule"? – Jasper Jan 05 '16 at 14:30
  • The rule is: wherever a `|***|` marker occurs, the sentence is split there. In this example, the sentence should be split into two. – flyingmouse Jan 05 '16 at 14:39

2 Answers

Is this something you are looking for?

>>> import re
>>> re.split(r" (?:\|[^\|]+:[^\|]+\| ?)+", "As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 inhibited by 30% DNA fragmentation |A1:1M| |A2:1S| |A4:1S| induced by 1 muM DNR and MXT. |A1:2S| |A2:2S| |A3:1M| |A4:2S| |A5:1M|")

['As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 inhibited by 30% DNA fragmentation',
'induced by 1 muM DNR and MXT.', '']

This uses the re.split() method to split the input at " (?:\|[^\|]+:[^\|]+\| ?)+":

  • " " a leading space
  • (?: ... )+ one or more of the enclosed pattern, without "capturing" (if you omit ?:, you will get everything that is matched by this part in the result)
  • \| a literal |
  • [^\|]+ anything but a |, one or more times
  • : a literal colon
  • [^\|]+ anything but a |, one or more times
  • \| a literal |
  • " ?" an optional space
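The same pattern can be written with re.VERBOSE so that each piece carries its own comment. This is only a sketch of the pattern described above; the name SEPARATOR and the shortened sample text are made up for illustration. In verbose mode plain whitespace in the pattern is ignored, so the literal spaces are written as "[ ]".

```python
import re

# Same separator pattern as above, spelled out piece by piece.
# SEPARATOR is a hypothetical name used only in this sketch.
SEPARATOR = re.compile(
    r"""
    [ ]             # the space before the first label
    (?:             # non-capturing group, repeated one or more times
        \|          # a literal |
        [^|]+       # anything but a |, one or more times (e.g. A1)
        :           # a literal colon
        [^|]+       # anything but a |, one or more times (e.g. 1M)
        \|          # a literal |
        [ ]?        # an optional trailing space
    )+
    """,
    re.VERBOSE,
)

# Shortened sample input (an assumption, not the original data).
text = "first part |A1:1M| |A2:1S| second part. |A1:2S| |A3:1M|"
parts = SEPARATOR.split(text)
print(parts)  # ['first part', 'second part.', '']
```

As with the one-line version, the trailing empty string appears because the input ends with a separator.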

Because the input string ends with a separator, split() returns an empty string as the last element of the list. This behavior applies to both str.split() and re.split():

>>> "a,b,".split(",")
['a', 'b', '']
>>> re.split("[abc]", "1a2b3c")
['1', '2', '3', '']

To remove the empty string from the list, you can simply discard the last element with slicing:

>>> "a,b,".split(",")[:-1]
['a', 'b']
>>> re.split("[abc]", "1a2b3c")[:-1]
['1', '2', '3']
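Slicing with [:-1] assumes the input always ends with a separator. If that is not guaranteed, one alternative is to filter out empty pieces instead. The sketch below uses a hypothetical helper, split_nonempty, which is not part of the re module; note that it also drops empty pieces in the middle of the list, for example between two adjacent separators.

```python
import re

def split_nonempty(pattern, text):
    """Split text on pattern and drop empty pieces (hypothetical helper)."""
    return [piece for piece in re.split(pattern, text) if piece]

with_trailing = split_nonempty("[abc]", "1a2b3c")     # input ends with a separator
without_trailing = split_nonempty("[abc]", "1a2b3")   # input does not

print(with_trailing)     # ['1', '2', '3']
print(without_trailing)  # ['1', '2', '3']
```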
Jasper
  • Actually, my original data are not nicely formatted, so there may exist `|A1:1MG|`. So I changed your rule to `r" (?:\|.*\| ?)+"`, but I can only get one sentence and a space. I don't know why. – flyingmouse Jan 05 '16 at 15:18
  • @flyingmouse that is because `.*` matches all characters, including `|`, and the expression is greedy, so it grabs everything up to the last `|`. As in my example, use `[^\|]` to match any character other than `|`, or be more specific and match only digits and letters with `r" (?:\|[A-Z0-9]*:[A-Z0-9]*\| ?)+"` – Copperfield Jan 05 '16 at 15:36
  • My updated answer covers the more general cases now as well. – Jasper Jan 05 '16 at 16:55
  • Thank you @Jasper, but why is there an empty string in the result? Is it possible to remove it by modifying the `split()` call? Thanks a lot~ – flyingmouse Jan 06 '16 at 02:17
  • You can't fix it by modifying `split`, but it's easily done with slicing. See updated answer. – Jasper Jan 06 '16 at 11:20

You can use a regular expression to separate the string and then filter each substring accordingly; in this case re.split looks like the solution:

>>> import re
>>> test="""As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 inhibited by 30% DNA fragmentation |A1:1M| |A2:1S| |A4:1S| induced by 1 muM DNR and MXT. |A1:2S| |A2:2S| |A3:1M| |A4:2S| |A5:1M|"""
>>> re.split(r"(\|[^\|]+\|)",test)
['As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 inhibited by 30% DNA fragmentation ', '|A1:1M|', ' ', '|A2:1S|', ' ', '|A4:1S|', ' induced by 1 muM DNR and MXT. ', '|A1:2S|', ' ', '|A2:2S|', ' ', '|A3:1M|', ' ', '|A4:2S|', ' ', '|A5:1M|', '']
>>> temp=list(filter(lambda x: not x.startswith("|"),re.split(r"(\|[^\|]+\|)",test)))
>>> temp
['As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 inhibited by 30% DNA fragmentation ', ' ', ' ', ' induced by 1 muM DNR and MXT. ', ' ', ' ', ' ', ' ', '']
>>> resul=list(filter(bool,map(str.strip,temp)))
>>> resul
['As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or DiC8 inhibited by 30% DNA fragmentation', 'induced by 1 muM DNR and MXT.']

The pattern r"(\|[^\|]+\|)" matches a literal |, then everything after it that is not a |, then the closing |. Because the pattern is wrapped in a capturing group, re.split keeps each |...| marker in the result, in case that is of any use; otherwise Jasper's solution is better.
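Neither approach so far attaches the labels to the sub-sentences, which the question also asks for. A minimal sketch of one way to do that follows. The names SEGMENT, LABEL and split_with_labels are hypothetical, and the backfill step rests on an assumption read off the expected output in the question: an annotator who has no marker at a split point (A3 and A5 here) is given the label from the next marker group.

```python
import re

# A sub-sentence followed by its run of |Ax:label| markers.
SEGMENT = re.compile(r"\s*(.*?)\s*((?:\|[^|]+:[^|]+\|\s*)+)")
# A single marker, capturing the annotator and the label.
LABEL = re.compile(r"\|([^|:]+):([^|]+)\|")

def split_with_labels(text):
    """Return a list of (sub_sentence, {annotator: label}) pairs."""
    segments = []
    for match in SEGMENT.finditer(text):
        sentence, markers = match.groups()
        segments.append((sentence, dict(LABEL.findall(markers))))
    # Assumption: annotators missing from a group inherit the label
    # they gave in the following group (they did not split here).
    for i in range(len(segments) - 2, -1, -1):
        for annotator, label in segments[i + 1][1].items():
            segments[i][1].setdefault(annotator, label)
    return segments

test = ("As shown in Fig. 6, 1-h pretreatment of cells with 25 muM PhoCho or "
        "DiC8 inhibited by 30% DNA fragmentation |A1:1M| |A2:1S| |A4:1S| "
        "induced by 1 muM DNR and MXT. |A1:2S| |A2:2S| |A3:1M| |A4:2S| |A5:1M|")

result = split_with_labels(test)
for sentence, labels in result:
    # Print the labels in annotator order A1..A5.
    print(sentence, [labels[a] for a in sorted(labels)])
```

For the example sentence this gives the labels 1M, 1S, 1M, 1S, 1M for the first sub-sentence and 2S, 2S, 1M, 2S, 1M for the second, matching the output requested in the question. If the real data follows a different rule for missing annotators, only the backfill loop needs to change.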

Copperfield