Hello all, thanks to the post "Using Python: to split long string, by given 'separators'", I learned a way to split a long string.

However, the separators are lost when the string is split:

import re

text = "C-603WallWizard45256CCCylinders:2HorizontalOpposedBore:1-1/42006Stroke:1-1/8Length: SingleVerticalBore:1-111Height:6Width:K-720Cooling:AirWeight:6LBS1.5H.P.@54500RPMC-60150ccGas2007EngineCylinder:4VerticalInline2008Bore:1Stroke:1Cycle:42007Weight:6-1/2LBSLength:10Width: :AirLength16Cooling:AirLength:5Width:4L-233Height:6Weight: 4TheBlackKnightc-609SteamEngineBore:11/16Stroke:11/162008Length:3Width:3Height:4TheChallengerC-600Bore:1Stroke:1P-305Weight:18LBSLength:12Width:7Height:8C-606Wall15ccGasEngineJ-142Cylinder:SingleVerticalBore:1Stroke:1-1/8Cooling:1Stroke:1-1/4HP:: /4Stroke:1-7/:6Width:6Height:92006Weight:4LBS1.75H.P.@65200RPM"

a = ['2006', '2007', '2008', '2009']

seperators = re.compile(r'|'.join(a))

e = seperators.split(text)

for f in e:
    print f

The result looks like this:

C-603WallWizard45256CCCylinders:2HorizontalOpposedBore:1-1/4   # '2006' is missing
Stroke:1-1/8Length: SingleVerticalBore:1-111Height:6Width:K-720Cooling:AirWeight:6LBS1.5H.P.@54500RPMC-60150ccGas   # '2007' is missing
EngineCylinder:4VerticalInline   # '2008' is missing
Bore:1Stroke:1Cycle:4   # '2007' is missing
Weight:6-1/2LBSLength:10Width: :AirLength16Cooling:AirLength:5Width:4L-233Height:6Weight: 4TheBlackKnightc-609SteamEngineBore:11/16Stroke:11/16   # '2008' is missing
Length:3Width:3Height:4TheChallengerC-600Bore:1Stroke:1P-305Weight:18LBSLength:12Width:7Height:8C-606Wall15ccGasEngineJ-142Cylinder:SingleVerticalBore:1Stroke:1-1/8Cooling:1Stroke:1-1/4HP:: /4Stroke:1-7/:6Width:6Height:9   # '2006' is missing
Weight:4LBS1.75H.P.@65200RPM   

I want to keep the separators when the string is split. One way I tried is to append a special marker to each separator and then split the long string on that marker ('@@@' below; I know it's not a smart way):

a = ['2006', '2007', '2008', '2009']

b = []

for eachone in a:
    b.append(eachone + '@@@')

my_dic = dict(zip(a, b))

for e, f in my_dic.iteritems():
    new_text = ''.join(text.replace(e, f))

However, some of the separators are not replaced in the resulting string. Why?

Also, is my approach to splitting the long string while keeping the separators more complicated than necessary? (I've checked other posts but, with my limited understanding, I couldn't find the answer.)

Thanks.

Mark K
  • What are you actually trying to achieve? The use of `re` rather than the much simpler `str.split` suggests that this is far more complex than the example. If you only have a single separator, then you know what it is and can just add it back in yourself after splitting... – sapi Sep 01 '14 at 08:37
  • thanks sapi. I want to keep the elements (in the list) in the long string, after I use them as separators to split the long string. – Mark K Sep 01 '14 at 08:43

2 Answers

Use capture groups in your regex:

seperators = re.compile(r'(' + r'|'.join(a) + r')')

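This way, the separators will be kept: when the pattern contains a capturing group, `split` also returns the text matched by the group as items of the resulting list.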
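As for why the `replace` loop in the question leaves some separators untouched: each pass calls `text.replace(...)` on the original `text`, so `new_text` is overwritten every time and only the replacement from the final pass survives (the `''.join(...)` around it does nothing to a string). A minimal sketch of the difference follows; the short `text` and the names `buggy` and `fixed` are made up for illustration, and a plain list is used instead of the dictionary.

```python
# Hypothetical short stand-in for the long string in the question.
text = "A2006B2007C2009D"
a = ['2006', '2007', '2008', '2009']

# Same shape as the loop in the question: every pass starts again from
# the original `text`, so earlier replacements are thrown away.
buggy = text
for sep in a:
    buggy = text.replace(sep, sep + '@@@')

# Carrying the result forward keeps every replacement.
fixed = text
for sep in a:
    fixed = fixed.replace(sep, sep + '@@@')

print(buggy)               # A2006B2007C2009@@@D
print(fixed)               # A2006@@@B2007@@@C2009@@@D
print(fixed.split('@@@'))  # ['A2006', 'B2007', 'C2009', 'D']
```

The marker approach works once the loop is fixed, but it assumes '@@@' never occurs in the data, which the capturing group avoids.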

Tim Zimmermann

If you use capturing groups in regex you will obtain the desired result:

seperators = re.compile(r'(%s)' % '|'.join(a))

Output

C-603WallWizard45256CCCylinders:2HorizontalOpposedBore:1-1/4
2006
Stroke:1-1/8Length: SingleVerticalBore:1-111Height:6Width:K-720Cooling:AirWeight:6LBS1.5H.P.@54500RPMC-60150ccGas
2007
EngineCylinder:4VerticalInline
2008
Bore:1Stroke:1Cycle:4
2007
Weight:6-1/2LBSLength:10Width: :AirLength16Cooling:AirLength:5Width:4L-233Height:6Weight: 4TheBlackKnightc-609SteamEngineBore:11/16Stroke:11/16
2008
Length:3Width:3Height:4TheChallengerC-600Bore:1Stroke:1P-305Weight:18LBSLength:12Width:7Height:8C-606Wall15ccGasEngineJ-142Cylinder:SingleVerticalBore:1Stroke:1-1/8Cooling:1Stroke:1-1/4HP:: /4Stroke:1-7/:6Width:6Height:9
2006
Weight:4LBS1.75H.P.@65200RPM
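The same behaviour on a shorter string, as a minimal sketch. The sample `text` is a made-up stand-in for the long string in the question, and the `re.escape` call is an added precaution (not needed for these purely numeric separators) in case a separator ever contains regex metacharacters.

```python
import re

# Hypothetical short stand-in for the long string in the question.
text = "Bore:1-1/42006Stroke:1-1/82007Weight:6LBS"
a = ['2006', '2007', '2008', '2009']

alternation = '|'.join(re.escape(s) for s in a)

# Without a capturing group the separators are dropped.
plain = re.compile(alternation).split(text)

# With a capturing group the matched separators appear in the list,
# each one right after the chunk it terminated.
kept = re.compile('(%s)' % alternation).split(text)

print(plain)  # ['Bore:1-1/4', 'Stroke:1-1/8', 'Weight:6LBS']
print(kept)   # ['Bore:1-1/4', '2006', 'Stroke:1-1/8', '2007', 'Weight:6LBS']
```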

If you instead want each delimiter kept at the end of the preceding chunk, use `findall` rather than `split`:

seperators = re.compile(r'.*?(?:%s|$)' % '|'.join(a))
e = seperators.findall(text)

Output

C-603WallWizard45256CCCylinders:2HorizontalOpposedBore:1-1/42006
Stroke:1-1/8Length: SingleVerticalBore:1-111Height:6Width:K-720Cooling:AirWeight:6LBS1.5H.P.@54500RPMC-60150ccGas2007
EngineCylinder:4VerticalInline2008
Bore:1Stroke:1Cycle:42007
Weight:6-1/2LBSLength:10Width: :AirLength16Cooling:AirLength:5Width:4L-233Height:6Weight: 4TheBlackKnightc-609SteamEngineBore:11/16Stroke:11/162008
Length:3Width:3Height:4TheChallengerC-600Bore:1Stroke:1P-305Weight:18LBSLength:12Width:7Height:8C-606Wall15ccGasEngineJ-142Cylinder:SingleVerticalBore:1Stroke:1-1/8Cooling:1Stroke:1-1/4HP:: /4Stroke:1-7/:6Width:6Height:92006
Weight:4LBS1.75H.P.@65200RPM
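One detail worth knowing about this pattern: because `.*?` may match nothing and `$` matches at the end of the string, `findall` also returns an empty string after the last chunk. A minimal sketch that filters it out, again on a made-up short `text` and with `re.escape` added as a precaution:

```python
import re

# Hypothetical short stand-in for the long string in the question.
text = "Bore:1-1/42006Stroke:1-1/82007Weight:6LBS"
a = ['2006', '2007', '2008', '2009']

# Lazily consume characters up to and including the next separator,
# or up to the end of the string for the final chunk.
pattern = re.compile(r'.*?(?:%s|$)' % '|'.join(re.escape(s) for s in a))

# Drop the empty match that findall produces at the end of the string.
chunks = [c for c in pattern.findall(text) if c]

for chunk in chunks:
    print(chunk)
```

Note that `.` does not match newlines by default, so text spanning several lines would need the `re.DOTALL` flag.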
enrico.bacis
  • thanks, enrico.bacis. is there a way to have them together for example "C-603WallWizard45256CCCylinders:2HorizontalOpposedBore:1-1/42006" instead of in 2 lines? – Mark K Sep 01 '14 at 08:59
  • thanks again, enrico.bacis. it's amazing! for what chapter/contents I shall learn to fully understand it? – Mark K Sep 01 '14 at 09:11
  • @MarkK If you read the full [`re`](https://docs.python.org/2/library/re.html) documentation and the [`HOWTO`](https://docs.python.org/2/howto/regex.html) you should gain a sound understanding of how regex works. Mark the answer as accepted if your problem is solved. – enrico.bacis Sep 01 '14 at 09:15
  • enrico.bacis, you a sexy king. – Mark K Sep 01 '14 at 09:18