4

Possible Duplicate:
Python: Split string with multiple delimiters

Can I do something similar in Python?

Split method in VB.net:

Dim line As String = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"
Dim separators() As String = {"Tech ID:", "Name:", "Account #:"}
Dim result() As String
result = line.Split(separators, StringSplitOptions.RemoveEmptyEntries)
fpena06

3 Answers

2

Given a bad data format like this, you could try re.split():

>>> import re
>>> mystring = "Field 1: Data 1 Field 2: Data 2 Field 3: Data 3"
>>> a = re.split(r"(Field 1:|Field 2:|Field 3:)", mystring)
>>> a
['', 'Field 1:', ' Data 1 ', 'Field 2:', ' Data 2 ', 'Field 3:', ' Data 3']

Your job would be much easier if the data were sanely formatted, with quoted strings and comma-separated records. You could then parse it with the csv module, which reads comma-separated value files.
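For illustration only, here is a minimal sketch of what that could look like. The header row and the quoted record below are a hypothetical, well-formed version of the data, not something the original input provides.

```python
import csv
import io

# Hypothetical, well-formed version of the record: a header row
# followed by quoted, comma-separated values.
data = 'Tech ID,Name,Account #\n"xxxxxxxxxx","DOE, JOHN","xxxxxxxx"\n'

# DictReader takes the field names from the first row.
# The quotes keep the comma inside "DOE, JOHN" from splitting the field.
rows = list(csv.DictReader(io.StringIO(data)))
print(rows[0]["Name"])
```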

Edit:

You can filter out the blank entries with a list comprehension.

>>> a_non_empty = [s for s in a if s]
>>> a_non_empty
['Field 1:', ' Data 1 ', 'Field 2:', ' Data 2 ', 'Field 3:', ' Data 3']
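If you want labels paired with values, one possible follow-up is sketched below. It assumes the string starts with a label and that every label has a non-empty value; otherwise the filtered list no longer alternates label, value and the pairing goes wrong. The names `parts`, `labels`, `values` and `record` are only illustrative.

```python
import re

mystring = "Field 1: Data 1 Field 2: Data 2 Field 3: Data 3"

# Capturing split, then drop the empty leading entry.
parts = [s for s in re.split(r"(Field 1:|Field 2:|Field 3:)", mystring) if s]

# parts alternates label, value, label, value, ...
labels = [p.rstrip(":") for p in parts[0::2]]   # drop the trailing colon
values = [p.strip() for p in parts[1::2]]       # drop surrounding whitespace

record = dict(zip(labels, values))
print(record)
```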
Li-aung Yip
1
>>> import re
>>> str = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"
>>> re.split("Tech ID:|Name:|Account #:",str)
['', ' xxxxxxxxxx ', ' DOE, JOHN ', ' xxxxxxxx']
codaddict
  • Why do the `split` tokens themselves not appear in your output? Python 2 vs. Python 3 difference? – Li-aung Yip May 03 '12 at 06:10
  • That's a good question. I didn't catch that. – fpena06 May 03 '12 at 06:15
  • @Li-aungYip: :) Not really, it has nothing to do with the Python version. I just did not enclose the pattern in `(...)`, so the delimiters were not captured. – codaddict May 03 '12 at 06:19
  • Just one small thing, you may not want to call your variable `str` since it is the name of a `builtin` – jamylak May 03 '12 at 06:25
  • Ah, I'm silly. I didn't realise that the split pattern would actually allow capturing. – Li-aung Yip May 03 '12 at 06:38
  • How can I make sure the output elements have no leading or trailing spaces? For example, instead of ' DOE, JOHN ' I want 'DOE,JOHN'. I'm having a hard time trying to use .strip or .rstrip. – fpena06 May 03 '12 at 07:06
  • I managed to get it done this way. Is there a better way? `while workOrders > 0: line = ins.readline() array.append(re.split("Company:|Customer Information IV Retest Enforced:|Field 3:",line)[1]) array.append(re.split("Company:|Customer Information IV Retest Enforced:|Field 3:",line)[-1]) for x in array: x = x.strip(' \t\n\r') print x workOrders -=1` – fpena06 May 03 '12 at 07:23
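On the whitespace question in the comments above, a small sketch of one way to do it, using the sample line from the question. Note that `x = x.strip()` inside a `for x in array` loop only rebinds the loop variable and leaves the list unchanged, so it is usually simpler to build a new list. Also, `strip()` only removes whitespace at the ends, so the space inside 'DOE, JOHN' stays.

```python
import re

line = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"

# Non-capturing split: the delimiters are dropped from the result.
pieces = re.split(r"Tech ID:|Name:|Account #:", line)

# Strip each piece and skip the ones that are empty after stripping.
fields = [p.strip() for p in pieces if p.strip()]
print(fields)  # ['xxxxxxxxxx', 'DOE, JOHN', 'xxxxxxxx']
```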
0

I would suggest a different approach:

>>> import re
>>> subject = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"
>>> regex = re.compile(r"(Tech ID|Name|Account #):\s*(.*?)\s*(?=Tech ID:|Name:|Account #:|$)")
>>> dict(regex.findall(subject))
{'Tech ID': 'xxxxxxxxxx', 'Name': 'DOE, JOHN', 'Account #': 'xxxxxxxx'}

That way you get a useful data structure for this kind of data: a dictionary.

As a commented regex:

regex = re.compile(
    r"""(?x)                         # Verbose regex:
    (Tech\ ID|Name|Account\ \#)      # Match identifier
    :                                # Match a colon
    \s*                              # Match optional whitespace
    (.*?)                            # Match any number of characters, as few as possible
    \s*                              # Match optional whitespace
    (?=                              # Assert that the following can be matched:
     Tech\ ID:|Name:|Account\ \#:    # The next identifier
     |$                              # or the end of the string
    )                                # End of lookahead assertion""")
Tim Pietzcker
  • This doesn't seem like a good approach to me since you are repeating the identifiers. – jamylak May 03 '12 at 06:38
  • @jamylak: I know but how else would you be able to tell when the value has ended? It would be much better of course if you could preserve the delimiters but that doesn't seem to be an option. – Tim Pietzcker May 03 '12 at 06:44
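One way to reduce the repetition raised in the comments, sketched here as a variation rather than as anything the answer proposes, is to build the pattern from a single list of identifiers. The `identifiers` list and the use of `re.escape` are illustrative choices; the regex itself has the same shape as the one in the answer.

```python
import re

subject = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"
identifiers = ["Tech ID", "Name", "Account #"]  # single source of truth

# Escape each identifier so characters like '#' are matched literally.
alternation = "|".join(re.escape(name) for name in identifiers)

# Same structure as the answer's regex: identifier, colon, lazily matched
# value, then a lookahead for the next identifier or the end of the string.
pattern = re.compile(
    r"({0}):\s*(.*?)\s*(?=(?:{0}):|$)".format(alternation)
)

record = dict(pattern.findall(subject))
print(record)
```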