Regex to split by colon and comma, except in some circumstances

Question

I have a comma-delimited string. However, some of the values, such as datetimes, themselves contain colons and commas. For example:

'CreationTime: 1/12/2021 6:21:07 PM, LastAccessTime: 1/12/2021 6:21:05 PM, LastWriteTime: 1/12/2021 6:21:05 PM, ChangeTime: 1/12/2021 6:21:07 PM, FileAttributes: N, AllocationSize: 4,096, EndOfFile: 3,115, NumberOfLinks: 1, DeletePending: False, Directory: False, IndexNumber: 0x3000000032494, EaSize: 0, Access: Generic Read, Position: 0, Mode: Sequential Access, Synchronous IO Non-Alert, AlignmentRequirement: Word'

Clearly, CreationTime: 1/12/2021 6:21:07 PM is the first item I want parsed. I also want to separate the header CreationTime from its attribute 1/12/2021 6:21:07 PM, and to do the same for every item: items are separated by a comma, and each header is separated from its attribute by a colon.

To make things more complicated, some headers have multiple attributes:

Mode: Sequential Access, Synchronous IO Non-Alert

So Sequential Access and Synchronous IO Non-Alert are two attributes that both belong to Mode:, and the comma between them must not be confused with the delimiter comma that precedes the next header, AlignmentRequirement:.

Question

How can I parse my example string so that it returns each header with its attribute (e.g. Mode and Sequential Access, Synchronous IO Non-Alert), given that there are colons and commas in the attributes themselves?

@buran, Nope. They are all in the string itself, directly preceding the semicolon. And I need to do this for thousands of different strings with different headers so I need to do this adaptively. — Jamie Dimon, Jan 17 '21 at 20:20
I couldn't understand the down vote, sorry. My answer gives you what you need. If you want something different, you should make it more explicit. — armamut, Jan 17 '21 at 20:41

score 1 · Accepted Answer · answered Jan 17 '21 at 20:46

Use re.split to split the string. Put parentheses around [A-Za-z]+ so that the matched part is returned as well (these matches will become the dictionary keys later). Remove the first element of the list (an empty string) using a slice: [1:]. Finally, convert the list, which now alternates keys and values, into a dictionary. There are several ways to do this; I chose to pair up consecutive elements using an iterator.

import re
s = 'CreationTime: 1/12/2021 6:21:07 PM, LastAccessTime: 1/12/2021 6:21:05 PM, LastWriteTime: 1/12/2021 6:21:05 PM, ChangeTime: 1/12/2021 6:21:07 PM, FileAttributes: N, AllocationSize: 4,096, EndOfFile: 3,115, NumberOfLinks: 1, DeletePending: False, Directory: False, IndexNumber: 0x3000000032494, EaSize: 0, Access: Generic Read, Position: 0, Mode: Sequential Access, Synchronous IO Non-Alert, AlignmentRequirement: Word'
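# The capture group keeps each header in the split result; [1:] drops the leading empty string.
it = iter(re.split(r'([A-Za-z]+):', s)[1:])
# zip(it, it) draws from the same iterator twice, pairing each header with the value that follows it.
dct = dict(zip(it, it))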
print(dct)
# {'CreationTime': ' 1/12/2021 6:21:07 PM, ', 'LastAccessTime': ' 1/12/2021 6:21:05 PM, ', 'LastWriteTime': ' 1/12/2021 6:21:05 PM, ', 'ChangeTime': ' 1/12/2021 6:21:07 PM, ', 'FileAttributes': ' N, ', 'AllocationSize': ' 4,096, ', 'EndOfFile': ' 3,115, ', 'NumberOfLinks': ' 1, ', 'DeletePending': ' False, ', 'Directory': ' False, ', 'IndexNumber': ' 0x3000000032494, ', 'EaSize': ' 0, ', 'Access': ' Generic Read, ', 'Position': ' 0, ', 'Mode': ' Sequential Access, Synchronous IO Non-Alert, ', 'AlignmentRequirement': ' Word'}
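As the printed dictionary shows, each value keeps its leading space and its trailing comma. A minimal sketch of one way to trim them, assuming no attribute legitimately begins or ends with a comma or a space; the shortened sample string and the name `cleaned` are illustrative and not part of the answer above:

```python
import re

# Shortened sample in the same "Header: value, " shape as the question's string.
s = 'AllocationSize: 4,096, Mode: Sequential Access, Synchronous IO Non-Alert, AlignmentRequirement: Word'

# The capturing group makes re.split keep each header in the result list.
parts = re.split(r'([A-Za-z]+):', s)[1:]

# Pair headers with values, trimming surrounding spaces and the trailing comma.
it = iter(parts)
cleaned = {key: value.strip(' ,') for key, value in zip(it, it)}
print(cleaned)
```

Commas inside a value, such as the one in 4,096, are untouched because str.strip only removes characters from the two ends.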

score 0 · Answer 2 · answered Jan 17 '21 at 20:30

I think this will work:

import re

t = 'CreationTime: 1/12/2021 6:21:07 PM, LastAccessTime: 1/12/2021 6:21:05 PM, LastWriteTime: 1/12/2021 6:21:05 PM, ChangeTime: 1/12/2021 6:21:07 PM, FileAttributes: N, AllocationSize: 4,096, EndOfFile: 3,115, NumberOfLinks: 1, DeletePending: False, Directory: False, IndexNumber: 0x3000000032494, EaSize: 0, Access: Generic Read, Position: 0, Mode: Sequential Access, Synchronous IO Non-Alert, AlignmentRequirement: Word'
# Split on ", " only where a "Header:" token follows (lookahead), then split each piece on its first ": "
dict([x.split(': ', maxsplit=1) for x in re.split(r', (?=[^\s:]+:)', t)])
>>>
{'CreationTime': '1/12/2021 6:21:07 PM',
 'LastAccessTime': '1/12/2021 6:21:05 PM',
 'LastWriteTime': '1/12/2021 6:21:05 PM',
 'ChangeTime': '1/12/2021 6:21:07 PM',
 'FileAttributes': 'N',
 'AllocationSize': '4,096',
 'EndOfFile': '3,115',
 'NumberOfLinks': '1',
 'DeletePending': 'False',
 'Directory': 'False',
 'IndexNumber': '0x3000000032494',
 'EaSize': '0',
 'Access': 'Generic Read',
 'Position': '0',
 'Mode': 'Sequential Access, Synchronous IO Non-Alert',
 'AlignmentRequirement': 'Word'}
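Since the question mentions thousands of strings with different headers, the lookahead split could be wrapped in a small reusable function. This is only a sketch: the names `FIELD_SEP`, `parse_record`, and `split_attributes` are hypothetical, and it assumes headers never contain spaces and that no attribute contains text shaped like ", Word:" (which would be mistaken for the start of a new field).

```python
import re

# Split on ", " only when the next token looks like "Header:"
# (a run of characters with no whitespace or colon, followed by a colon).
FIELD_SEP = re.compile(r', (?=[^\s:]+:)')

def parse_record(record):
    """Return a dict mapping each header to its raw attribute string."""
    fields = FIELD_SEP.split(record)
    return dict(field.split(': ', maxsplit=1) for field in fields)

def split_attributes(value):
    """Split a value on ', ' (comma plus space), so '4,096' stays whole."""
    return value.split(', ')

record = 'EndOfFile: 3,115, Mode: Sequential Access, Synchronous IO Non-Alert, AlignmentRequirement: Word'
parsed = parse_record(record)
print(parsed['Mode'])
print(split_attributes(parsed['Mode']))
```

Compiling the pattern once avoids repeating the regex literal for every string, and splitting values on comma plus space is one way to separate multiple attributes without breaking numbers such as 3,115.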

I really don't understand the downvote. Please upvote if you like. thx. — armamut, Jan 17 '21 at 20:57
I don't understand the downvote either. This solved my problem perfectly. My question got downvoted even though it has two correct answers, so obviously someone doesn't like regex very much. — Jamie Dimon, Jan 17 '21 at 22:47
