re.split not working on ^A

Question

I'm trying to parse lines of input that look like

8=FIX.4.2^A9=0126^A35=0^A34=000742599^A49=L3Q206N^A50=2J6L^A52=20130620-11:16:27.344^A369=000733325^A56=CME^A57=G^A142=US,IL^A1603=OMS2^A1604=0.1^A

where you have different fields of data separated by ^A. I'm trying to get at the individual data fields (like 8=FIX.4.2, 9=0126, 35=0, etc). The problem is that Python sometimes seems to interpret ^A as a single character (in vim this is typed as ctrl-v, ctrl-a) and sometimes as the two-character string '^A'. So I have tried doing

entries = re.split('^A|^A', str(line))

but later when I do

for entry in entries:
    print entry

I just end up with the original string, with nothing split. Is this a problem with re.split?

I suspect it's not the case that Python sometimes interprets `^A` as a single character (control-A) and other times as a string (caret A) - I would take a careful look at the data it is reading, as it's more likely that got munged somehow... — twalberg, Jun 25 '13 at 14:57
I think that was the case. My original source file is huge, so I copied some lines into a different file for testing. I think they were \x01 in the original file and then converted to text in my test file. — swiz, Jun 25 '13 at 15:22

scarfboy · Answer 1 · 2013-06-25T15:12:33.520

5

Depends on what that line contains.

If you want to split on the two-character string '^A', escape the ^, which is special in regular expressions; here that probably means the pattern r'\^A'.

More likely, though, this is caret notation for the single character with byte value 0x01, in which case you probably want to split on '\x01'.

(You might as well use the string's own split() method; I'm guessing it's faster than a regexp for something this simple.)
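
As a rough sketch of the two cases above, the snippet below splits on either form of the delimiter. The helper name `split_fields` and the shortened sample lines are made up for illustration, and the code is written to run on Python 3; it is not necessarily what the asker's real data needs.

```python
import re

def split_fields(line):
    """Split on a real 0x01 character or on the literal two-character text '^A'."""
    # In a raw string, \x01 is interpreted by the re module as the control
    # character, and \^ is a literal caret rather than a start-of-string anchor.
    parts = re.split(r'\x01|\^A', line)
    # A trailing delimiter leaves an empty final piece; drop empty pieces.
    return [p for p in parts if p]

real = '8=FIX.4.2\x019=0126\x0135=0\x01'   # delimiter is the control character
pasted = '8=FIX.4.2^A9=0126^A35=0^A'       # delimiter was converted to text

print(split_fields(real))
print(split_fields(pasted))
```

If the data only ever contains the control character, `line.split('\x01')` does the same job without a regular expression.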

edited Jun 25 '13 at 15:12

answered Jun 25 '13 at 15:07


Yeah I originally used the split() function but changed to re.split() when I thought I had multiple delimiters. These files come from an external source so I am not 100% sure what the character is supposed to be but in bash they show up in a different color from standard text. – swiz Jun 25 '13 at 15:12
1

Each shell utility may have its own serialize-as-safe-text method, there are a few. It may help to see it in python's own representation, e.g. with `print repr(line)` – scarfboy Jun 25 '13 at 15:15
1

And yeah, if there are two-character `'^A'` things in your lines, it got mangled before python (or is actual data and not this separator, but that seems odd). Python doesn't interpret caret notation while reading data from files - or as far as I know, at all. – scarfboy Jun 25 '13 at 15:29
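
The `repr` suggestion in the comments above can be tried on two hand-made strings, one with the real control character and one with the two-character text. This is only a sketch with invented sample values, written with Python 3's `print()`; on Python 2 the `print repr(line)` form from the comment is equivalent.

```python
real = '8=FIX.4.2\x019=0126\x01'   # single control character as delimiter
pasted = '8=FIX.4.2^A9=0126^A'     # caret followed by a capital A

# repr() shows non-printable characters as escapes, so the two cases
# can be told apart regardless of how a terminal or editor colors them.
print(repr(real))
print(repr(pasted))

# The control character is one character long; the text form is two.
print(len('\x01'), len('^A'))
```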

score 4 · Answer 2 · answered Jun 25 '13 at 14:44

^ has a special meaning in regular expressions, so you should escape it first.

>>> strs = "8=FIX.4.2^A9=0126^A35=0^A34=000742599^A49=L3Q206N^A50=2J6L^A52=20130620-11:16:27.344^A369=000733325^A56=CME^A57=G^A142=US,IL^A1603=OMS2^A1604=0.1^A"
>>> import re
>>> re.split(r'\^A', strs)
['8=FIX.4.2', '9=0126', '35=0', '34=000742599', '49=L3Q206N', '50=2J6L', '52=20130620-11:16:27.344', '369=000733325', '56=CME', '57=G', '142=US,IL', '1603=OMS2', '1604=0.1', '']

From docs:

'^' : (Caret.) Matches the start of the string, and in MULTILINE mode also
               matches immediately after each newline.
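
To see the anchor behavior quoted from the docs, here is a small sketch using a shortened sample line (the sample is an assumption, not the asker's real data). It compares the unescaped pattern from the question with an escaped one.

```python
import re

line = '8=FIX.4.2^A9=0126^A35=0^A'

# Unescaped, '^A' means "an 'A' at the very start of the string".
# The line starts with '8', so nothing matches and nothing is split.
unsplit = re.split('^A|^A', line)
print(unsplit)        # ['8=FIX.4.2^A9=0126^A35=0^A']

# Escaped, the caret is an ordinary character.
fields = re.split(r'\^A', line)
print(fields)         # ['8=FIX.4.2', '9=0126', '35=0', '']

# re.escape builds the escaped pattern from the literal text.
escaped = re.escape('^A')
print(re.split(escaped, line) == fields)
```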

Martijn Pieters · Accepted Answer · 2013-06-25T15:08:58.367

^ is a metacharacter; unescaped, it matches only at the start of a string. Escape it:

>>> re.split(r'\^A', line)
['8=FIX.4.2', '9=0126', '35=0', '34=000742599', '49=L3Q206N', '50=2J6L', '52=20130620-11:16:27.344', '369=000733325', '56=CME', '57=G', '142=US,IL', '1603=OMS2', '1604=0.1', '']

There is no need to use a | in your expression, especially not when both 'alternate' strings are the same.

It appears, however, that you have the \x01 control character (SOH, which caret notation displays as ^A), not the two-character ^A string. Just use .split() to split on that value; there is no need for a regular expression:

>>> line = line.replace('^A', '\x01')
>>> line
'8=FIX.4.2\x019=0126\x0135=0\x0134=000742599\x0149=L3Q206N\x0150=2J6L\x0152=20130620-11:16:27.344\x01369=000733325\x0156=CME\x0157=G\x01142=US,IL\x011603=OMS2\x011604=0.1\x01'
>>> line.split('\x01')
['8=FIX.4.2', '9=0126', '35=0', '34=000742599', '49=L3Q206N', '50=2J6L', '52=20130620-11:16:27.344', '369=000733325', '56=CME', '57=G', '142=US,IL', '1603=OMS2', '1604=0.1', '']
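
Going one step past the split, a minimal sketch for getting at the individual fields might look like the following. It assumes the delimiter really is the control character '\x01' and that every field has the form tag=value; the function name `parse_fix_fields` and the short sample line are hypothetical.

```python
def parse_fix_fields(line, sep='\x01'):
    """Return (tag, value) pairs from one delimiter-separated line."""
    pairs = []
    for field in line.split(sep):
        if not field:
            continue  # skip the empty piece left by the trailing delimiter
        # partition splits on the first '=' only, so values may contain '='.
        tag, _, value = field.partition('=')
        pairs.append((tag, value))
    return pairs

sample = '8=FIX.4.2\x019=0126\x0135=0\x01'
print(parse_fix_fields(sample))
```

A list of pairs is used rather than a dict so that a tag appearing more than once in a line is not silently overwritten.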

Thanks for your help. The problem is that '^A' the string and ^A the character show up in different colors on vim. In my original file (which is ~150 MB) they are the metacharacter type. I created a smaller test file by copy-pasting some lines into a different file, where they show up as ordinary text. Doing \^A effectively parses the test file, but in the real file the lines are still not splitting, and when I do logging.info(entry) I just get the entire line back. — swiz, Jun 25 '13 at 15:07
@swiz: in that case you have the `\x01` control character, ASCII codepoint 1. Split on that. — Martijn Pieters, Jun 25 '13 at 15:07
