Python read text file and split on control character

Question

I'm working with output text files from Hadoop and Hive where the files have fields delimited by control-A. I'm then using Python to read the file line-by-line, but the string split() function is not splitting correctly even when I specify the delimiter.

Here is some sample data that is typical of what I get from Hadoop. Note that ^A is actually a control character.

field1^Afield2^Afield3^Afield4
field5^Afield6^Afield7^Afield8

The Linux command-line tool cut splits the file correctly when given the control code as its delimiter. Here it outputs the third field:

bash> cat test.txt | cut -d $'\001' -f 3
field3
field7

I then wrote a Python function that reads the file line-by-line using the standard Python idiom:

import re

def read_file(filename):
    ''' Read file line-by-line and split. '''
    with open(filename, "r") as myfile:
        for line in myfile:
            tokens = line.split('\u0001')
            #tokens = line.split('\^A')
            #tokens = re.split('\^A', line)
            print 'len(tokens): %d, tokens[0]: %s\n' % (len(tokens), tokens[0])

However, when I run the function, the string is not split correctly. There should be four tokens in each line.

>>> read_file('test2.txt')
len(tokens): 1, tokens[0]: field1field2field3field4


len(tokens): 1, tokens[0]: field5field6field7field8

As you can see in my Python function, I tried three different approaches to splitting the string. None of them worked.

tokens = line.split('\u0001')
tokens = line.split('\^A')
tokens = re.split('\^A', line)

Thanks for any help.

Related questions (none had a working solution for me):

delimiting carat A in python

re.split not working on ^A

Have you tried just `line.split('^A')` without the escape? Is your file encoded in another format by any chance? What does `print(line)` output? — r.ook, Oct 30 '18 at 19:51
please print a hexdump of line, so that we know exactly which character is control-A. — user803422, Oct 30 '18 at 19:54
Instead of doing `print` to debug in this instance, try doing `print(repr(tokens))`. That will show what the actual value of your string is — Woody1193, Oct 30 '18 at 19:56
@Woody1193: When I do `print(repr(line))`, it prints: `'field1\x01field2\x01field3\x01field4\n'`. — stackoverflowuser2010, Oct 30 '18 at 19:58
@stackoverflowuser2010 In that case, `\x01` is your delimiter — Woody1193, Oct 30 '18 at 20:03

user803422 · Accepted Answer · 2018-10-30T20:00:05.100

2

Assuming that control-A is character "\x01" (ASCII code 1):

>>> line="field1\x01field2\x01field3\x01field4"
>>> line.split("\x01")
['field1', 'field2', 'field3', 'field4']
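Applied to the file-reading loop from the question, the same split might look like the sketch below. The helper name `split_fields`, the `DELIMITER` constant, and the UTF-8 encoding are assumptions made for illustration; they are not part of the original code. The trailing newline is stripped first so that it does not end up in the last field.

```python
import io

DELIMITER = "\x01"  # control-A (ASCII code 1)


def split_fields(line, delimiter=DELIMITER):
    """Strip the trailing newline, then split one record into its fields."""
    return line.rstrip("\n").split(delimiter)


def read_file(filename):
    """Read a control-A delimited file and return one list of fields per line.

    The encoding is an assumption; adjust it to match the actual file.
    """
    with io.open(filename, "r", encoding="utf-8") as myfile:
        return [split_fields(line) for line in myfile]
```

If test.txt holds the two sample lines from the question, `read_file("test.txt")[0][2]` should give the same third field that `cut` printed.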

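If you want to use the "\u0001" notation in Python 2, the literal needs the 'u' prefix. Without it, '\u0001' is a byte string in which \u is not an escape sequence, so it holds the six literal characters \, u, 0, 0, 0, 1 rather than control-A: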

>>> line.split(u"\u0001")
[u'field1', u'field2', u'field3', u'field4']
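The difference between the two notations can be illustrated with a small sketch. It assumes Python 3, where every string literal is Unicode, and uses a raw string to imitate what the unprefixed Python 2 literal contained; the variable names are chosen for illustration only.

```python
# Python 3: both escapes name the same single character.
ctrl_a = "\x01"
same_char = ("\u0001" == ctrl_a)

# A raw string keeps the backslash, imitating the Python 2 byte-string
# literal '\u0001': six ordinary characters, not one control character.
literal = r"\u0001"

line = "field1\x01field2\x01field3\x01field4"
unsplit = line.split(literal)  # delimiter never occurs in the line
fields = line.split(ctrl_a)    # splits on the real control-A

print(same_char, len(literal))
print(len(unsplit), len(fields))
```

Splitting on the six-character text leaves the line in one piece, which matches the `len(tokens): 1` output in the question, while splitting on the real control character yields four fields.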

edited Oct 30 '18 at 20:00

answered Oct 30 '18 at 19:52


Thank you! This line works: `tokens = line.split('\x01')`. But I wonder why `tokens = line.split('\u0001')` does not work? – stackoverflowuser2010 Oct 30 '18 at 19:56
Thanks again. `tokens = line.split(u'\u0001')` produces the correct answer as well. – stackoverflowuser2010 Oct 30 '18 at 20:17
