I'm working with text files output by Hadoop and Hive whose fields are delimited by control-A. I read each file line-by-line in Python, but the string split() function does not split the lines correctly, even when I specify the delimiter.

Here is some sample data that is typical of what I get from Hadoop. Note that ^A stands for the actual control character, not a caret followed by the letter A.

field1^Afield2^Afield3^Afield4
field5^Afield6^Afield7^Afield8

The Linux command-line tool cut handles the control code as a delimiter correctly. Here it outputs the third field:

bash> cat test.txt | cut -d $'\001' -f 3
field3
field7

I then wrote a Python function that reads the file line-by-line using the standard Python idiom:

import re

def read_file(filename):
    ''' Read file line-by-line and split. '''
    with open(filename, "r") as myfile:
        for line in myfile:
            tokens = line.split('\u0001')
            #tokens = line.split('\^A')
            #tokens = re.split('\^A', line)
            print 'len(tokens): %d, tokens[0]: %s\n' % (len(tokens), tokens[0])

However, when I run the function, the string is not split correctly. There should be four tokens in each line.

>>> read_file('test2.txt')
len(tokens): 1, tokens[0]: field1field2field3field4


len(tokens): 1, tokens[0]: field5field6field7field8

As you can see in my Python function, I tried three different approaches to splitting the string. None of them worked.

tokens = line.split('\u0001')
tokens = line.split('\^A')
tokens = re.split('\^A', line)

Thanks for any help.

Related questions (none had a working solution for me):

delimiting carat A in python

re.split not working on ^A

stackoverflowuser2010

1 Answer

Assuming that control-A is character "\x01" (ASCII code 1):

>>> line="field1\x01field2\x01field3\x01field4"
>>> line.split("\x01")
['field1', 'field2', 'field3', 'field4']
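
Applied to the file-reading loop from the question, the same split could look like the sketch below. The function name read_fields is hypothetical, and the sketch assumes each record ends with a newline that should be stripped before splitting.

```python
def read_fields(filename, delimiter="\x01"):
    """Yield each line of the file as a list of fields split on the delimiter."""
    with open(filename, "r") as infile:
        for line in infile:
            # Drop the trailing newline so it does not stick to the last field.
            yield line.rstrip("\n").split(delimiter)
```

Iterating over read_fields('test.txt') and taking index 2 of each list should give the same third field that cut -f 3 prints.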

In Python 2, the "\u0001" notation is only interpreted as an escape inside a unicode literal, so if you want to use it you need the 'u' prefix:

>>> line.split(u"\u0001")
[u'field1', u'field2', u'field3', u'field4']
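
The difference between the attempts in the question and the calls above can be checked by comparing the strings themselves. The sketch below uses raw strings to spell out what a Python 2 byte string contains for '\u0001' and '\^A', so that it behaves the same on Python 2 and Python 3. The variable names are only illustrative.

```python
line = u"field1\x01field2\x01field3\x01field4"

# Without the u prefix, Python 2 keeps '\u0001' as six literal characters.
literal_u = r"\u0001"
# '\^A' is a backslash, a caret, and the letter A, not a control character.
literal_caret = r"\^A"

print(len(literal_u))                   # 6
print(len(literal_caret))               # 3
print(len(line.split(literal_u)))       # 1, the separator never occurs in the line
print(len(line.split(literal_caret)))   # 1, same reason
print(u"\u0001" == u"\x01")             # True, both are the single control-A character
print(len(line.split(u"\u0001")))       # 4
```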
user803422