I'm working with text output files from Hadoop and Hive whose fields are delimited by control-A. I read each file line by line in Python, but the string split() function does not split the lines correctly, even when I specify the delimiter.
Here is some sample data typical of what I get from Hadoop. Note that ^A stands for the actual control character (byte 0x01), not a caret followed by an A.
field1^Afield2^Afield3^Afield4
field5^Afield6^Afield7^Afield8
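Because the control character is invisible and usually does not survive copy and paste, the sample file may be easier to reproduce from code. The following is only a sketch: the helper name write_sample and the temporary path are assumptions for illustration, not part of my actual setup.

```python
import os
import tempfile

def write_sample(path):
    """Write two sample rows whose fields are joined by control-A (byte 0x01)."""
    rows = [
        ['field1', 'field2', 'field3', 'field4'],
        ['field5', 'field6', 'field7', 'field8'],
    ]
    with open(path, 'w') as out:
        for row in rows:
            # '\x01' is the control-A character, the same byte as $'\001' in bash
            out.write('\x01'.join(row) + '\n')

# Hypothetical location: a fresh temporary directory, so no existing file is overwritten
sample_path = os.path.join(tempfile.mkdtemp(), 'test.txt')
write_sample(sample_path)
```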
The Linux command-line tool cut works when given the control code as its delimiter. Here it outputs the third field:
bash> cat test.txt | cut -d $'\001' -f 3
field3
field7
I then wrote a Python function that reads the file line-by-line using the standard Python idiom:
import re

def read_file(filename):
    ''' Read file line-by-line and split. '''
    with open(filename, "r") as myfile:
        for line in myfile:
            tokens = line.split('\u0001')
            #tokens = line.split('\^A')
            #tokens = re.split('\^A', line)
            print 'len(tokens): %d, tokens[0]: %s\n' % (len(tokens), tokens[0])
However, when I run the function, the string is not split correctly. There should be four tokens in each line.
>>> read_file('test2.txt')
len(tokens): 1, tokens[0]: field1field2field3field4
len(tokens): 1, tokens[0]: field5field6field7field8
As you can see in my Python function, I tried three different approaches to splitting the string. None of them worked.
tokens = line.split('\u0001')
tokens = line.split('\^A')
tokens = re.split('\^A', line)
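For comparison, here is a minimal sketch of splitting on the raw control-A byte written as '\x01', the same byte that $'\001' denotes in the cut command. It assumes Python 2 (which the print statement suggests), where '\u0001' in a plain string literal is not an escape and stays as six literal characters, and where '\^A' never denotes the control character. The helper name split_fields is hypothetical.

```python
# Sketch only: split_fields is a hypothetical helper, not the original read_file.
DELIMITER = '\x01'  # control-A (SOH), written as a hex escape

def split_fields(line, delimiter=DELIMITER):
    """Strip the trailing newline, then split one line on the delimiter."""
    return line.rstrip('\n').split(delimiter)

# One line shaped like the sample data, with real control-A bytes between fields
sample = 'field1\x01field2\x01field3\x01field4\n'
tokens = split_fields(sample)
print(len(tokens))  # 4
print(tokens[2])    # field3
```

The hex escape '\x01' is interpreted the same way in plain string literals in both Python 2 and Python 3, which is why the sketch uses it rather than '\u0001'.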
Thanks for any help.
Related questions (none had a working solution for me):