I am trying to replace some characters in my Hive output so that Pandas can read it properly as a DataFrame.
The first thing I tried was:
f2 = gzip.open(local_path, 'rb')
table = f2.read()
f2.close()
table = table.replace('\x01', '\t')
table = table.replace('\\N', 'NULL')
f = gzip.open(local_path, 'wb')
f.write(table)  # <----- ERROR
f.close()
But this failed at the line marked above with `OverflowError: size does not fit in an int`. My next thought was to process the file line by line instead:
input_file = gzip.open(local_path, 'rb')
output_file = gzip.open(output_path, 'wb')
for line in input_file:
    line = line.replace('\x01', '\t')
    line = line.replace('\\N', 'NULL')
    output_file.write(line)
output_file.close()
input_file.close()
os.rename(output_path, local_path)
but I am worried that it will be very slow. Is there a better way to do this?
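For what it's worth, a cleaned-up version of the line-by-line approach above does work in Python 3, as long as the replacement patterns are bytes literals to match what `gzip.open(..., 'rb')` yields. This is a sketch wrapped in a hypothetical `rewrite_hive_dump` helper, not your original function:

```python
import gzip
import os

def rewrite_hive_dump(local_path, output_path):
    r"""Stream-rewrite a gzipped Hive dump: \x01 -> tab, \N -> NULL."""
    # gzip files opened in binary mode yield bytes in Python 3,
    # so the search/replace patterns must be bytes literals too.
    with gzip.open(local_path, 'rb') as input_file, \
         gzip.open(output_path, 'wb') as output_file:
        for line in input_file:
            line = line.replace(b'\x01', b'\t')
            line = line.replace(b'\\N', b'NULL')
            output_file.write(line)
    # Replace the original file with the rewritten one.
    os.rename(output_path, local_path)
```

Iterating over the file streams it line by line, so memory stays flat regardless of file size; the dominant cost is the gzip decompression/recompression, not the loop itself.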
If it is relevant to the solution, this is so that I can call
return pd.read_table(local_path, compression='gzip')
Pandas has a terrible time handling the Hive output characters, so the replacement needs to happen explicitly beforehand.
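It may also be possible to skip the rewrite pass entirely by telling pandas about Hive's conventions directly: `\x01` is Hive's default field delimiter and `\N` its NULL marker, and `read_table` accepts both a custom `sep` and extra `na_values`. A self-contained sketch (the sample file and `local_path` here are stand-ins for your real dump; verify this against your pandas version):

```python
import gzip
import pandas as pd

# Hypothetical sample file standing in for the real Hive dump.
local_path = 'hive_sample.gz'
with gzip.open(local_path, 'wb') as f:
    f.write(b'1\x01foo\n2\x01\\N\n')

# Parse the Hive dump directly: \x01 as the field separator,
# literal \N treated as NA -- no rewrite of the file needed.
df = pd.read_table(local_path,
                   sep='\x01',
                   na_values=[r'\N'],
                   header=None,
                   compression='gzip')
```

If this works for your data, it avoids decompressing and recompressing the file just to swap delimiters.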