
I am trying to replace some characters in my Hive output so that Pandas can read it properly as a DataFrame.

The first thing I tried was:

import gzip

# Read the whole compressed file into memory at once
f2 = gzip.open(local_path, 'rb')
table = f2.read()
f2.close()

# Swap Hive's field delimiter and null marker for tab and NULL
table = table.replace('\x01', '\t')
table = table.replace('\\N', 'NULL')

# Write the modified contents back over the original file
f = gzip.open(local_path, 'wb')
f.write(table)  # <----- ERROR
f.close()

But this failed at the point marked above with "OverflowError: size does not fit in an int". My next thought was to process the file line by line instead:

import gzip
import os

input_file = gzip.open(local_path, 'rb')
output_file = gzip.open(output_path, 'wb')
for line in input_file:
    # Replace the Hive delimiter and null marker one line at a time
    line = line.replace('\x01', '\t')
    line = line.replace('\\N', 'NULL')
    output_file.write(line)
output_file.close()
input_file.close()
os.rename(output_path, local_path)

but I am worried that it would be very slow. Is there a better way to do it?

If it is relevant to the solution, this is so that I can call

return pd.read_table(local_path, compression='gzip')

Pandas has a terrible time handling the Hive output characters, so the replacement needs to be done explicitly beforehand.

  • My experience with big files is that treating them line by line is not much slower, and it avoids any memory issue. Alternatively, you can read it by "chunks" of bytes, using `f2.read(1000000)` (1 MB) in a loop for instance, and break once it is empty. – JulienD Aug 05 '14 at 10:27
  • I thought about the chunks idea but was concerned that the edges might cut my characters in half so that replace() would not recognize them. Is there a way to do it in chunks of lines? – Keith Aug 05 '14 at 10:59
  • You can read a chunk of 10 MB, split it on the end-of-line characters, do the replacement in every element of the split except the last one, and fuse the last one with the first element of the next chunk's split; see the sketch below. (To me, a bit complicated for little gain; better to do it one line after another.) – JulienD Aug 05 '14 at 11:31
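
A minimal sketch of the chunked approach JulienD describes, assuming Python 2 string handling to match the question (the 10 MB chunk size is illustrative): read fixed-size blocks, hold back everything after the last newline, and prepend it to the next block, so a token such as \N is never split across a chunk boundary.

import gzip

CHUNK_SIZE = 10 * 1024 * 1024  # illustrative 10 MB

with gzip.open(local_path, 'rb') as src, gzip.open(output_path, 'wb') as dst:
    tail = ''
    while True:
        chunk = src.read(CHUNK_SIZE)
        if not chunk:
            break
        # Carry the partial last line over to the next chunk so no
        # delimiter or null marker is cut in half at the boundary
        body, _, tail = (tail + chunk).rpartition('\n')
        if body:
            dst.write(body.replace('\x01', '\t').replace('\\N', 'NULL') + '\n')
    if tail:
        # Flush whatever is left after the final chunk
        dst.write(tail.replace('\x01', '\t').replace('\\N', 'NULL'))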

1 Answer


As it turns out, Pandas actually handles the Hive output wonderfully if you specify both the na_values and the separator:

import pandas as pd

df = pd.read_table(local_path, compression='gzip', na_values='\\N', sep='\x01')

The only remaining issue is saving in a compressed format. The standard approach would be a pickle:

df.to_pickle(output_path)
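
To load it back later, `pd.read_pickle` is the matching one-liner:

df = pd.read_pickle(output_path)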

If you run into this problem: Pickling a DataFrame, then you will have to save it as a plain, larger file instead:

df.to_csv(output_path)
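
If you still want the result compressed on disk, one option (a sketch using only the standard library rather than any pandas-specific flag) is to write the CSV into a gzip file handle yourself:

import gzip

# to_csv accepts an open file object, so the CSV can be streamed
# straight into a gzip container and read back with compression='gzip'
with gzip.open(output_path, 'wb') as f:
    df.to_csv(f)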