
I am reading the first line of every file in a directory. Locally this works fine, but on EMR the test gets stuck at around the 200th-300th file. Also, ps -eLF shows the number of child threads growing to 3000, even though the loop has only printed around 200 files.

Is this some bug on EMR related to reading a maximum number of bytes? Pydoop version: pydoop==0.12.0

import os
import shutil
import pydoop.hdfs as hdfs


def prepare_data(hdfs_folder):
    folder = "test_folder"
    copies_count = 700
    src_file = "file"  # local seed file that must exist next to this script

    #1) create a folder
    if os.path.exists(folder):
        shutil.rmtree(folder)
    os.makedirs(folder)

    #2) create XXX copies of file in folder
    for x in range(copies_count):
        shutil.copyfile(src_file, os.path.join(folder, src_file + "_" + str(x)))

    #3) copy folder to hdfs
    #hadoop fs -copyFromLocal test_folder/ /maaz/test_aa
    remove_command = "hadoop fs -rmr " + hdfs_folder
    print remove_command
    os.system(remove_command)
    command = "hadoop fs -copyFromLocal " + folder + " " + hdfs_folder
    print command
    os.system(command)

def main(hdfs_folder):
    try:
        conn_hdfs = hdfs.fs.hdfs()
        if conn_hdfs.exists(hdfs_folder):
            items_list = conn_hdfs.list_directory(hdfs_folder)
            for item in items_list:
                if item["kind"] != "file":
                    continue
                file_name = item["name"]
                print "validating file : %s" % file_name

                file_handle = None
                try:
                    file_handle = conn_hdfs.open_file(file_name)
                    file_line = file_handle.readline()
                    print file_line
                except Exception as exp:
                    print '####Exception \'%s\' in reading file %s' % (str(exp), file_name)
                finally:
                    # close the handle in all cases; the original closed it in the
                    # except block too, which fails if open_file itself raised
                    if file_handle is not None:
                        file_handle.close()

        conn_hdfs.close()

    except Exception as e:
        print "####Exception \'%s\' in validating files!" % str(e)



if __name__ == '__main__':

    hdfs_path = '/abc/xyz'
    prepare_data(hdfs_path)

    main(hdfs_path)
maaz
  • You might want to give the error you get ... – patapouf_ai Apr 19 '15 at 10:38
  • This is more a (possible) bug report than a programming question. If you feel the problem lies with EMR, contact Amazon. If, on the other hand, you think something is wrong with Pydoop, head over to https://github.com/crs4/pydoop/issues. Note that, as of version 1.0.0, Pydoop's HDFS backend has been practically rewritten from scratch, so you might want to retry with the current version. – simleo Jul 13 '15 at 09:19
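
Following up on simleo's comment: with pydoop >= 1.0.0 the top-level helpers make the same check much shorter. A minimal sketch, assuming the folder contains only plain files (hdfs.ls and hdfs.open are the high-level API; the helper name is illustrative, and this is not verified on EMR):

import pydoop.hdfs as hdfs

def print_first_lines(hdfs_folder):
    # hdfs.ls() returns the full paths of the directory entries;
    # hdfs.open() returns a file-like object for an HDFS path
    for path in hdfs.ls(hdfs_folder):
        f = hdfs.open(path)
        try:
            print f.readline()
        finally:
            f.close()  # always release the handle, even on a read error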

1 Answer


I suggest using the subprocess module for reading the first line instead of pydoop's conn_hdfs.open_file:

import subprocess

# head exits after the first line, closing the pipe so cat stops reading
cmd = 'hadoop fs -cat {f} | head -1'.format(f=file_name)
process = subprocess.Popen(cmd, shell=True,
                           stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
if stdout != '':  # cat may warn on stderr when head closes the pipe early
    file_line = stdout.split('\n')[0]
else:
    print "####Exception '{e}' in reading file {f}".format(f=file_name, e=stderr)
    continue
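
Note that this fragment is meant to replace the open_file/readline block inside the for item in items_list loop from the question, which is why it ends with continue.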
Uri Goren
  • is that efficient? cat reads the full file and then takes out the first line. Also, shell=True is generally not recommended – maaz Apr 21 '15 at 16:07
  • when piping the hadoop `cat` with the bash `head`, the stream is closed after the first line, and no lines are being read besides that line – Uri Goren Apr 21 '15 at 16:58
  • Regarding performance, I didn't time the two alternatives. However, it is a common practice when you need to sample the top `n` lines of an hdfs file – Uri Goren Apr 21 '15 at 17:04
  • See: http://stackoverflow.com/questions/19778137/why-is-there-no-hadoop-fs-head-fs-shell-command – Uri Goren Apr 21 '15 at 17:07
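
Regarding the shell=True concern raised above: the same pipeline can be chained from two Popen objects without a shell, following the pattern in the Python subprocess documentation. A hedged sketch (the helper name is illustrative):

import subprocess

def read_first_line(file_name):
    # equivalent of `hadoop fs -cat <file> | head -1` without a shell
    cat = subprocess.Popen(['hadoop', 'fs', '-cat', file_name],
                           stdout=subprocess.PIPE)
    head = subprocess.Popen(['head', '-1'],
                            stdin=cat.stdout,
                            stdout=subprocess.PIPE)
    # close our copy of cat's stdout so cat gets SIGPIPE when head exits
    cat.stdout.close()
    stdout, _ = head.communicate()
    cat.wait()
    return stdout.split('\n')[0]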