
On Linux, the command "file <filename>" shows whether a file is compressed. How can I do the same for a file that resides in HDFS?

file 620591952596020.gz
620591952596020.gz: gzip compressed data, from FAT filesystem (MS-DOS, OS/2, NT)

file 269146229598756
269146229598756: ASCII text, with very long lines

This would let me avoid gzip-compressing a file that is already compressed, in a shell script invoked via Apache Oozie.

#!/bin/bash
HDFS_IN_PATH=$1;
IS_COMPRESS_FILE=true;

for archiveDir in 'ARCHIVE1' 'ARCHIVE2' ;
do
    HDFS_OUT_PATH=${HDFS_IN_PATH}/$archiveDir;

    for ls_entry in $(hdfs dfs -ls -C "$HDFS_IN_PATH"/$archiveDir);
    do
        fileAbsPath=$ls_entry;
        jobName=$(basename "${fileAbsPath}");

        if (hadoop fs -test -f "$fileAbsPath") ; then
            echo "It's not a directory: ${fileAbsPath}"
            continue;
        fi

        for file in $(hdfs dfs -ls -C "$fileAbsPath");
        do
            filename=$(basename "${file}");

            if [ "$IS_COMPRESS_FILE" = true ]; then

              if(<<***COMMAND TO CHECK HDFS FILE ${file} IS COMPRESSED***>>); then
                  echo "File Name: ${file} is already compressed.."
                  continue;
              fi

              hadoop fs -cat "${file}" | gzip | hadoop fs -put - "${file}".gz;

              echo "Successfully compressed file..!";
            fi
        done

        hadoop archive -archiveName "${jobName}".har -p "${HDFS_OUT_PATH}" "${jobName}" "${HDFS_OUT_PATH}";
    done
done
Vasanth Subramanian
  • This might help: https://stackoverflow.com/a/21196774/5372462 – ofirule Aug 31 '20 at 20:58
  • There is no such command in Hadoop, but I think that by using this link https://unix.stackexchange.com/questions/151008/linux-file-command-classifying-files you can implement it yourself: read only the first few bytes and determine whether the file is compressed or not. – badger Sep 05 '20 at 18:33

1 Answer

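HDFS doesn't have a command like file in Linux. Instead, checking the extension might work: if [[ "$file" == *.gz ]]. Note that the pattern must be unquoted inside [[ ]] for the glob to match, and that this only inspects the file name, not its content.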
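
To check the content rather than the name, one option is to read the first two bytes and compare them with the gzip magic number (0x1f 0x8b), along the lines suggested in the comments. The following is only a sketch: the function names is_gzip_stream and is_hdfs_file_gzipped are hypothetical helpers, not Hadoop commands, and it assumes head, od and tr are available on the node that runs the script.

```shell
#!/bin/bash

# Hypothetical helper: succeeds (exit 0) if the data on stdin starts with
# the gzip magic number, i.e. the two bytes 0x1f 0x8b.
is_gzip_stream() {
    local magic
    # Take the first 2 bytes, print them as hex, and strip spaces/newlines.
    magic=$(head -c 2 | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}

# Hypothetical helper: succeeds if the HDFS path looks gzip-compressed,
# either by its .gz extension or by its first two bytes.
is_hdfs_file_gzipped() {
    local path=$1
    # Cheap check first: no need to read the file if the name ends in .gz.
    [[ "$path" == *.gz ]] && return 0
    # hadoop fs -cat may report a broken pipe once head stops reading,
    # so its stderr is discarded here.
    hadoop fs -cat "$path" 2>/dev/null | is_gzip_stream
}
```

In the script from the question, the placeholder condition could then be written as if is_hdfs_file_gzipped "${file}"; then. This detects gzip only; other formats (bzip2, zip, and so on) have different magic numbers and would need their own comparison.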

Other options, which require coding in Python or Java, are:

  • Using ZipFileInputFormat to verify that the zip file really contains compressed content.
  • In PySpark, Python's zipfile module can be used in the form zipfile.ZipFile(in_memory_data, "r").

Both are addressed in this link.

rsantiago