
I need to loop over all CSV files in a Hadoop file system (HDFS). I can list all of the files in an HDFS directory with

> hadoop fs -ls /path/to/directory
Found 2 items
drwxr-xr-x   - hadoop hadoop          2 2016-10-12 16:20 /path/to/directory/tmp
-rwxr-xr-x   3 hadoop hadoop 4691945927 2016-10-12 19:37 /path/to/directory/myfile.csv

and can loop over all files in a standard directory with

for filename in /path/to/another/directory/*.csv; do echo $filename; done

but how can I combine the two? I've tried

for filename in `hadoop fs -ls /path/to/directory | grep csv`; do echo $filename; done

but that gives me some nonsense like

Found
2
items
drwxr-xr-x

hadoop
hadoop
2    
2016-10-12
....
  • `hadoop fs -ls /path/to/directory | grep csv` should give you a list of lines of standard out, not necessarily just filenames. – OneCricketeer Oct 13 '16 at 01:35
  • See another question for a nice way to do this loop: http://stackoverflow.com/questions/28685471/loop-through-hdfs-directories – Chananel P Feb 08 '17 at 07:29

2 Answers


This should work:

for filename in $(hadoop fs -ls /path/to/directory | awk '{print $NF}' | grep '\.csv$' | tr '\n' ' '); do
    echo "${filename}"
done
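A brief explanation, since the comments ask for one: awk '{print $NF}' keeps only the last whitespace-separated field of each line, which for the file entries is the full HDFS path (the "Found 2 items" header survives as the word "items", but the next filter discards it); grep '\.csv$' then keeps only paths ending in .csv; and tr '\n' ' ' joins the remaining lines into a single space-separated list (strictly optional, since the shell splits on newlines as well). If you want only the file name rather than the full path, as also asked in the comments, a small variant using basename (instead of the cut shown there) should work:

# Print only the base file name: basename strips everything
# up to and including the last '/'.
for filename in $(hadoop fs -ls /path/to/directory | awk '{print $NF}' | grep '\.csv$'); do
    echo "$(basename "${filename}")"
done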
matesc
  • This works like a charm! But it prints the entire path to the file. How can I cut it short so that it prints only the file name? – user3270763 Feb 15 '17 at 19:19
  • For anyone looking for a similar solution, use 'cut' to get the substring: $(echo $filename | cut -f4 -d/) – user3270763 Feb 15 '17 at 21:29
  • See http://stackoverflow.com/questions/965053/extract-filename-and-extension-in-bash for a shorter way. – matesc Feb 15 '17 at 21:32
  • It would be great if someone could explain how this works – andrew Oct 05 '18 at 19:41
  • It works for me when I run it in the shell, but when I run it through a script the loop runs only once: the output is a single string containing the full name of every file in the directory, since the tr step replaces each newline with a space and turns the whole -ls output into one space-separated line. How can I fix this? (see the sketch below) – Amber May 21 '19 at 16:47
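One way to sidestep the issue in the last comment is to avoid word splitting altogether and read the paths one per line (a sketch; note that the body of a piped while loop runs in a subshell, so variables set inside it do not survive the loop):

# Read one path per line; no tr and no command substitution,
# so the loop behaves the same in a script and in an interactive shell.
hadoop fs -ls /path/to/directory | awk '{print $NF}' | grep '\.csv$' | while read -r filename; do
    echo "${filename}"
done

The answer below avoids parsing the long listing entirely by using -ls -C.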

The -C option will display only the file paths.

for filename in $(hadoop fs -ls -C /path/to/directory/*.csv); do
    echo "${filename}"
done
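If the loop needs to do real per-file work rather than just echo the path, here is a hypothetical example using hadoop fs -cat, which streams a file's contents to standard output:

# Hypothetical per-file work: count the records in each CSV
# without copying it out of HDFS.
for filename in $(hadoop fs -ls -C /path/to/directory/*.csv); do
    echo "${filename}: $(hadoop fs -cat "${filename}" | wc -l) lines"
done

As with the other answer, the command substitution splits on whitespace, so this assumes the paths contain none.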
vargaslo