My goal is to be able to identify all of the paths to Streams (files) within a MapR cluster filesystem.
Working through the problem, I've found that within a MapR cluster, Streams are stored as links to MapR Tables with read-only permissions.
These can easily be discovered using:
ls -alR -1 /mapr |grep 'lr-------- 1 mapr mapr'
lr-------- 1 mapr mapr 2 Jan 24 13:02 f -> mapr::table::2129.42.131292
lr-------- 1 mapr mapr 2 Jan 27 12:49 transactions -> mapr::table::2129.48.393912
lr-------- 1 mapr mapr 2 Jan 3 12:52 customers -> mapr::table::2129.36.131280
lr-------- 1 mapr mapr 2 Jan 3 16:47 creditcards -> mapr::table::2129.39.131286
lr-------- 1 mapr mapr 2 Jan 3 12:40 databroker -> mapr::table::2129.33.131274
lr-------- 1 mapr mapr 2 May 25 13:00 drill_test -> mapr::table::2049.12355.3399972
lr-------- 1 mapr mapr 2 Jun 14 05:23 geo -> mapr::table::2049.22145.4864546
lr-------- 1 mapr mapr 2 Jun 7 10:36 bonus -> mapr::table::2049.26487.4074656
Two problems remain:
First, the output contains both Stream files AND MapR-DB Tables. I could tell them apart using maprcli commands, but to do that I need each file's full path so I can pipe it into another program.
Second, obtaining the full paths is easy using the solutions to "ls command: how can I get a recursive full-path listing, one line per file?", but then the permission mask in the grep command can no longer be applied to shortlist the results, and I'm left with a list of every file in the cluster.
One approach I thought might work was to extract the specific file links in question using:
ls -alR -1 /mapr |grep 'lr-------- 1 mapr mapr' |awk '{ print $9 }'
which results in:
f
transactions
customers
creditcards
databroker
drill_test
geo
bonus
and then pipe them into a find loop (or something similar), but this performs poorly.
Does anyone have an approach that recursively outputs the full path and filename alongside the permissions, so that the results can be filtered as demonstrated in the grep command? The assumption I'm making (which I consider safe) is that within the cluster only MapR-DB Tables and MapR Streams will have these permissions. From a data management perspective, identifying both is useful, because services appear within a cluster and start writing data that we haven't yet captured in downstream systems (reporting, ETL, etc.).
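To illustrate the kind of output I'm after, here is a minimal sketch that uses find instead of ls. It assumes GNU find is available (for -printf) and that the link entries on the /mapr mount really do report mode 0400, as the lr-------- column above suggests. The helper name list_links is hypothetical, and I haven't verified how this performs on a large cluster.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GNU find (-printf is a GNU extension).
# list_links is a hypothetical helper, not part of any MapR tooling.
#
# list_links ROOT [MODE]
#   ROOT  directory to walk recursively
#   MODE  exact octal permission bits to match (default 0400, i.e. r--------)
list_links() {
  local root="$1"
  local mode="${2:-0400}"
  # -type l     keep symbolic links only (find does not follow them by default)
  # -perm MODE  exact match on the link's own permission bits
  # -printf     %M = symbolic mode, %u = owner, %p = full path, %l = link target
  find "$root" -type l -perm "$mode" -printf '%M %u %p -> %l\n' 2>/dev/null
}

# Intended usage on the cluster mount (commented out so the sketch has no side effects):
#   list_links /mapr
#   list_links /mapr | grep ' mapr '      # keep only entries owned by mapr
```

Because every line carries the mode, the owner, and the full path, the same kind of grep filter can still be applied, and the path column can be cut out and handed to another program.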
Better yet, the magic bullet would be generating a list of the MapR Streams registered in the cluster some other (more reliable) way. ;)