
I can access S3 files only through AWS Glue, via PySpark code such as:

import boto3

s3 = boto3.resource('s3')
for bucket in s3.buckets.all():
    print(bucket.name)

How can I find which file contains a specific value? (i.e., how to simulate cat and grep)

The goal: if I search for the word test, get back the list of S3 files that contain that word. The files are gzipped.

Joe
  • Does this answer your question? [How to grep into files stored in S3](https://stackoverflow.com/questions/41179573/how-to-grep-into-files-stored-in-s3) – MyStackRunnethOver Dec 06 '19 at 18:33
  • No, because I do not have access to `aws s3`, only to PySpark on the Glue dev endpoint (and I see S3 as in the code snippet in the question). – Joe Dec 06 '19 at 18:48

2 Answers


In PySpark, you can search the contents of the files like below:

from pyspark.sql.functions import input_file_name

input_path = "data/"  # can be an S3 location; .gz files are decompressed automatically
df = spark.read.text(input_path).select(input_file_name(), "value")
df2 = df.filter(df["value"].contains("F1"))


>>> df.show()
+--------------------+--------------------+
|   input_file_name()|               value|
+--------------------+--------------------+
|file:///Users/hbo...|"`F1`","`F2`","`F3`"|
|file:///Users/hbo...|        "a","b","c"'|
|file:///Users/hbo...|         "d","e","f"|
|file:///Users/hbo...|      "F1","F2","F3"|
|file:///Users/hbo...|         "a","b","c"|
|file:///Users/hbo...|         "d","e","f"|
+--------------------+--------------------+

>>> df2 = df.filter(df["value"].contains("F1"))
>>> df2.show()
+--------------------+--------------------+
|   input_file_name()|               value|
+--------------------+--------------------+
|file:///Users/hbo...|"`F1`","`F2`","`F3`"|
|file:///Users/hbo...|      "F1","F2","F3"|
+--------------------+--------------------+

Let me know if this works for you.

Hussain Bohra

Even if you can only use boto and not the AWS CLI, your available functionality will be the same (see this question on the differences between the CLI and boto).

Other questions exist on how to grep files in S3, using the CLI, and your approach will have to be similar:

  1. Use the client to download the files' data locally (you probably want to do this file-by-file, or at least in batches, assuming the files are somewhat large and/or numerous).
  2. Use either a shell command call (literally grep, for example) or code logic to search the file.
  3. Format your output nicely so each result is tied back to its original S3 file.

cat is even simpler than grep: take a target, get it via the client, and pipe it to standard out.
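The steps above can be sketched with boto3 and the standard gzip module; `gz_contains` and `grep_bucket` are hypothetical helper names, not part of any library:

```python
import gzip
import io

def gz_contains(word: str, gz_bytes: bytes) -> bool:
    """Decompress in memory and scan line by line (like zcat file | grep -q word)."""
    with gzip.open(io.BytesIO(gz_bytes), "rt", errors="replace") as fh:
        return any(word in line for line in fh)

def grep_bucket(s3, bucket_name: str, word: str):
    """Yield the keys of gzipped objects in `bucket_name` that contain `word`.

    `s3` is a boto3 S3 service resource, e.g. boto3.resource("s3").
    """
    for obj in s3.Bucket(bucket_name).objects.all():
        body = obj.get()["Body"].read()  # assumes each object fits in memory
        if gz_contains(word, body):
            yield obj.key

# Usage on the Glue dev endpoint, where boto3 is available:
#   import boto3
#   for key in grep_bucket(boto3.resource("s3"), "my-bucket", "test"):
#       print(key)
```

Reading whole objects into memory keeps the sketch simple; for very large objects you would want to stream the `Body` instead.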

MyStackRunnethOver