
I have this URL: s3://dev-datalake-cluster-bucket-q37evqefmksl/raw/wfm/users.11315

I need to extract the following two values from it:

  1. dev-datalake-cluster-bucket-q37evqefmksl
  2. /raw/wfm/users.11315

So far I have tried the code below, but it keeps throwing errors:

pattern = re.compile('s3://(?)/(?)', response_content)
print ( re.match(pattern, response_content) )
roeygol

2 Answers


You can use a negated character class to grab both values:

^s3://([^/]+)/(.*)

The bucket name is returned in capture group #1 and the path in capture group #2 (without the leading /).

Code:

>>> import re
>>> s = 's3://dev-datalake-cluster-bucket-q37evqefmksl/raw/wfm/users.11315'
>>> print(re.findall(r'^s3://([^/]+)/(.*)', s))
[('dev-datalake-cluster-bucket-q37evqefmksl', 'raw/wfm/users.11315')]

Regex Breakup:

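  • ^ - Start of the string (or of each line when re.MULTILINE is set)
  • s3:// - Match literal s3://
  • ([^/]+) - Match 1 or more of any character that is not / (group #1, the bucket name)
  • / - Match literal /
  • (.*) - Match the rest of the string (group #2, the path)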
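
The question asks for the path with its leading slash, which the pattern above drops. A minimal sketch of one way to keep it is to move the slash inside the second group. The names S3_URL and split_s3_url are made up for illustration and are not part of the original answer:

```python
import re

# Same idea as the pattern above, except the slash is moved inside
# group 2 so that the path keeps its leading "/".
S3_URL = re.compile(r'^s3://([^/]+)(/.*)')

def split_s3_url(url):
    """Return (bucket, path) for an s3:// URL, or None if it does not match."""
    match = S3_URL.match(url)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(split_s3_url('s3://dev-datalake-cluster-bucket-q37evqefmksl/raw/wfm/users.11315'))
# ('dev-datalake-cluster-bucket-q37evqefmksl', '/raw/wfm/users.11315')
```

Returning None for a non-matching string is a design choice for this sketch; raising an exception would work just as well if a bad URL should be treated as an error.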
anubhava

You can use named groups together with the match object's groupdict() method (s is the URL string):

>>> re_match = re.match(r's3://(?P<bucket>[^/]+)/(?P<item_path>.*)', s)
>>> re_match.groupdict()
{'bucket': 'dev-datalake-cluster-bucket-q37evqefmksl', 'item_path': 'raw/wfm/users.11315'}
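
One thing to watch for: re.match returns None when the string does not match, so calling groupdict() on the result directly would raise an AttributeError. A small sketch that guards against this, where the helper name parse_s3_url is hypothetical and not from the answer:

```python
import re

# Named groups make the result self-describing.
S3_URL = re.compile(r's3://(?P<bucket>[^/]+)/(?P<item_path>.*)')

def parse_s3_url(url):
    """Return {'bucket': ..., 'item_path': ...}, or None if url does not match."""
    match = S3_URL.match(url)
    # re.match gives None on failure, so check before calling groupdict().
    return match.groupdict() if match else None

parts = parse_s3_url('s3://dev-datalake-cluster-bucket-q37evqefmksl/raw/wfm/users.11315')
print(parts['bucket'])     # dev-datalake-cluster-bucket-q37evqefmksl
print(parts['item_path'])  # raw/wfm/users.11315
```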

Pythex is a handy resource for testing regular expressions.

shad0w_wa1k3r