Extract string with Python re.match

Question

import re
str="x8f8dL:s://www.qqq.zzz/iziv8ds8f8.dafidsao.dsfsi"

str2=re.match("[a-zA-Z]*//([a-zA-Z]*)",str)
print str2.group()

current result=> error
expected => wwwqqqzzz

I want to extract the string wwwqqqzzz. How do I do that?

There may be a lot of dots, such as:

"whatever..s#$@.d.:af//wwww.xxx.yn.zsdfsd.asfds.f.ds.fsd.whatever/123.dfiid"

In this case, I basically want the stuff bounded by // and /. How do I achieve that?

One additional question:

import re
str="xxx.yyy.xxx:80"

m = re.search(r"([^:]*)", str)
str2=m.group(0)
print str2
str2=m.group(1)
print str2

It seems that m.group(0) and m.group(1) are the same.

Yes, I just want the pure [a-zA-Z]* characters between // and /. There is a bunch of characters before the '//' and also after the '/' at the end. – runcode, Nov 16 '12 at 20:08

Martin Ender · Accepted Answer · 2012-11-16T20:13:26.813

42

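match only tries to match at the beginning of the string. Your pattern cannot match there, so match returns None and calling .group() on it raises an error. Use search instead, which scans the string for the first position where the pattern matches. The following pattern would then match your requirements: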

m = re.search(r"//([^/]*)", str)
print m.group(1)

Basically, we look for //, then consume as many non-slash characters as possible. Those non-slash characters are captured in group number 1.
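To make the difference between match and search concrete, here is a minimal sketch using the string from the question. The variable names (url, anchored, found) are only illustrative, and the print calls use Python 3 syntax.

```python
import re

url = "x8f8dL:s://www.qqq.zzz/iziv8ds8f8.dafidsao.dsfsi"

# re.match only tries the pattern at position 0, where there is no "//".
anchored = re.match(r"//([^/]*)", url)

# re.search scans forward until the pattern matches somewhere in the string.
found = re.search(r"//([^/]*)", url)

print(anchored)        # None
print(found.group(1))  # www.qqq.zzz
```

Because match can return None, calling .group() on its result without a check is what produces the error in the question.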

In fact, there is a slightly more advanced technique that does the same, but does not require capturing (which can add some overhead, mostly in more complex patterns). It uses a so-called lookbehind:

m = re.search(r"(?<=//)[^/]*", str)
print m.group()

Lookarounds are not included in the actual match, hence the desired result.
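Whether the lookbehind is actually faster than a capturing group depends on the pattern and the input, so it is worth measuring rather than assuming. The following is a rough sketch with timeit; the iteration count and variable names are arbitrary choices, and the timings will vary by machine and Python version.

```python
import re
import timeit

url = "x8f8dL:s://www.qqq.zzz/iziv8ds8f8.dafidsao.dsfsi"

capture = re.compile(r"//([^/]*)")
lookbehind = re.compile(r"(?<=//)[^/]*")

# Both patterns extract the same text from this input.
assert capture.search(url).group(1) == lookbehind.search(url).group()

# Time each variant over the same number of runs (20,000 is arbitrary).
t_capture = timeit.timeit(lambda: capture.search(url).group(1), number=20_000)
t_lookbehind = timeit.timeit(lambda: lookbehind.search(url).group(), number=20_000)

print(f"capture:    {t_capture:.4f} s")
print(f"lookbehind: {t_lookbehind:.4f} s")
```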

This (or any other reasonable regex solution) will not remove the dots in the same step. But this can easily be done in a second step:

m = re.search(r"(?<=//)[^/]*", str)
host = m.group()
cleanedHost = host.replace(".", "")

That does not even require regular expressions.

Of course, if you want to remove everything except for letters and digits (e.g. to turn www.regular-expressions.info into wwwregularexpressionsinfo) then you are better off using the regex version of replace:

cleanedHost = re.sub(r"[^a-zA-Z0-9]+", "", host)
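Putting the two steps together, a small helper might look like the sketch below. The function name extract_host_letters and the decision to return None when no // is found are assumptions made for illustration, not part of the original answer.

```python
import re

def extract_host_letters(text):
    """Return the letters and digits found between '//' and the next '/'."""
    m = re.search(r"//([^/]*)", text)
    if m is None:
        # Assumption: signal "no host found" with None instead of raising.
        return None
    # Second step: drop everything that is not a letter or a digit.
    return re.sub(r"[^a-zA-Z0-9]+", "", m.group(1))

print(extract_host_letters("x8f8dL:s://www.qqq.zzz/iziv8ds8f8.dafidsao.dsfsi"))
# wwwqqqzzz
```

The same helper handles the second string from the question, because the character class [^/]* stops at the first slash after // no matter how many dots the host contains.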

edited Nov 16 '12 at 20:13

answered Nov 16 '12 at 20:07

Martin Ender


Sorry, I just saw that requirement. Simply run another step: `resultstr.replace(r".", "")`. Will include that in a second. – Martin Ender Nov 16 '12 at 20:11
"there is a slightly more advanced technique that does the same, but does not require capturing (which is generally time-consuming). It uses a so-called lookbehind" - Do you have anything to back this up? Both my intuition and `timeit` tell me that lookbehinds are slower than a simple group capture. – lqc Nov 16 '12 at 21:36
What do group(0) and group(1) mean? The result of group(0) seems to be the same as group(1) in my case; I added this to the question. – runcode Nov 16 '12 at 21:48
@runcode `group(0)` gives you the complete match. `group(1)` gives you what was matched by everything inside the first set of parentheses. In your example you wrapped your whole pattern in parentheses, hence both calls give the same result. – Martin Ender Nov 16 '12 at 23:18
@lqc, I don't have any source at hand, no. I believe it mostly applies to more complex patterns, where things would be captured multiple times. After all, the engine needs to keep track of what was matched since it entered a capturing group. In any specific case, the lookbehind might be less efficient, I admit. – Martin Ender Nov 16 '12 at 23:22
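The group(0) versus group(1) point from the comments can be seen in a short sketch. The first pattern is the one from the question; the second pattern, with a port group, is an assumed variation added only to show a case where the groups differ.

```python
import re

text = "xxx.yyy.xxx:80"

# The whole pattern is wrapped in parentheses, so group 1 spans the full match.
m = re.search(r"([^:]*)", text)
print(m.group(0))  # xxx.yyy.xxx
print(m.group(1))  # xxx.yyy.xxx

# Only parts of the pattern are wrapped, so the groups differ from group 0.
m2 = re.search(r"([^:]*):(\d+)", text)
print(m2.group(0))  # xxx.yyy.xxx:80
print(m2.group(1))  # xxx.yyy.xxx
print(m2.group(2))  # 80
```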

Ωmega · Answer 2 · 2012-11-16T20:25:24.657

3

print re.sub(r"[.]","",re.search(r"(?<=//).*?(?=/)",str).group(0))

See this demo.
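One behavior of the lookahead version is worth knowing: (?=/) requires a slash after the host, so the pattern finds nothing when the string ends right after the host. The sketch below compares it with the negated character class used in the accepted answer; the two sample strings are made up for illustration.

```python
import re

with_slash = "a://www.qqq.zzz/path"
without_slash = "a://www.qqq.zzz"

lazy = r"(?<=//).*?(?=/)"   # needs a "/" after the host
negated = r"(?<=//)[^/]*"   # stops at "/" or at the end of the string

print(re.search(lazy, with_slash).group(0))        # www.qqq.zzz
print(re.search(lazy, without_slash))              # None
print(re.search(negated, without_slash).group(0))  # www.qqq.zzz
```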

edited Nov 16 '12 at 20:25

answered Nov 16 '12 at 20:19

Ωmega


score 2 · Answer 3 · edited Aug 14 '14 at 16:34

2

output = re.findall(r"(?<=//)\w+.*(?=/)", str)
final = re.sub(r"[^a-zA-Z0-9]+", "", output[0])
print final

edited Aug 14 '14 at 16:34

Uyghur Lives Matter


answered Aug 14 '14 at 15:59

John F


score 0 · Answer 4 · edited May 17 '21 at 13:09

0

import re
str_1="x8f8dL:s://www.qqq.zzz/iziv8ds8f8.dafidsao.dsfsi"

str2=re.match(".*//([a-zA-Z.]*)",str_1)
print(str2.group(1).replace('.',''))

edited May 17 '21 at 13:09

RavinderSingh13


answered May 17 '21 at 13:06

Anand K


While this code may solve the question, [including an explanation](//meta.stackexchange.com/q/114762) of how and why this solves the problem would really help to improve the quality of your post, and probably result in more up-votes. Remember that you are answering the question for readers in the future, not just the person asking now. Please [edit] your answer to add explanations and give an indication of what limitations and assumptions apply. – Dharman May 17 '21 at 13:07

score -1 · Answer 5 · edited Jan 16 '17 at 11:36

-1

import re
str="x8f8dL:s://www.qqq.zzz/iziv8ds8f8.dafidsao.dsfsi"
re.findall('//([a-z.]*)', str)

edited Jan 16 '17 at 11:36

BDL


answered Jan 16 '17 at 10:58

nitinvijay23


Although the code might solve the problem, it is not an answer on its own. One should always add an explanation to it. – BDL Jan 16 '17 at 11:36
