
I want my regex to find a URL so that I can turn it into an HTML link. The regex will be used on links that look like the following: www.site.extension and https://site.extension. The regex is \S*\.?w{3}\.\S+\.\S+ and it gives the desired result on https://regexr.com/. In my Python script, however, I get the opposite of what I want: everything that isn't a link is treated as if it were one, while the actual links aren't found.

The Python code is:

import re

testbestand = """TESTBESTAND

Div1 kjaskdjfiudhgjnkcvdnbk djskj ij g ijg jkdfnbdiiji jj iikdafnbn ojedfkj giqw34
Akdjfkjasdf

Div2 aksjdfkj sadfkjg sdkjiew kvckjeri cdkj sdkeridk erkire

Div3 kajkdjfkjakdjgsdghijskdg

Div 4 www.link.com

Div5
Table Left  Table Right
Table Left 2    Table Right 2
Table Left 3    Table Right 3
"""

fileContent = testbestand
toAddToFile = ""

#find links
pattern = re.compile(r'\S*\.?w{3}\.\S+\.\S+')
matches = re.split(pattern, fileContent)

for match in matches:
    match = match.strip()

    if len(match) > 0:
        #TODO change to 'edit' file, instead of adding to it
        test = """<a href=" """ + match + """>" """ + match + "</a>"
        print(test)

        toAddToFile += """<a href=" """ + match + """>" """ + match + "</a>"

Thanks in advance for any help! If more info or code is needed, I'll provide it straight away.

tripleee
René Steeman

2 Answers


That's because you use re.split, which splits the text at each match of the pattern and returns the pieces in between. Instead, use re.findall:

pattern = re.compile(r'\S*\.?w{3}\.\S+\.\S+')
matches = pattern.findall(fileContent)
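
To make the difference concrete, here is a minimal sketch that runs both calls on the same pattern. The `sample` string is a shortened stand-in for the question's test text (an assumption made for brevity), and the `anchors` list is only one possible way to build the HTML afterwards.

```python
import re

# Shortened stand-in for the question's test text (assumption for brevity).
sample = "Div3 some text\n\nDiv 4 www.link.com\n\nDiv5 more text\n"

pattern = re.compile(r'\S*\.?w{3}\.\S+\.\S+')

# split() returns the text *between* the matches, so the link itself is dropped.
pieces = pattern.split(sample)

# findall() returns the matched substrings, which is what the question wants.
links = pattern.findall(sample)

# Build one anchor tag per link that was found.
anchors = ['<a href="{0}">{0}</a>'.format(link) for link in links]

print(pieces)   # the non-link text on either side of the match
print(links)    # the matched link(s)
print(anchors)
```

With re.split the loop in the question iterates over `pieces`, which explains why only the non-link text was wrapped in anchor tags.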
L3viathan

You should use re.sub instead of re.split:

toAddToFile = re.sub(r'(\S*\.?w{3}\.\S+\.\S+)', r'<a href="\1">\1</a>', fileContent)
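
As a variation on the same idea, re.sub also accepts a function as the replacement, which may help because browsers treat an href like "www.link.com" as a relative path. The sketch below is only an illustration: the helper name `to_anchor`, the `sample` text, and the choice to prepend `https://` to scheme-less links are all assumptions, not part of the original answer.

```python
import re

# Hypothetical sample text with one bare "www." link and one full URL.
sample = "Div 4 www.link.com\nsee https://www.site.org too\n"

pattern = re.compile(r'(\S*\.?w{3}\.\S+\.\S+)')

def to_anchor(match):
    """Turn one regex match into an <a> tag (hypothetical helper)."""
    url = match.group(1)
    # Assumption: links without a scheme should point to https://.
    href = url if "://" in url else "https://" + url
    return '<a href="{}">{}</a>'.format(href, url)

# sub() replaces each match in place and leaves the rest of the text untouched.
linked = pattern.sub(to_anchor, sample)
print(linked)
```

Because the substitution happens in place, the surrounding text is preserved, so the result can be written back as the edited file content rather than appended to it.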
blhsing