-3

I want to split on dot (.) but I don't want to splits the links.

Let's say the string is -

<p>This is a paragraph. I want to split it. <a href="somesite.com">Link</a>

Expected Output -

'<p>This is a paragraph' ,'I want to split it' ,'<a href="somesite.com">Link</a>'

Current Output -

'<p>This is a paragraph' ,'I want to split it' ,'<a href="somesite', 'com">Link</a>'

Note that I don't want the link to split. Also, I know you can split it using .split(".") but how can I not split that link?

  • 2
    Can you please provide input string and expected output? – Gehan Fernando Apr 08 '21 at 08:25
  • Could do describe what you do want to split on rather than what you don't want to split on, maybe something like a period indicating the end of a sentence so one followed by a space? – doctorlove Apr 08 '21 at 08:27
  • Hey @GehanFernando, I just updated the question, please give it a look. :) – Mukeshwar Singh Apr 08 '21 at 08:33
  • Hey @doctorlove, I just added it in the question, please give it a glance. – Mukeshwar Singh Apr 08 '21 at 08:33
  • 2
    Don't use regex to parse xml/html: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – doctorlove Apr 08 '21 at 08:46
  • Understood, but is there any other solution? – Mukeshwar Singh Apr 08 '21 at 08:48
  • Using regex to split up a tree structure is going to be (almost) impossible. Why do you want this split? What are you actually trying to achieve? – doctorlove Apr 08 '21 at 08:48
  • Please check the question again, I edited and mentioned it there. – Mukeshwar Singh Apr 08 '21 at 08:50
  • What you want to do is very similar to using a regex to find a URL in a string, so you may be able to modify the code in the article [Check for URL in a String](https://www.geeksforgeeks.org/python-check-url-string/). – martineau Apr 08 '21 at 08:53
  • SEE, I want to split everything with DOT(.) in python, but I don't want the URL to split also. BUt since it contains a dot(google.com) python will split it too. But I want the opposite. – Mukeshwar Singh Apr 08 '21 at 08:59

3 Answers3

1

Use an html parser (e.g. this). Spot a paragraph start and then split the data in there like this:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        self.in_paragraph = False
        super(MyHTMLParser, self).__init__()

    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
        if tag == 'p':
            self.in_paragraph = True
        else:
            self.in_paragraph = False

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        if self.in_paragraph:
            data = data.split('.')
        print("Encountered some data  :", data)

parser = MyHTMLParser()
parser.feed('<p>This is a paragraph. I want to split it. <a href="somesite.com">Link</a>')
Encountered a start tag: p
Encountered some data  : ['This is a paragraph', ' I want to split it', ' ']
Encountered a start tag: a
Encountered some data  : Link
Encountered an end tag : a
s_baldur
  • 29,441
  • 4
  • 36
  • 69
doctorlove
  • 18,872
  • 2
  • 46
  • 62
0

Solution 1: Strings objects have a method called 'split':

s = 'google.com'

splitted = s.split('.')

print(splitted)

>>> ['google', 'com']

That takes a string and split by a substring such as '.'.

Solution 2: find the position of '.' in the string, then split it manually:

s = 'google.com'

idx = s.indexOf('.')

first = s[:idx]

sec = s[idx:]

print(first)
>>> google

print(sec)
>>> .com
Daniel111
  • 61
  • 1
  • 4
  • Thanks for the answer, but it's not as expected. Of course, it was my mistake, please re check the question, I have added somethings. – Mukeshwar Singh Apr 08 '21 at 08:35
0

I don't think what you are trying to do can be done with a regex.

The simplest approach is to simply split by ".", then iterate over the result list and search each string for "<a " and if you find one, rejoin the subsequent result list elements until you find a "</a>".