How to split with Dot without splitting links

Question

I want to split a string on dots (.), but I don't want the links in it to be split.

Let's say the string is -

<p>This is a paragraph. I want to split it. <a href="somesite.com">Link</a>

Expected Output -

'<p>This is a paragraph' ,'I want to split it' ,'<a href="somesite.com">Link</a>'

Current Output -

'<p>This is a paragraph' ,'I want to split it' ,'<a href="somesite', 'com">Link</a>'

Note that I don't want the link to be split. I know I can split the string using .split("."), but how can I keep that link intact?

Could you describe what you do want to split on rather than what you don't want to split on? Maybe something like a period indicating the end of a sentence, i.e. one followed by a space? — doctorlove, Apr 08 '21 at 08:27
Hey @GehanFernando, I just updated the question, please give it a look. :) — Mukeshwar Singh, Apr 08 '21 at 08:33
Hey @doctorlove, I just added it in the question, please give it a glance. — Mukeshwar Singh, Apr 08 '21 at 08:33
Don't use regex to parse xml/html: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — doctorlove, Apr 08 '21 at 08:46
Using regex to split up a tree structure is going to be (almost) impossible. Why do you want this split? What are you actually trying to achieve? — doctorlove, Apr 08 '21 at 08:48
Please check the question again, I edited and mentioned it there. — Mukeshwar Singh, Apr 08 '21 at 08:50
What you want to do is very similar to using a regex to find a URL in a string, so you may be able to modify the code in the article [Check for URL in a String](https://www.geeksforgeeks.org/python-check-url-string/). — martineau, Apr 08 '21 at 08:53
See, I want to split everything on a dot (.) in Python, but I don't want the URL to be split as well. Since the URL contains a dot (google.com), Python will split it too, and that is what I want to avoid. — Mukeshwar Singh, Apr 08 '21 at 08:59

score 1 · Answer 1 · edited Apr 08 '21 at 09:42

Use an HTML parser (e.g. Python's built-in html.parser). Detect the start of a paragraph, then split only the text data inside it, like this:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        self.in_paragraph = False
        super(MyHTMLParser, self).__init__()

    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
        if tag == 'p':
            self.in_paragraph = True
        else:
            self.in_paragraph = False

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        if self.in_paragraph:
            data = data.split('.')
        print("Encountered some data  :", data)

parser = MyHTMLParser()
parser.feed('<p>This is a paragraph. I want to split it. <a href="somesite.com">Link</a>')

Encountered a start tag: p
Encountered some data  : ['This is a paragraph', ' I want to split it', ' ']
Encountered a start tag: a
Encountered some data  : Link
Encountered an end tag : a
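To get closer to the expected output from the question, the same parser idea can collect pieces instead of printing them. The following is only a minimal sketch: the class name SentenceSplitter and the function split_sentences are made up for illustration, and it assumes simple markup (self-closing tags, comments, and character references such as &amp;amp; are not reproduced faithfully). It splits on '.' only in text between tags, so the dot inside href="somesite.com" is never seen by split().

```python
from html.parser import HTMLParser


class SentenceSplitter(HTMLParser):
    """Hypothetical helper: splits on '.' in text nodes only, never inside tags."""

    def __init__(self):
        super().__init__()
        self.pieces = []   # finished pieces
        self.current = ""  # piece currently being built

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as written, attributes included
        self.current += self.get_starttag_text()

    def handle_endtag(self, tag):
        self.current += "</%s>" % tag

    def handle_data(self, data):
        # Only text between tags reaches this method, so attribute values are untouched
        parts = data.split(".")
        for part in parts[:-1]:
            piece = (self.current + part).strip()
            if piece:
                self.pieces.append(piece)
            self.current = ""
        self.current += parts[-1]

    def close(self):
        super().close()
        # Flush whatever is left after the last dot
        if self.current.strip():
            self.pieces.append(self.current.strip())
        self.current = ""


def split_sentences(html):
    parser = SentenceSplitter()
    parser.feed(html)
    parser.close()
    return parser.pieces


print(split_sentences('<p>This is a paragraph. I want to split it. <a href="somesite.com">Link</a>'))
# ['<p>This is a paragraph', 'I want to split it', '<a href="somesite.com">Link</a>']
```

A dot inside the link text itself (for example a link whose label is "somesite.com") would still be split, because that text is ordinary data to the parser.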

Hey, thanks for answering, but I think you didn't see my expected output: 'This is a paragraph', 'I want to split it', 'Link' — Mukeshwar Singh, Apr 08 '21 at 09:54
I'm suggesting you can use the HTML parser to achieve this. I haven't written code for you to get the exact output you wanted; you can get there by changing the print statements. — doctorlove, Apr 08 '21 at 09:58

score 0 · Accepted Answer · answered Apr 08 '21 at 08:31


Solution 1: String objects have a method called 'split':

s = 'google.com'
splitted = s.split('.')
print(splitted)  # ['google', 'com']

It takes a string and splits it on a separator substring such as '.'.
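For the string in the updated question, a variation on this idea is to split on a period followed by whitespace, as suggested in the comments on the question. This is only a sketch: the function name split_on_sentence_dots is made up for illustration, and the approach assumes that dots inside links are never followed by whitespace. It is not a general way to handle HTML.

```python
import re


def split_on_sentence_dots(text):
    # Split only where a dot is followed by whitespace; the dot in "somesite.com"
    # is followed by a letter, so it is left alone
    return re.split(r"\.\s+", text)


text = '<p>This is a paragraph. I want to split it. <a href="somesite.com">Link</a>'
print(split_on_sentence_dots(text))
# ['<p>This is a paragraph', 'I want to split it', '<a href="somesite.com">Link</a>']
```

A dot followed by a space inside an attribute value or inside link text would still be split, and a dot at the very end of the string is kept.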

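Solution 2: find the position of '.' in the string, then split it manually:

s = 'google.com'
idx = s.index('.')  # Python strings have no indexOf; index() raises ValueError if '.' is missing
first = s[:idx]     # everything before the dot
sec = s[idx:]       # the dot and everything after it
print(first)  # google
print(sec)    # .com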

answered Apr 08 '21 at 08:31

Daniel111


Thanks for the answer, but it's not what I expected. Of course, that was my mistake; please recheck the question, I have added some things. – Mukeshwar Singh Apr 08 '21 at 08:35

score 0 · Answer 3 · answered Apr 08 '21 at 09:26


I don't think what you are trying to do can be done with a regex.

The simplest approach is to split on ".", then iterate over the resulting list and search each string for "<a ". When you find one, rejoin it with the elements that follow (putting the "." back between them) until you reach one that contains "</a>".
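The steps above could be sketched roughly as follows. The function name split_keeping_links is made up for illustration, and the sketch assumes links are written as "<a " ... "</a>" and are not nested; it is not a substitute for a real HTML parser.

```python
def split_keeping_links(text):
    pieces = []
    pending = None  # text of a link that has been opened but not yet closed
    for chunk in text.split("."):
        if pending is not None:
            # Still inside a link: restore the dot that split() removed
            pending += "." + chunk
        elif "<a " in chunk:
            # A link starts in this chunk
            pending = chunk
        else:
            pieces.append(chunk)
            continue
        if "</a>" in pending:
            # The link is complete, so it becomes one piece
            pieces.append(pending)
            pending = None
    if pending is not None:
        pieces.append(pending)  # unclosed link: keep what was collected
    # Trim surrounding whitespace and drop empty pieces
    return [p.strip() for p in pieces if p.strip()]


text = '<p>This is a paragraph. I want to split it. <a href="somesite.com">Link</a>'
print(split_keeping_links(text))
# ['<p>This is a paragraph', 'I want to split it', '<a href="somesite.com">Link</a>']
```

Any sentence text that shares a chunk with the link (before "<a " or after "</a>") ends up in the same piece as the link.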

answered Apr 08 '21 at 09:26

Deeepdigger


See, it will split it by - 'This is a paragraph', 'Link' However, I want it as 'This is a paragraph', ' – Mukeshwar Singh Apr 08 '21 at 09:55
