How to sanitise a block of text in Python 3 with no external modules?

Question

I was recently set a HackerRank challenge, and I couldn't strip the tags from a block of text in Python 3 without breaking the rest of the text.

Two example inputs were provided (below), and the challenge was to clean them into safe, plain text blocks. The time to complete the challenge is over, but I'm confused about how I got something so simple so wrong. Any help on how I should have gone about it would be appreciated.

Test input one

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. <script>
var y=window.prompt("Hello")
window.alert(y)
</script>Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage.

Test input two

In-text references or citations are used to acknowledge the work or ideas of others. They are placed next to the text that you have paraphrased or quoted, enabling the reader to differentiate between your writing and other people’s work.  The full details of your in-text references, <script language="JavaScript">
document.write("Page. Last update:" + document.lastModified); </script>When quoting directly from the source include the page number if available and place quotation marks around the quote, e.g. 
The World Health Organisation defines driver distraction ‘as when some kind of triggering event external to the driver results in the driver shifting attention away from the driving task’.

Test proposed output 1

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage.

Test proposed output 2

  In-text references or citations are used to acknowledge the work or ideas of others. They are placed next to the text that you have paraphrased or quoted, enabling the reader to differentiate between your writing and other people’s work. The full details of your in-text references, When quoting directly from the source include the page number if available and place quotation marks around the quote, e.g. The World Health Organisation defines driver distraction ‘as when some kind of triggering event external to the driver results in the driver shifting attention away from the driving task’.

Thanks in advance!

EDIT (Using @YakovDan's sanitisation) : The code:

def sanitize(inp_str):
    ignore_flag = False
    close_tag_count = 0

    out_str = ""
    for c in inp_str:
        if not ignore_flag:
            if c == '<':
                close_tag_count = 2
                ignore_flag = True
            else:
                out_str += c
        else:
            if c == '>':
                close_tag_count -= 1

            if close_tag_count == 0:
                ignore_flag = False

    return out_str

inp = input()
print(sanitize(inp))

The input:

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. <script>
 var y=window.prompt("Hello")
 window.alert(y)
 </script>Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage.

The output:

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy.

What the output should be:

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy.Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage.

Please clarify what is to be done. Can you provide an example output? Can you also explain what have you tried already? If I understand correctly, you have some text mixed with <> tags, and you need to clear the tags? — Yakov Dan, Oct 16 '18 at 20:10
@YakovDan Thanks for your response again! I have edited the main post with the code, input, output and what I think the output should be. The issue is that after clearing the tags it seems to delete the rest of the text that comes after it that is perfectly fine and not malicious. — James Odo, Oct 17 '18 at 11:11
I can't replicate the issue. The same code runs well on my end. Can you add the code you use to call the function? — Yakov Dan, Oct 17 '18 at 12:49
@YakovDan Thanks for getting back to me. You can see exactly how I have it running here, and if you paste the input from the main post you should receive the output I'm getting - https://repl.it/repls/FormalStiffPipelining — James Odo, Oct 17 '18 at 13:19
ok, here's the thing: if you print the input string without sanitizing it, it cuts off after — Yakov Dan, Oct 17 '18 at 13:38
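
For anyone landing here later: the cut-off described in the comments above is consistent with input() returning only a single line, so everything after the first newline of a pasted multi-line block never reaches sanitize. A minimal sketch of the difference, assuming the test text arrives on standard input; io.StringIO stands in for sys.stdin here, and read_first_line / read_everything are hypothetical helper names, not part of the original code:

```python
import io

def read_first_line(stream):
    """Mimic input(): return one line, without its trailing newline."""
    return stream.readline().rstrip("\n")

def read_everything(stream):
    """Return the whole stream, newlines included."""
    return stream.read()

# Short stand-in for the pasted test input (assumption: it contains newlines).
sample = 'infancy. <script>\nvar y=window.prompt("Hello")\n</script>Contrary to popular belief'

first = read_first_line(io.StringIO(sample))
whole = read_everything(io.StringIO(sample))

print(repr(first))      # stops at the first newline, right after the opening tag
print(whole == sample)  # the full text, closing tag and trailing prose included
```

If that is indeed the cause, calling sanitize(sys.stdin.read()) (after import sys) instead of sanitize(input()) would hand the function the complete text.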

Joe Iddon · Answer 1 · 2018-10-16T20:12:58.390

0

In general, regular expressions are the wrong tool for parsing HTML, but a regex will work for this job since the tags are simple. If the input is irregular (for example, tags that have no closing tag), it will fail.

That being said, for these two examples you can use this regex:

<.*?>.*?<\s*?\/.*?>

Implemented in Python:

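import re

# Short stand-in; replace with one of your long strings.
s = 'still in their infancy. <script>\nvar y=window.prompt("Hello")\n</script>Contrary to popular belief'
# Raw string so the backslashes reach the regex engine unchanged;
# re.DOTALL lets '.' match the newlines inside the script block.
r = re.sub(r'<.*?>.*?<\s*?\/.*?>', '', s, flags=re.DOTALL)
print(r)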

which gives the expected results (too long-winded to copy them in!).

edited Oct 16 '18 at 20:12

answered Oct 16 '18 at 20:06


Thanks! This does work, although the challenge was to do it without regex (I should have said). Otherwise this would do the trick perfectly. – James Odo Oct 16 '18 at 20:36
@JamesOdo I'm sorry, but as soon as nested tags and other complications have to be considered, the problem gets too long for me to write from scratch. You are essentially asking for a full [X]HTML parser, which is difficult to implement! If nested tags aren't a requirement, then you can implement an FSM that has a state of either "within tags" or "not within tags". Then, as you iterate over the characters, you have two decisions: do I modify my state? Do I add this character to the output? And that's it. Hopefully you can manage the implementation yourself - then the work will be yours :) – Joe Iddon Oct 16 '18 at 20:48
@JamesOdo Note that your state may have to have two parts, not just "am I within tags", because you need to account for when you are literally within a tag (such as the `"p"` in the `" – Joe Iddon Oct 16 '18 at 20:50
Thanks for your response again. As I thought, it's not as straightforward without regex, which seems odd as the test is fairly short and it wouldn't allow me to use the module. Thanks for your suggestions though, I'll give it a crack myself now that I have some more time. – James Odo Oct 16 '18 at 20:54
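
The state machine described in the comments above could be sketched roughly as follows. This is only one possible reading of it, not the commenter's own code: the state has two parts (whether we are literally inside a `<...>` tag, and how many elements are currently open), and the function name strip_tagged_blocks is made up for illustration. It assumes every opening tag has a matching closing tag and that no stray '<' or '>' appears inside the script body; input that breaks either assumption will confuse it.

```python
def strip_tagged_blocks(text):
    """Drop every <tag>...</tag> block (nested ones too) and keep the rest."""
    out = []
    in_tag = False   # state part 1: between '<' and '>' of a single tag
    depth = 0        # state part 2: number of elements currently open
    tag_chars = []   # characters of the tag currently being read

    for ch in text:
        if in_tag:
            if ch == '>':
                tag = ''.join(tag_chars).strip()
                if tag.startswith('/'):
                    depth = max(depth - 1, 0)   # closing tag
                elif not tag.endswith('/'):
                    depth += 1                  # opening tag (not self-closing)
                in_tag = False
            else:
                tag_chars.append(ch)
        elif ch == '<':
            in_tag = True
            tag_chars = []
        elif depth == 0:
            out.append(ch)                      # plain text outside any element
    return ''.join(out)


print(strip_tagged_blocks('infancy. <script>\nvar y=1\n</script>Contrary'))
```

Collecting the pieces in a list and joining once at the end avoids repeated string concatenation, though for inputs this small either approach is fine.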

score 0 · Accepted Answer · answered Oct 16 '18 at 21:50

Here's a way to do this without regex.

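def sanitize(inp_str):
    # Assumes every '<' starts a <tag>...</tag> pair with no nested tags.
    ignore_flag = False    # True while skipping a tag pair and its contents
    close_tag_count = 0    # '>' characters still expected before the pair ends

    out_str = ""
    for c in inp_str:
        if not ignore_flag:
            if c == '<':
                # Expect two '>': one for the opening tag, one for the closing tag.
                close_tag_count = 2
                ignore_flag = True
            else:
                out_str += c
        else:
            if c == '>':
                close_tag_count -= 1

            if close_tag_count == 0:
                ignore_flag = False

    return out_str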

This should do it (up to assumptions about tags)
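
If the standard library's html.parser module counts as allowed (an assumption; the thread only says the challenge ruled out regex), the tag handling can be left to HTMLParser instead of a hand-written loop. The sketch below is not what the answer above does: the class name TextOnly and the function strip_scripts are invented for illustration, and it only drops the contents of script and style elements while keeping the text of any other tags.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text data, skipping anything inside <script> or <style>."""

    SKIPPED = ("script", "style")

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def strip_scripts(text):
    parser = TextOnly()
    parser.feed(text)
    parser.close()   # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_scripts('infancy. <script>\nvar y=1\n</script>Contrary'))
```

The parser lowercases tag names before calling the handlers, so `<SCRIPT>` is treated the same way as `<script>`.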

May have spoken slightly prematurely, although this cleans all the — James Odo, Oct 16 '18 at 22:38
