I have a function that parses HTML code so that it is easy to read and write. To do this I must split the string on multiple delimiters; as you can see, I have used re.split(), and I cannot find a better solution. However, when I submit some HTML such as this, the function has absolutely no effect. This has led me to believe that my regular expression is written incorrectly. What should be there instead?

import re

def parsed(data):
    """Removes junk from the data so it can be easily processed."""
    data = str(data)
    # This checks for cruft and removes it if it exists.
    if re.search("b'", data):
        data = data[2:-1]
    lines = re.split(r'\r|\n', data)  # This clarifies the lines for writing.
    return lines

If you find a similar question, this isn't a duplicate of it: I've been searching for ages and it still doesn't work.

anon582847382
  • *I have a function that parses HTML code so it is easy to read and write with.* Ouch. Why not use a HTML parser instead? BeautifulSoup does this in one. – Martijn Pieters Feb 20 '14 at 13:43
  • `from bs4 import BeautifulSoup`, `print(BeautifulSoup(data).prettify())`. – Martijn Pieters Feb 20 '14 at 13:44
  • Obligatory link: http://stackoverflow.com/a/1732454/10077 – Fred Larson Feb 20 '14 at 13:45
  • I looked at HTMLParser, but I couldn't see how that would remove the special characters if you know what I mean. I want the program to be as portable as possible so I dislike using external modules if at all possible. – anon582847382 Feb 20 '14 at 13:46
  • Last but not least, if your `data` value contains `b''`, then you called `str()` on a `bytes` value, instead of decoding it to a string. Don't do that either. And `str.splitlines()` does what your regular expression does with a built-in method. – Martijn Pieters Feb 20 '14 at 13:46
  • @FredLarson: the OP is trying to split on line delimiters, hardly call for summoning Zalgo here. :-) – Martijn Pieters Feb 20 '14 at 13:46
  • @MartijnPieters: Well, I saw "parse HTML" and "regular expression" in the same post. Maybe I was too quick on the trigger. – Fred Larson Feb 20 '14 at 13:48

1 Answer

You are converting a bytes value to string:

data = str(data)
# This checks for a cruft and removes it if it exists.
if re.search("b'", data):
    data = data[2:-1]

which means that all line delimiters have been converted to their Python escape codes:

>>> str(b'\n')
"b'\\n'"

That is a literal b, a literal quote, a literal backslash, a literal n, and a literal quote. You would have to split on r'\\r|\\n' instead (without a capturing group, which would make re.split() return the delimiters as well), but most of all, you shouldn't turn bytes values into their string representations here. Python produced the representation of the bytes value as a string literal that you can paste back into your Python interpreter, which is not the same thing as the value contained in the object.
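To make the difference concrete, here is a small sketch with a made-up bytes value (the sample HTML is only an illustration, not the data from the question). It applies the question's str() and slice step, then tries both patterns; the escaped pattern is written without a capturing group so that re.split() does not return the delimiters too.

```python
import re

raw = b"<p>one</p>\n<p>two</p>"  # hypothetical sample input
text = str(raw)[2:-1]             # what the question's function ends up with

# The repr contains no real newline, so the original pattern never matches.
original_split = re.split(r'\r|\n', text)

# The repr holds a backslash followed by "n"; this pattern matches that pair.
escaped_split = re.split(r'\\r|\\n', text)

print(original_split)
print(escaped_split)
```

Only the second pattern splits anything, and it is still a workaround for a value that should have been decoded in the first place.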

You want to decode to string instead:

if isinstance(data, bytes):
    data = data.decode('utf8')

where I am assuming that the data is encoded as UTF-8. If this is data from a web request, the response headers quite often include the character set used to encode the data in the Content-Type header; look for the charset= parameter.

A response produced by the urllib.request module has an .info() method, and the character set can be extracted (if provided) with:

charset = response.info().get_param('charset')

where the return value is None if no character set was provided.
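A rough sketch of how that None case could be handled, assuming UTF-8 is an acceptable fallback. The decode_body helper and its default parameter are hypothetical names of mine, and an email.message.Message object stands in for the value returned by response.info() so that the example runs without a network connection.

```python
from email.message import Message


def decode_body(body, headers, default="utf-8"):
    """Decode bytes with the charset named in headers, else use default."""
    # get_param() reads the Content-Type header and returns None
    # when the header or the charset parameter is missing.
    charset = headers.get_param("charset") or default
    return body.decode(charset)


# Stand-in for response.info(); a real response supplies its own headers.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

text = decode_body("café".encode("latin-1"), headers)
print(text)
```

A server can still declare a charset that does not match the bytes it sends, so a real program may also want to handle UnicodeDecodeError.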

You don't need a regular expression to split lines; the str type has a dedicated method, str.splitlines():

Return a list of the lines in the string, breaking at line boundaries. This method uses the universal newlines approach to splitting lines. Line breaks are not included in the resulting list unless keepends is given and true.

For example, 'ab c\n\nde fg\rkl\r\n'.splitlines() returns ['ab c', '', 'de fg', 'kl'], while the same call with splitlines(True) returns ['ab c\n', '\n', 'de fg\r', 'kl\r\n'].
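Putting the pieces together, the function from the question could look something like the sketch below. The encoding parameter and its UTF-8 default are my assumptions; pass the real character set if you know it.

```python
def parsed(data, encoding="utf-8"):
    """Return the lines of data, decoding bytes to text first."""
    if isinstance(data, bytes):
        # Decode to the real text instead of taking the repr via str().
        data = data.decode(encoding)
    # splitlines() treats \r, \n and \r\n each as a single line break.
    return data.splitlines()


lines = parsed(b"<html>\r\n<body>\n</body>\r</html>")
print(lines)
```

This needs neither the re module nor the data[2:-1] slice, because the b'...' wrapper never appears once the bytes are decoded properly.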

Martijn Pieters