Python miscounts len() for multiline string

Question

Given a simple program for testing CGI with an Apache server:

#!C:/Python311/python.exe

html = """<!doctype html />
<html>
<head>
</head>
<body>
    <h1>Hello CGI World</h1>
</body>
</html>"""
print( "Content-Type: text/html" ) 
print( f"Content-Length: {len(html)}" )
print( "" )                         
print( html )

The problem is that len(html) returns less than the actual length. In the editor (fig.1) we see 98 selected symbols.

But in the browser we see 91 symbols.

And the response body is cropped to that length.

I tried displaying the string symbol by symbol in the Python console and found that the '\n' symbols come alone, while in the editor and the browser they are '\r\n' (my guess). In any case, a single-line string has no problem.

I tried replacing '\n' with '\r\n' (.replace('\n','\r\n')), but that does not solve the problem: the browser shows extra 'CR' symbols and the body is still cropped.

Thanks in advance for any ideas.

@mozway: What? `html` doesn't even have any leading or trailing whitespace to strip. — user2357112, Dec 16 '22 at 08:25
It doesn't matter, it also works directly in the Python interpreter. — Cow, Dec 16 '22 at 08:28
@DNS But a multiline string in Python will always use \n as the newline. — Cow, Dec 16 '22 at 08:31
@DNS Well, you wrote the html variable by hand, the output from a browser in some form might give a different result. — Cow, Dec 16 '22 at 08:32
The browser gets 98 symbols, all OK. I am looking for a way to get the content length in Python code — DNS, Dec 16 '22 at 08:34
Then that is a new question, try to set up an example using Selenium or Beautiful Soup to get the html data and count that data and see if it works that way. — Cow, Dec 16 '22 at 08:35
I think the web server is supposed to handle the Content-Length header for you - I don't think you actually need to provide it yourself. Have you tried just not printing a Content-Length header? — user2357112, Dec 16 '22 at 08:38

score 1 · Answer 1 · answered Dec 16 '22 at 08:29

If I replace \n with \r\n I get exactly 98.

html = """<!doctype html />
<html>
<head>
</head>
<body>
    <h1>Hello CGI World</h1>
</body>
</html>"""
print("Content-Type: text/html")
html_length = len(html.replace('\n', '\r\n'))
print(f"Content-Length: {html_length}")
print("")

Result:

Content-Type: text/html
Content-Length: 98

In the Python interpreter:

Python 3.9.10 (tags/v3.9.10:f2f3f53, Jan 17 2022, 15:14:21) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> html = """<!doctype html />
... <html>
... <head>
... </head>
... <body>
...     <h1>Hello CGI World</h1>
... </body>
... </html>"""
>>> print("Content-Type: text/html")
Content-Type: text/html
>>> html_length = len(html.replace('\n', '\r\n'))
>>> print(f"Content-Length: {html_length}")
Content-Length: 98

Thanks, it does work. But this way we make a copy of `html` just to compute the length. There should be a more efficient algorithm — DNS, Dec 16 '22 at 08:38
I also get the right result this way: `extra = html.count('\n')` `print( f"Content-Length: {len(html) + extra}" )` But I don't like this solution. Python is a great language, it should have a pretty solution) — DNS, Dec 16 '22 at 08:50

user2357112 · Answer 2 · 2022-12-16T09:16:05.823

The Content-Length header is supposed to give the size of the message body in bytes. That is not the same as the length of the html string, because you're on Windows, and the \n characters get translated to Windows \r\n line breaks when you print them. Each line break becomes two characters.
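To make the translation visible without depending on the platform, here is a minimal sketch that writes the question's string through a text stream with an explicit newline mode and counts the bytes that come out. The helper name emitted_bytes is made up for this illustration, and the UTF-8 encoding is an assumption (the real sys.stdout may use something else):

```python
import io

# The page from the question, joined with bare "\n" as Python stores it.
html = "\n".join([
    "<!doctype html />",
    "<html>",
    "<head>",
    "</head>",
    "<body>",
    "    <h1>Hello CGI World</h1>",
    "</body>",
    "</html>",
])

def emitted_bytes(text, newline):
    """Return the bytes a text stream writes for `text` under the given newline mode."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="utf-8", newline=newline)
    stream.write(text)
    stream.flush()
    stream.detach()  # release `raw` so it stays open after `stream` goes away
    return raw.getvalue()

unix_size = len(emitted_bytes(html, "\n"))       # no translation
windows_size = len(emitted_bytes(html, "\r\n"))  # each "\n" written as "\r\n"
print(len(html), unix_size, windows_size)  # 91 91 98
```

The 7 line breaks account for the whole gap between 91 and 98.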

Additionally, any characters that get encoded to more than 1 byte in the encoding specified by sys.stdout.encoding will also cause a length mismatch (and if sys.stdout.encoding is something weird, you might not be able to print some characters, or the browser might not understand what it's looking at).
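As a small illustration of the encoding point, the sketch below compares the character count of a string with its byte count under two encodings. The sample string and the choice of UTF-8 and cp1252 are assumptions picked for the demo, not something taken from the question:

```python
# "é" and "€" are written as escapes so the example does not depend on
# how this file itself is saved.
body = "<p>caf\u00e9 \u20ac5</p>"

as_utf8 = body.encode("utf-8")      # é takes 2 bytes, € takes 3
as_cp1252 = body.encode("cp1252")   # both fit in a single byte here

print(len(body), len(as_utf8), len(as_cp1252))  # 14 17 14
```

So len(body) only matches the byte count when every character happens to encode to exactly one byte.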

You don't need to provide a Content-Length header in a CGI script - the web server will handle it for you. If you really want to compute Content-Length yourself, though, you can perform newline translation and encoding and check the length of the resulting bytestring:

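import sys

# html is the string from the question
temp = html
if sys.platform == 'win32':
    # Mimic the \n -> \r\n translation that print() applies on Windows
    temp = temp.replace('\n', '\r\n')
# Content-Length counts bytes, so measure the encoded form
temp = temp.encode(sys.stdout.encoding)
content_length = len(temp)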

You can also explicitly set sys.stdout.encoding with sys.stdout.reconfigure, or change line break translation behavior:

# Sets sys.stdout.encoding to 'utf-8'
sys.stdout.reconfigure(encoding='utf-8')

# Disables \n -> \r\n translation
sys.stdout.reconfigure(newline='\n')

or write arbitrary bytes directly to sys.stdout.buffer if you want more control.
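For the sys.stdout.buffer route, one possible shape is sketched below: encode the body once, take Content-Length from the encoded bytes, and write headers and body to a binary stream so that no newline translation happens. The function name write_response, the UTF-8 default, and the charset parameter in the Content-Type header are assumptions of this sketch, not something the CGI interface requires:

```python
import io

def write_response(out, body_text, encoding="utf-8"):
    """Write CGI headers and body as raw bytes to the binary stream `out`."""
    body = body_text.encode(encoding)  # encoded once, reused for length and output
    headers = (
        f"Content-Type: text/html; charset={encoding}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    out.write(headers)
    out.write(body)
    out.flush()

# Demo against an in-memory stream; in a CGI script the call would be
# write_response(sys.stdout.buffer, html) after `import sys`.
demo = io.BytesIO()
write_response(demo, "<p>hi</p>\n")
print(demo.getvalue())
```

Because the body goes out exactly as encoded, the "\n" characters stay single bytes and the header matches what the browser receives.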

Thanks! But I suppose there is an algorithm that does not use `replace`, which makes a copy of the string. In my example the html is a short string, but real html is much longer. In that case it really is better to leave out the Content-Length header. But in other cases the problem may arise again... — DNS, Dec 16 '22 at 09:09
