
I have a script that stores XML (saved as strings) requested from various websites. Some of these responses are being truncated even though the SQL column type is NVARCHAR(MAX), and I need to know why. I'm not sure if it's a timing issue (the response is stored to SQL before it finishes downloading) or a paging(?) issue.

The issue presents itself around 62k to 65k characters and above. Anything below that character count and the XML is saved in full.

Code:

import logging
import urllib.error
import urllib.request
from datetime import datetime, timedelta, timezone

import pyodbc
from bs4 import BeautifulSoup

def get_feed(url):
    try:
        header = {'User-Agent': 'Mozilla/5.0'}
        request = urllib.request.Request(url=url, headers=header)
        response = urllib.request.urlopen(request)
        xml = BeautifulSoup(response, 'xml')
        return xml
        
    except urllib.error.HTTPError:
        return None


def get_sql_connection():
    try:
        server = 'mydb.database.windows.net'
        database = 'mydb'
        driver= '{ODBC Driver 17 for SQL Server}'
        conn_string = 'DRIVER=' + driver + ';SERVER=' + server + ';DATABASE=' + database
        conn = pyodbc.connect(conn_string + ';Authentication=ActiveDirectoryMsi')
        return conn
    except Exception as e:
        logging.error(e)
        return None


def write_xml(conn, xml):
    end_format = '%Y-%m-%dT%H:%M:%S%z'
    cursor = conn.cursor()
    data = {
        "xml_doc": str(xml),
        "created": datetime.now(timezone(timedelta(hours=-8))).strftime(end_format)}
    cursor.execute("INSERT INTO dbo.XML (xml_doc, created) VALUES (?,?)", data['xml_doc'], data['created'])
    conn.commit()

#-------------------------

conn = get_sql_connection()

url = 'https://example.com'

xml = get_feed(url)

write_xml(conn, xml)

Again... anything beyond char count of ~62k-65k, the xml is truncated.
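One way to rule out the download/parse side is to measure the payload length in Python *before* it goes to SQL. If `len(str(xml))` already exceeds ~65k but the value you copy out of your client tool stops at exactly 65,535 characters, the client display (not the INSERT) is the likely culprit. A minimal sketch, using a synthetic stand-in for the feed (in the real script you would measure the string returned by `get_feed()`):

```python
# Default "Maximum Characters Retrieved" for non-XML data in SSMS grid output.
SSMS_GRID_DEFAULT = 65_535

def payload_report(doc: str) -> dict:
    """Report the lengths a client tool might silently clip on display."""
    return {
        "chars": len(doc),
        "exceeds_grid_default": len(doc) > SSMS_GRID_DEFAULT,
    }

# Synthetic stand-in for a large feed (~70k characters).
fake_xml = "<feed>" + "<item>x</item>" * 5_000 + "</feed>"

report = payload_report(fake_xml)
print(report)  # -> {'chars': 70013, 'exceeds_grid_default': True}
```

If the report says the payload exceeds the grid default but your copied output stops at 65,535 characters, compare against `DATALENGTH()` on the server before concluding anything was lost on write.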

SeaDude
    How are you validating that they are being truncated? By selecting the data from SSMS? It's limited to 65,535 characters of output, so make sure you aren't being fooled by your tool, and check `DATALENGTH()` of the stored column value. (If the data really is truncated _on write_ then it won't be because it "stored before it finished downloading" - but your driver may be enforcing some [`SET TEXTSIZE`](https://learn.microsoft.com/en-us/sql/t-sql/statements/set-textsize-transact-sql) limitation on your behalf, maybe try issuing `SET TEXTSIZE <2 billion>;` before your query.) – Aaron Bertrand Jan 06 '22 at 04:25
  • Wow, you know what.. I was selecting the data from SSMS (selecting column, then `Copy`). I bet thats it! Crazy that the char count pointed right to it! I'll verify. Thank you! – SeaDude Jan 06 '22 at 04:41
  • Confirmed! That was the issue. – SeaDude Jan 06 '22 at 05:29

1 Answer


64K is kind of a magic number, especially with XML, but some providers and other software will also issue SET TEXTSIZE commands with small values (e.g. 32K or 64K) on your behalf. You can override this with:

SET TEXTSIZE -1;

But, typically, when people say their text has been truncated at 64K, it hasn't actually been truncated; they're trusting what a UI is showing them. To be certain, don't measure the output of the UI, check the source:

SELECT DATALENGTH(large_column) FROM dbo.table WHERE <this row>;
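From the Python side, the same check can be scripted: issue `SET TEXTSIZE -1;` to lift any driver-imposed cap, then compare `DATALENGTH()` (which for NVARCHAR returns *bytes*, two per character) against the string length you sent. This is a hedged sketch against the question's `dbo.XML` table, assuming a hypothetical `id` key column; `FakeCursor` is a stand-in so the flow can be run without a database, but any pyodbc-style cursor works the same way:

```python
def stored_length(cursor, row_id):
    """Return DATALENGTH (bytes) of the stored XML for one row."""
    cursor.execute("SET TEXTSIZE -1;")  # lift any driver-imposed result cap
    cursor.execute(
        "SELECT DATALENGTH(xml_doc) FROM dbo.XML WHERE id = ?", row_id
    )
    return cursor.fetchone()[0]

class FakeCursor:
    """Stand-in that records statements and returns a canned byte length."""
    def __init__(self):
        self.statements = []
    def execute(self, sql, *params):
        self.statements.append(sql)
    def fetchone(self):
        # 70,013 NVARCHAR characters -> 140,026 bytes
        return (140_026,)

cur = FakeCursor()
print(stored_length(cur, 1))                 # -> 140026
print("SET TEXTSIZE -1;" in cur.statements)  # -> True
```

If the byte count here is twice the character count you measured before the INSERT, nothing was truncated on write and the clipping happened on display.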

I suspect you are storing XML data in a non-XML column (or converting it to a string on output), copying the results from the grid output in SSMS, and then measuring the length of that string. That would make you think the stored data is invalid, but it's not. You are just a victim of the default maximum output for grids:

(Screenshot: SSMS Tools > Options > Query Results > Results to Grid, where "Maximum Characters Retrieved" for non-XML data defaults to 65535.)

You can increase that number, but I'll admit I haven't played with it much. For XML specifically (and sometimes even for non-XML data), it is much better to display the output as a proper XML column. This makes it a clickable cell in the grid and, when you click, it opens a document that is not truncated the way text output is (well, unless you exceed the default max of 2 MB, which you can also increase to 5 MB or, if you're really brave, unlimited).

For some other potential workarounds, see:

Aaron Bertrand