0

(I will simplify...)

A table with 45M records which increases day by day.

Each night, I have to replace a table in which I have to remove HTML, CSS tags from a nvarchar(max) field

and also convert GMT to UTC from datetime fields (8).

I tested with scalar functions but it was really slow, it takes many hours....

Which is the best way (in term of performance) to do that? I don't need a SQL query, just to explain the way/method...

thank you!

Precision: I have to extract overnight 45M records from a commercial product (I don't have control on the software) and redirect datas to our "home" database on another SQL Server...

I tried many methods, example... SQL - Remove all HTML tags in a string

Convert Datetime column from UTC to local time in select statement

but it's really slow...

navya
  • 45
  • 5
  • The best way in terms of performance would be to have the data be cleaned before it's inserted so you don't have to scrub it nightly. – Jacob H Oct 16 '17 at 19:02
  • 1
    Well what do you expect? This is so extremely vague there is no way we can provide an answer. It could be said that you haven't yet actually asked a question. It seems you need some help but without details we can't provide any direction for you. – Sean Lange Oct 16 '17 at 19:20
  • Use CLR for string processing not TSQL. – Martin Smith Oct 16 '17 at 19:47
  • How many rows are added each day? Can you track *which* rows are added and only clean those? That way you clean as you go. – Steven Hibble Oct 16 '17 at 20:47
  • 1
    From [Wikipedia](https://en.wikipedia.org/wiki/Coordinated_Universal_Time#Mechanism): "For most common and legal-trade purposes, the fractional second difference between UTC and UT (GMT) is inconsequentially small. Greenwich Mean Time is the legal standard in Britain during the winter, and this notation is familiar to and used by the population." Is the error of no more than +/-0.9s really an issue for your application? How do you get the corrections? – HABO Oct 16 '17 at 21:17

1 Answers1

0

Scalar functions will kill you. Especially when dealing with millions of records. Note this article: https://www.sqlservercentral.com/Forums/Topic1338291-203-1.aspx Try using an inline table valued function instead (note the same article).

You may also benefit from parallelizing your query. Here's a documented method for doing so. You can only make your query parallel if you don't use a scalar udf in your query.

The easiest way to strip tags and CSS (provided that it's defined in the html header) would be to use SQLXML methods. Here's an examaple which will help get you started.

declare @somehtml nvarchar(max) =
'<html>
  <head>
    <title>A Web Page</title>

    <style>
    body {background-color: powderblue;}
    h1   {color: blue;}
    p    {color: red;}
    </style>

  </head>

  <body>
    <h1>News Flash</h1>
    <p>We lulled LA into a false sense of security.</p>
    <p>0-2 means nothing. The Cubs will win. </p>
  </body>
</html>'

declare @cleaned nvarchar(max) ='';

select @cleaned  += z.xx.value('.', 'nvarchar(max)')+char(10)
from (values (cast(@somehtml as xml))) x(xx)
cross apply xx.nodes('/html/body/node()') z(xx);

print @cleaned;

Returns:

News Flash

We lulled LA into a false sense of security.

0-2 means nothing.

The Cubs will win.

Community
  • 1
  • 1
Alan Burstein
  • 7,770
  • 1
  • 15
  • 18
  • 1
    really thank! that's what I was hoping -- my reputation is too low to vote to your answer... tanx again! – navya Oct 17 '17 at 11:27