
I want to perform data profiling on the columns of a table. In this particular case - what percentage of data is date/integer/numeric/bit. The query that I am using:

SELECT
    CAST(SUM(CASE WHEN TRY_CAST([column1] AS date) IS NOT NULL
                   AND TRY_CAST(TRY_CAST([column1] AS VARCHAR(8000)) AS date) BETWEEN '1950-01-01' AND '2049-12-31'
                   AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0
                  THEN 1 ELSE 0 END) AS NUMERIC(25,2))
        / (CAST(SUM(CASE WHEN [column1] IS NOT NULL
                          AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) NOT IN ('null', 'n/a')
                          AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0
                         THEN 1 ELSE 0 END) AS NUMERIC(25,2)) + 0.00000001) AS PercentDate,
    CAST(SUM(CASE WHEN TRY_CAST([column1] AS FLOAT) IS NOT NULL
                   AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0
                  THEN 1 ELSE 0 END) AS NUMERIC(25,2))
        / (CAST(SUM(CASE WHEN [column1] IS NOT NULL
                          AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) NOT IN ('null', 'n/a')
                          AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0
                         THEN 1 ELSE 0 END) AS NUMERIC(25,2)) + 0.00000001) AS PercentNumeric,
    CAST(SUM(CASE WHEN TRY_CAST([column1] AS BIGINT) IS NOT NULL
                   AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0
                  THEN 1 ELSE 0 END) AS NUMERIC(25,2))
        / (CAST(SUM(CASE WHEN [column1] IS NOT NULL
                          AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) NOT IN ('null', 'n/a')
                          AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0
                         THEN 1 ELSE 0 END) AS NUMERIC(25,2)) + 0.00000001) AS PercentInteger,
    CAST(SUM(CASE WHEN LOWER(TRY_CAST([column1] AS VARCHAR(MAX))) IN ('1', '0', 't', 'f', 'y', 'n', 'true', 'false', 'yes', 'no')
                  THEN 1 ELSE 0 END) AS NUMERIC(25,2))
        / (CAST(SUM(CASE WHEN [column1] IS NOT NULL
                          AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) NOT IN ('null', 'n/a')
                          AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0
                         THEN 1 ELSE 0 END) AS NUMERIC(25,2)) + 0.00000001) AS PercentBit
FROM tbl

This query runs really slowly even if I select only the top 1 row. In fact I am not able to get any result at all, or at least I cannot wait that long. The column I am checking is of type decimal, if that matters.


The table has 37,431,866 records. That is why I select only the top 1000, for example, but it still does not return any result after more than 40 minutes.

Yana
  • Please add details about the execution plan (see: https://stackoverflow.com/questions/7359702/how-do-i-obtain-a-query-execution-plan-in-sql-server), and the DDL for the table `tbl`. How many records are in that table? – Luuk Dec 18 '20 at 14:03
  • I have edited my question. – Yana Dec 18 '20 at 14:06
  • Do you have the DDL (the `CREATE TABLE` statement) for this table? Because of the `SUM(...)` the complete table has to be read, so doing `TOP 1000` will not help. – Luuk Dec 18 '20 at 14:11
  • Try `SELECT ... FROM (select top(1000) * from tbl) t` to limit the scan range. Any `varchar(max)` columns or something? – Serg Dec 18 '20 at 14:16
  • No, I don't have DDL. I don't have permissions. But the `SUM (...)` part sounds like the problem. Is there a way I can transform this part? – Yana Dec 18 '20 at 14:16

2 Answers


If you want this to run faster, then you don't want to limit the rows in the query you are using. After all, an aggregation query with no GROUP BY only returns one row.

Instead use a subquery:

SELECT . . .
FROM (SELECT TOP (1000) t.*
      FROM tbl t
     ) t

Note that this is not a random sample, and if you attempt `ORDER BY NEWID()` you will kill performance. One alternative to get an approximate n% sample is to use logic like this:

SELECT . . .
FROM (SELECT TOP (1000) t.*
      FROM tbl t
      WHERE RAND(CHECKSUM(NEWID())) < 0.001
     ) t

The 0.001 would be about a 0.1% sample.
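Putting the two pieces together, the sampled profiling query could look roughly like this. This is only a sketch for the `PercentDate` ratio; the other ratios follow the same pattern, and `tbl` and `[column1]` are the names from the question:

```sql
-- Approximate profiling on a ~0.1% sample instead of all 37M rows.
-- The WHERE filter keeps each row with probability ~0.001, and TOP (1000)
-- additionally caps the sample, so the scan can stop early.
SELECT
    CAST(SUM(CASE WHEN TRY_CAST(s.[column1] AS date) IS NOT NULL THEN 1 ELSE 0 END) AS NUMERIC(25,2))
        / (COUNT(*) + 0.00000001) AS PercentDate   -- same guard against division by zero
FROM (SELECT TOP (1000) t.[column1]
      FROM tbl t
      WHERE RAND(CHECKSUM(NEWID())) < 0.001
     ) s;
```

The percentages computed this way are estimates, of course; how close they are to the true values depends on how evenly the sampled rows represent the table.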

Gordon Linoff
  • Thank you, I will try your solution. But the problem remains: what happens if I want to check the whole table? Will I have to wait an eternity to get a result? – Yana Dec 18 '20 at 14:20
  • Your example with `RAND ...` returns null values. – Yana Dec 18 '20 at 14:24
  • @Yana . . . Wow! I knew that construct was a less than perfect random number generator, but I didn't realize it was that bad: https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=a182662e35708e8a0f0833b50b8e9a8f. – Gordon Linoff Dec 18 '20 at 15:06
  • I really don't understand your CTE and what you mean by it, I am sorry. And `RAND(CHECKSUM(NEWID()))` returns NULL; if I substitute `NEWID()` with `0` I get a result. – Yana Dec 18 '20 at 15:10
  • @Yana . . . That expression should be returning a random number between 0 and 1. The db<>fiddle shows that in 1000 rows, it is not even generating a value less than 0.1. That is not very random. – Gordon Linoff Dec 18 '20 at 16:58

Your query can be simplified. The part:

CAST(SUM(CASE WHEN TRY_CAST([column1] AS date) IS NOT NULL AND TRY_CAST(TRY_CAST([column1] AS VARCHAR(8000)) AS date) between '1950-01-01' AND '2049-12-31' AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))

can also be written as:

CAST(SUM(CASE WHEN TRY_CAST(TRY_CAST([column1] AS VARCHAR(8000)) AS date) between '1950-01-01' AND '2049-12-31' THEN 1 ELSE 0 END) AS NUMERIC(25,2))

The second one is quicker than the first one, with the same result (AFAIK).

This probably can also be applied to the other parts in the query.
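Applying the same idea to every ratio, and computing the trimmed string and the shared denominator once per row with `CROSS APPLY`, the whole query might be slimmed down to something like this. This is a sketch only: `tbl` and `[column1]` come from the question, the alias names `v`, `s`, and `valid` are made up here, and the trimming in the bit check differs very slightly from the original (which did not trim before comparing):

```sql
SELECT
    -- Each numerator reuses v.s, the trimmed VARCHAR value computed once per row.
    CAST(SUM(CASE WHEN TRY_CAST(v.s AS date) BETWEEN '1950-01-01' AND '2049-12-31' THEN 1 ELSE 0 END) AS NUMERIC(25,2))
        / (SUM(v.valid) + 0.00000001) AS PercentDate,
    CAST(SUM(CASE WHEN v.s <> '' AND TRY_CAST(v.s AS FLOAT) IS NOT NULL THEN 1 ELSE 0 END) AS NUMERIC(25,2))
        / (SUM(v.valid) + 0.00000001) AS PercentNumeric,
    CAST(SUM(CASE WHEN v.s <> '' AND TRY_CAST(v.s AS BIGINT) IS NOT NULL THEN 1 ELSE 0 END) AS NUMERIC(25,2))
        / (SUM(v.valid) + 0.00000001) AS PercentInteger,
    CAST(SUM(CASE WHEN LOWER(v.s) IN ('1', '0', 't', 'f', 'y', 'n', 'true', 'false', 'yes', 'no') THEN 1 ELSE 0 END) AS NUMERIC(25,2))
        / (SUM(v.valid) + 0.00000001) AS PercentBit
FROM tbl
CROSS APPLY (SELECT s = LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX)))),
                    -- valid = 1 when the value counts toward the denominator
                    valid = CASE WHEN [column1] IS NOT NULL
                                  AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) NOT IN ('null', 'n/a')
                                  AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0
                                 THEN 1 ELSE 0 END) v;
```

The point is mainly readability and fewer repeated `CAST`/`TRIM` expressions; the table still has to be scanned in full, so combine this with sampling if the runtime is the main problem.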

Luuk