Questions tagged [data-profiling]

Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data.

Data profiling is an analysis of the candidate data sources for a data warehouse to clarify the structure, content, relationships and derivation rules of the data. Profiling helps to understand anomalies and to assess data quality, but also to discover, register, and assess enterprise metadata.
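As a rough illustration of the "collecting statistics and information" part of the description above, here is a minimal sketch in plain Python. The helper names `profile_column` and `profile_table` and the sample rows are hypothetical, made up for this example. Real profiling tools compute far more than this.

```python
from collections import Counter


def profile_column(values):
    """Return basic profiling statistics for one column (a list of values)."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "count": len(values),
        # Share of missing (None) values in the column.
        "null_ratio": (len(values) - len(non_null)) / len(values) if values else 0.0,
        "distinct": len(counts),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }


def profile_table(rows):
    """Profile every column of a table given as a list of dicts."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical sample data, only for illustration.
rows = [
    {"id": 1, "city": "Oslo", "amount": 10.5},
    {"id": 2, "city": None, "amount": 7.0},
    {"id": 3, "city": "Oslo", "amount": None},
    {"id": 4, "city": "Bergen", "amount": 3.25},
]

report = profile_table(rows)
for col, stats in report.items():
    print(col, stats)
```

The sketch treats only `None` as missing. Whether empty strings or sentinel values should also count as missing depends on the data source.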

36 questions
12
votes
5 answers

Cannot start Concurrency Visualizer in Visual Studio 2012. Got error "Unable to start the ETW collection"

When I tried to profile a WPF application with Concurrency Visualizer (tried both launching and attaching to the process), I got the following error pop-up: "Unable to start the ETW collection". ETW clearly means "Event Tracing for Windows", but I don't…
10
votes
2 answers

Data profiling Task - custom Profile Request

Is there any option to create a custom Profile Request for the SSIS Data Profiling Task? At the moment there are 5 standard profile requests under the SSIS Data Profiling Task: Column Null Ratio Profile Request, Column Statistics Profile Request, Column…
Barsham
  • 749
  • 8
  • 30
4
votes
1 answer

MySQL capacity planning

In my production environment, I have a single instance of MySQL server running on 16 GB of memory that handles up to 20,000 queries an hour. The size of one of my tables is growing at a rate of 2 million per month. Both these numbers are expected to…
Dennis Y.
  • 135
  • 1
  • 7
3
votes
3 answers

Measuring peak disk use of a process

I am trying to benchmark a tool I'm developing in terms of time, memory, and disk use. I know /usr/bin/time gives me basically what I want for the first two, but for disk use I came to the conclusion I would have to roll my own bash script that…
roro
  • 177
  • 8
2
votes
1 answer

Using PyDeequ on Jupyter Notebook and getting "An error occurred while calling o70.run."

I'm trying to use PyDeequ on Jupyter Notebook. When I try to use ConstraintSuggestionRunner, it shows this error: Py4JJavaError: An error occurred while calling o70.run. : java.lang.NoSuchMethodError:…
2
votes
2 answers

How to detect and convert units of column values without using a Python loop?

As far as I know, Python loops are slow, so built-in pandas functions are preferred. In my problem, one column contains values in different currencies, and I need to convert them to dollars. How can I detect and convert them to dollars using pandas…
Kiran
  • 2,147
  • 6
  • 18
  • 35
2
votes
2 answers

How to loop through all tables and fields in each table to get percentage of missing values

I am trying to use SSIS to build a table containing the percentage of missing values for every field in every table of a SQL Server database. Ideally I would like to create a new table in another database with 4 fields: Table / Field / Type /…
fmarm
  • 4,209
  • 1
  • 17
  • 29
1
vote
2 answers

Data Profiling using PySpark

I'm trying to create a PySpark function that takes a DataFrame as input and returns a data-profile report. I have already used the describe and summary functions, which give results like min, max, count etc., but I need a detailed report like unique_values…
1
vote
1 answer

Is it possible in Snowflake to write a query that lists the columns that have all null values?

In Snowsight within Snowflake, you can profile tables and see the % of null values in the UI, but is there an easy way to query for this data or export it from the UI? I just need to create a new table off a table with 1k+, but exclude columns that…
0004
  • 1,156
  • 1
  • 14
  • 49
1
vote
1 answer

Not able to perform operations on resulting dataframe after "join" operation in PySpark

df=spark.read.csv('data.csv',header=True,inferSchema=True) rule_df=spark.read.csv('job_rules.csv',header=True) query_df=spark.read.csv('rules.csv',header=True) join_df=rule_df.join(query_df,rule_df.Rule==query_df.Rule,"inner").drop(rule_df.Rule).sho…
1
vote
0 answers

SSIS Data Profiling Task - Not showing all in Data Profile Outputs

I chose the following requests for the Data Profiling Task in SSDT 2017, but the output only shows NullRationReq and not the other requests. I tried a few times, and when I checked the profiler output XML, the DataProfileOutput node only has…
BPen
  • 43
  • 1
  • 9
1
vote
0 answers

pandas-profiling "Duplicate rows" section is not showing up in the HTML Report

I am using pandas-profiling=2.8.0 and have generated an HTML report in which 2 duplicates are shown in the Overview section, as seen below. But the "Duplicate rows" option/section is missing from my HTML report header. But in the shared example on…
PraveenS
  • 115
  • 13
1
vote
1 answer

Why do I get an IndexError while trying to get a data profiling report?

I recently started using Python, and I am trying to generate a report using pandas_profiling, but I am running into an IndexError. Can someone please explain how I can debug this? The data has about 30 variables and some 800,000 rows. So far I am trying to read a…
1
vote
0 answers

Error when running Data Profiling Task with Azure SQL Server data

When running a Data Profiling Task in SSIS with data from an Azure SQL Server, I receive the following error message: System.Data.SqlClient.SqlException (0x80131904): USE statement is not supported to switch between databases. Use a new connection…
1
vote
1 answer

Profiling the empty string in SSIS Data Profiling

I've just started using the Data Profiling Task in SSIS to profile some data on our databases. I've found the option for profiling the column null ratios ("Column Null Ratio Profiles") but I'm interested in profiling for the empty string ("") as…
t_warsop
  • 1,170
  • 2
  • 24
  • 38