Questions tagged [data-scrubbing]

The process of detecting and correcting (or removing) corrupt or inaccurate records from a data set

Data cleansing, data cleaning, or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then resolving the issue by replacing, modifying, or deleting the errant data.

http://en.wikipedia.org/wiki/Data_cleansing

65 questions
39
votes
3 answers

What is python's equivalent of R's NA?

What is Python's equivalent of R's NA? To be more specific: R has NaN, NA, NULL, Inf and -Inf. NA is generally used when there is missing data. What is Python's equivalent? How do libraries such as numpy and pandas handle missing values? How does…
power
  • 1,680
  • 3
  • 18
  • 30
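
As a rough illustration of the question above (not the accepted answer), the sketch below shows the missing-value markers commonly used on the Python side: `None`, `numpy.nan`, and `pandas.NA`. The sample series `s`, `obj`, and `nullable` are made-up data for demonstration only.

```python
import math

import numpy as np
import pandas as pd

# Floating-point data: NaN is the usual stand-in for R's NA.
s = pd.Series([1.0, np.nan, 3.0])
missing_mask = s.isna()      # element-wise test for missing values
mean_skipping_na = s.mean()  # pandas skips NaN by default (skipna=True)

# Object data: None is also treated as missing by pandas.
obj = pd.Series(["a", None, "c"])

# Nullable integer dtype: the missing marker here is pd.NA.
nullable = pd.array([1, None, 3], dtype="Int64")

# Infinity exists as an ordinary float value, as in R.
pos_inf = float("inf")
neg_inf = -np.inf

print(missing_mask.tolist())
print(nullable[1] is pd.NA)
print(math.isinf(pos_inf), math.isinf(neg_inf))
```

Which marker applies depends on the column dtype, so `isna()` is usually a safer test than comparing against any single sentinel.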
25
votes
3 answers

Anonymizing customer data for development or testing

I need to take production data with real customer info (names, addresses, phone numbers, etc.) and move it into a dev environment, but I'd like to remove any semblance of real customer info. Some of the answers to this question can help me generating…
BradC
  • 39,306
  • 13
  • 73
  • 89
3
votes
3 answers

Ceph PGs not deep scrubbed in time keep increasing

I noticed this about 4 days ago and don't know what to do right now. The problem is as follows: I have a 6 node 3 monitor ceph cluster with 84 osds, 72x7200rpm spin disks and 12xnvme ssds for journaling. Every value for scrub configurations are…
Nyquillus
  • 179
  • 1
  • 5
  • 23
3
votes
1 answer

Check for typos comparing two strings in T-SQL

We have developed a series of business rules that determine a duplicate contact record. These rules are centred around first checking for the same name, then comparing other fields like phone number, email, phone, etc. The problem is…
Benzine
  • 472
  • 1
  • 5
  • 19
2
votes
2 answers

Staging step in Data Warehousing?

How do people usually perform the staging step in data warehousing? I have to do a similar task, and I am not sure whether using a NoSQL database would be a good option for data integration purposes or how easy and efficient it would be to perform…
daydreamer
  • 87,243
  • 191
  • 450
  • 722
2
votes
1 answer

How to read data from a PDF using SAS Program

Problem Statement: I am unable to read data from a PDF file using SAS. What worked well: I am able to download the PDF from the website and save it. Not working (Need Help): I am not able to read the data from a PDF file using SAS. The source…
anil kumar
  • 41
  • 5
2
votes
3 answers

How to extract user ratings from a movie dataset

This screenshot is a sample of the merged MovieLens dataset. I have two questions: If I want to extract only the movieid, title, genres and ratings for user 191, how do I do this? How can I list out only the years at the end of each movie…
2
votes
1 answer

How to slice a part of a string in DF when you don't know exact position?

I'm struggling with slicing. I thought that generally it's easy and I understand it but when it comes to the below situation my ideas don't work. Situation: In one of my columns in DF I want to remove in all rows some string that sometimes occurs…
QbS
  • 425
  • 1
  • 4
  • 17
2
votes
1 answer

Pandas: How can I convert 'timestamp' values in my dataframe column from object/str to timestamp?

The timestamps in my dataframe column look like the values below, but the column's dtype is 'object'. I want to convert it to 'timestamp'. How can I convert all such values in my dataframe column into timestamps? 0 01/Jul/1995:00:00:01 1 …
jubins
  • 317
  • 2
  • 7
  • 18
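
A minimal sketch of one way to approach the conversion asked about above, assuming every value follows the `01/Jul/1995:00:00:01` pattern shown in the excerpt. The column name `timestamp` and the second sample value are hypothetical, since the excerpt is truncated.

```python
import pandas as pd

# Hypothetical sample data; only the first value appears in the question.
df = pd.DataFrame(
    {"timestamp": ["01/Jul/1995:00:00:01", "01/Jul/1995:00:00:06"]}
)

# Explicit format: day / abbreviated month name / year : hour : minute : second.
# Note: %b depends on the locale's month abbreviations (English assumed here).
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d/%b/%Y:%H:%M:%S")

print(df.dtypes)
print(df["timestamp"].iloc[0])
```

Passing an explicit `format` avoids ambiguous guessing; if some rows may be malformed, `errors="coerce"` turns them into `NaT` instead of raising.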
2
votes
2 answers

Anonymize names in paragraph variable by matching and replacement

I am analyzing a school's student report card database. My dataset consists of around 3000 records structured similarly to the example below. Each observation is one teacher's assessment of one student. Each observation contains a three-sentence…
Anders Swanson
  • 3,637
  • 1
  • 18
  • 43
2
votes
3 answers

Need Better Algorithm to Scrub SQL Server Table with Java

I need to scrub an SQL Server table on a regular basis, but my solution is taking ridiculously long (about 12 minutes for 73,000 records). My table has 4 fields: id1 id2 val1 val2 For every group of records with the same "id1", I need to keep the…
2
votes
4 answers

How do you scrub a List for only matching strings?

I am trying to create a routine that takes a List from a textBox and then scrubs it using another List. Only strings with the matching text will remain. I don't think I can use RegEx, because I don't know what the scrub list will consist of. The…
Jeagr
  • 1,038
  • 6
  • 16
  • 30
1
vote
1 answer

Pandas calculating time deltas from index

I have a month of time series data, and I am trying to calculate the total hours, minutes, and seconds in the dataset, as well as for a unique Boolean column when the column is True or 1. For some reason I am doing something wrong where the total time…
bbartling
  • 3,288
  • 9
  • 43
  • 88
1
vote
1 answer

How to use BeautifulSoup to find specific class elements on a web page

Goal: To perform a web search that looks up a business and from the results, looks for either a "Permanently Closed" text or "Open" with hours or basically anything BUT "Permanently closed." Problem: I'm using BeautifulSoup to parse the search…
spareTimeCoder
  • 212
  • 2
  • 12
1
vote
0 answers

How to extract table and text from docx?

I am working on extracting text and tables from Docx files using the pydocx library. I have to extract text and tables separately from the doc file, which creates the issue of linking tabular data with text content. e.g. text There are two types of…
Hamza Shaikh
  • 75
  • 1
  • 8