Questions tagged [data-cleaning]

Data cleaning is the process of removing or repairing errors in data used by computer programs, and of normalizing that data. For example, outliers may be removed, missing samples may be interpolated, invalid values may be marked as unavailable, and synonymous values may be merged.
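
As a rough illustration of the four operations just listed, here is a minimal sketch in plain Python. The sample values, the plausible ranges, and the synonym table are all invented for the example, and the helper names (remove_outliers, interpolate_missing, mark_invalid, merge_synonyms) are hypothetical rather than taken from any library.

```python
def remove_outliers(values, low, high):
    """Drop values outside the assumed plausible range [low, high]; keep gaps (None)."""
    return [v for v in values if v is None or low <= v <= high]


def interpolate_missing(values):
    """Fill each None lying between two known values by linear interpolation.

    Leading or trailing None entries are left untouched.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def mark_invalid(values, is_valid):
    """Replace values that fail the validity check with None (unavailable)."""
    return [v if is_valid(v) else None for v in values]


def merge_synonyms(labels, mapping):
    """Map known spellings to one canonical label; unknown labels pass through."""
    return [mapping.get(label.strip().lower(), label) for label in labels]


# Hypothetical sample data: None is a missing sample, 250.0 an implausible reading.
temps = [20.0, None, 22.0, 250.0, 24.0]
temps_clean = interpolate_missing(remove_outliers(temps, -50.0, 60.0))

# Hypothetical ages where -1 was used as a placeholder for "unknown".
ages_clean = mark_invalid([34, -1, 51], lambda a: 0 <= a <= 120)

# Hypothetical synonym table keyed by lower-cased spelling.
city_synonyms = {"nyc": "New York", "new york": "New York", "new york city": "New York"}
cities_clean = merge_synonyms(["NYC", "New York", "new york city", "Boston"], city_synonyms)

print(temps_clean)
print(ages_clean)
print(cities_clean)
```

The order of the steps matters in this sketch: the outlier is dropped before interpolation so that it cannot be used as a neighbour when filling the gap.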

One approach to data cleaning is the "tidy data" framework from Wickham (http://vita.had.co.nz/papers/tidy-data.pdf), in which each row is an observation and each column is a variable.
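
To make the row-per-observation idea concrete, the sketch below reshapes a small "wide" table into that layout using plain Python. The table contents, the column names, and the to_tidy helper are assumptions made up for this illustration; for DataFrames, pandas.melt performs a comparable reshape.

```python
def to_tidy(rows, id_column, variable_name, value_name):
    """Turn wide rows (one column per measurement) into one row per observation."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every output row instead
            tidy.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return tidy


# Hypothetical wide table: the year is hidden in the column headers.
wide = [
    {"country": "A", "1999": 10, "2000": 12},
    {"country": "B", "1999": 7, "2000": 9},
]

# After reshaping, "year" and "cases" are ordinary variables (columns),
# and each row is a single (country, year) observation.
tidy = to_tidy(wide, "country", "year", "cases")
for observation in tidy:
    print(observation)
```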

3430 questions
73
votes
3 answers

Find all columns of dataframe in Pandas whose type is float, or a particular type?

I have a dataframe, df, that has some columns of type float64, while the others are of object. Due to the mixed nature, I cannot use df.fillna('unknown') #getting error "ValueError: could not convert string to float:" as the error happened with…
Yu Shen
  • 2,770
  • 3
  • 33
  • 48
61
votes
7 answers

Python Pandas replace multiple columns zero to Nan

List with attributes of persons loaded into pandas dataframe df2. For cleanup I want to replace value zero (0 or '0') by np.nan. df2.dtypes ID object Name object Weight float64 Height …
Wouter Dunnes
  • 635
  • 1
  • 5
  • 10
60
votes
2 answers

Avoiding type conflicts with dplyr::case_when

I am trying to use dplyr::case_when within dplyr::mutate to create a new variable where I set some values to missing and recode other values simultaneously. However, if I try to set values to NA, I get an error saying that we cannot create the…
socialscientist
  • 3,759
  • 5
  • 23
  • 58
49
votes
4 answers

Python pandas groupby aggregate on multiple columns, then pivot

In Python, I have a pandas DataFrame similar to the following: Item | shop1 | shop2 | shop3 | Category ------------------------------------ Shoes| 45 | 50 | 53 | Clothes TV | 200 | 300 | 250 | Technology Book | 20 | 17 | 21 …
Davide Tamburrino
  • 581
  • 1
  • 5
  • 11
31
votes
4 answers

How to clear / maintain a django-sentry database?

I am using django-sentry to track errors in a website. My problem is that the database has grown too big. The 'message' and 'groupedmessage' tables are related. Is there any way to clear older entries and specific messages or to add the sentry…
equalium
  • 1,241
  • 2
  • 12
  • 18
30
votes
3 answers

Fill in missing pandas data with previous non-missing value, grouped by key

I am dealing with pandas DataFrames like this: id x 0 1 10 1 1 20 2 2 100 3 2 200 4 1 NaN 5 2 NaN 6 1 300 7 1 NaN I would like to replace each NaN 'x' with the previous non-NaN 'x' from a row with the same 'id'…
ChrisB
  • 4,628
  • 7
  • 29
  • 41
27
votes
3 answers

Removing non-English words from text using Python

I am doing a data cleaning exercise in Python, and the text that I am cleaning contains Italian words which I would like to remove. I have been searching online to see whether I can do this in Python using a toolkit like nltk. For example…
Andre Croucher
  • 395
  • 1
  • 3
  • 9
26
votes
6 answers

Python or awk/sed for cleaning data

I use R for data analysis and am very happy with it. Cleaning data could be a bit easier, however. I am thinking about learning another language suited to this task. Specifically, I am looking for a tool to use to take raw data, remove unnecessary…
Charlie
  • 2,801
  • 3
  • 26
  • 27
23
votes
2 answers

pandas.to_numeric - find out which string it was unable to parse

Applying pandas.to_numeric to a dataframe column which contains strings that represent numbers (and possibly other unparsable strings) results in an error message like…
clstaudt
  • 21,436
  • 45
  • 156
  • 239
22
votes
1 answer

modelform: override clean method

I have two questions concerning the clean method on a modelform. Here is my example: class AddProfileForm(ModelForm): ... password = forms.CharField(max_length=30,widget=forms.PasswordInput(attrs={'class':'form2'})) …
rom
  • 3,592
  • 7
  • 41
  • 71
15
votes
1 answer

removing stop words using spacy

I am cleaning a column in my data frame, Sumcription, and am trying to do 3 things: tokenize, lemmatize, and remove stop words. import spacy nlp = spacy.load('en_core_web_sm', parser=False, entity=False) df['Tokens'] =…
Nelly Yuki
  • 399
  • 1
  • 4
  • 16
14
votes
1 answer

dplyr pipes - How to change the original dataframe

When I don't use a pipe, I can change the original dataframe using this command df<-slice(df,-c(1:3))%>% # delete top 3 rows df<-select(df,-c(Col1,Col50,Col51)) # delete specific columns How would one do this with a pipe? I tried this but the slice…
Silver.Rainbow
  • 425
  • 4
  • 14
14
votes
5 answers

How do I clean twitter data in R?

I extracted tweets from twitter using the twitteR package and saved them into a text file. I have carried out the following on the corpus xx<-tm_map(xx,removeNumbers, lazy=TRUE, 'mc.cores=1') xx<-tm_map(xx,stripWhitespace, lazy=TRUE,…
kRazzy R
  • 1,561
  • 1
  • 16
  • 44
14
votes
4 answers

multi-column factorize in pandas

The pandas factorize function assigns each unique value in a series to a sequential, 0-based index, and calculates which index each series entry belongs to. I'd like to accomplish the equivalent of pandas.factorize on multiple columns: import pandas…
ChrisB
  • 4,628
  • 7
  • 29
  • 41
13
votes
1 answer

Pandas | Group by with all the values of the group as comma separated

Per an application requirement, I need to show all the data in each group in comma-separated format so the admin can make a decision. I am new to Python and not sure how to do it. Sample reproducible data import pandas as pd compnaies =…
Vineet
  • 1,492
  • 4
  • 17
  • 31