The process of taking raw data and parsing, filtering, extracting, organizing, combining, cleaning, or otherwise converting it into a consistent, usable form for further processing or as input to an algorithm or system.
Questions tagged [data-munging]
236 questions
127
votes
8 answers
Good alternative to Pandas .append() method, now that it is being deprecated?
I use the following method a lot to append a single row to a dataframe. One thing I really like about it is that it allows you to append a simple dict object. For example:
# Creating an empty dataframe
df = pd.DataFrame(columns=['a', 'b'])
#…

Glenn
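For the question above, one commonly suggested replacement for the deprecated `.append()` is `pd.concat`, wrapping the dict in a one-row DataFrame. The sketch below assumes a recent pandas; the helper name `append_row` and the sample values are hypothetical.

```python
import pandas as pd

def append_row(df, row):
    """Return a new DataFrame with `row` (a dict) added as the last row."""
    return pd.concat([df, pd.DataFrame([row])], ignore_index=True)

# Hypothetical starting data with the question's column names
df = pd.DataFrame({'a': [1], 'b': [2]})
df = append_row(df, {'a': 3, 'b': 4})

# When adding many rows, it is usually cheaper to collect the dicts
# in a list and build the DataFrame once at the end.
rows = [{'a': i, 'b': i * 2} for i in range(3)]
bulk = pd.DataFrame(rows, columns=['a', 'b'])
```

Each `pd.concat` call copies the data, so the list-then-build pattern tends to scale better than appending inside a loop.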
93
votes
3 answers
Pandas merge two dataframes with different columns
I'm surely missing something simple here. Trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left doesn't have, and vice versa.
>df_may
id quantity attr_1 attr_2
0…

economy
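If the goal in the question above is to stack the rows of both frames while keeping every column from either side, `pd.concat` may be a better fit than `merge`. This is a sketch under that assumption; the column values and the `df_jun` frame are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the frames share some columns but not all
df_may = pd.DataFrame({'id': [1, 2], 'quantity': [10, 20], 'attr_1': ['x', 'y']})
df_jun = pd.DataFrame({'id': [3], 'quantity': [30], 'attr_3': ['z']})

# Rows are stacked; columns missing from one frame are filled with NaN
combined = pd.concat([df_may, df_jun], ignore_index=True, sort=False)
```

If the rows should instead be matched on a key such as `id`, `df_may.merge(df_jun, how='outer')` would be the thing to try.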
40
votes
10 answers
Strip white spaces from CSV file
I need to strip the whitespace from a CSV file that I read:
import csv
aList=[]
with open(self.filename, 'r') as f:
    reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
    for row in reader:
        aList.append(row)
    # I need…

BAI
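One way to approach the question above is to strip each cell as the rows are read. The sketch below uses only the standard library; the function name `read_stripped` and the sample data are hypothetical, and an in-memory buffer stands in for the real file.

```python
import csv
import io

def read_stripped(f):
    """Read CSV rows from file object `f`, stripping whitespace from every cell."""
    reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
    return [[cell.strip() for cell in row] for row in reader]

# Hypothetical sample input with stray spaces around the values
sample = io.StringIO("a , b,  c\n 1,2 , 3 \n")
rows = read_stripped(sample)
```

With a real file, `read_stripped(open(path, newline=''))` inside a `with` block would be used the same way.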
36
votes
6 answers
How to convert a python datetime.datetime to excel serial date number
I need to convert dates into Excel serial numbers for a data munging script I am writing. By playing with dates in my OpenOffice Calc workbook, I was able to deduce that '1-Jan 1899 00:00:00' maps to the number zero.
I wrote the following function…

Homunculus Reticulli
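A minimal sketch for the question above, assuming the 1900 date system and dates on or after 1 March 1900. Under that assumption the serial number is the number of days since 30 December 1899, which differs from the epoch the asker deduced. The function name `to_excel_serial` is hypothetical.

```python
from datetime import datetime

# Assumed epoch for the 1900 date system (valid from 1 March 1900 onward)
EXCEL_EPOCH = datetime(1899, 12, 30)

def to_excel_serial(dt):
    """Convert a naive datetime to a serial date number (days, with time as a fraction)."""
    delta = dt - EXCEL_EPOCH
    seconds = delta.seconds + delta.microseconds / 1_000_000
    return delta.days + seconds / 86400

new_year = to_excel_serial(datetime(2000, 1, 1))       # 36526.0
midday = to_excel_serial(datetime(2000, 1, 1, 12, 0))  # 36526.5
```

Earlier dates need extra care because Excel treats 1900 as a leap year.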
27
votes
10 answers
Python: in-memory object database which supports indexing?
I'm doing some data munging which would be quite a bit simpler if I could stick a bunch of dictionaries in an in-memory database, then run simple queries against it.
For example, something like:
people = db([
{"name": "Joe", "age": 16},
…

David Wolever
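One option for the question above is the standard library's `sqlite3` module with an in-memory database, which supports indexes. This is only a sketch of that approach, not the `db(...)` interface the asker imagines; the table name, index name, and second record are hypothetical.

```python
import sqlite3

# The first record is from the question; the second is made up for illustration
people = [
    {"name": "Joe", "age": 16},
    {"name": "Jane", "age": 19},
]

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each result row convert to a dict
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.execute("CREATE INDEX idx_people_age ON people (age)")
conn.executemany("INSERT INTO people VALUES (:name, :age)", people)

# Query with a bound parameter and turn the rows back into dicts
minors = [dict(row) for row in conn.execute(
    "SELECT name, age FROM people WHERE age < ?", (18,))]
```

The trade-off is that the dictionaries need a fixed schema up front.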
13
votes
5 answers
pandas copy value from one column to another if condition is met
I have a dataframe:
df =
col1 col2 col3
1 2 3
1 4 6
3 7 2
I want to edit df such that when the value of col1 is smaller than 2, it takes the value from col3.
So I will get:
new_df =
col1 col2 col3
3 2 3
6 …

Cranjis
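The conditional copy described above can be sketched with a boolean mask and `.loc`, using the data shown in the question. This is one possible approach, not necessarily the one the answers settle on.

```python
import pandas as pd

# Data as shown in the question
df = pd.DataFrame({'col1': [1, 1, 3], 'col2': [2, 4, 7], 'col3': [3, 6, 2]})

# Where col1 is smaller than 2, overwrite it with the value from col3
mask = df['col1'] < 2
df.loc[mask, 'col1'] = df.loc[mask, 'col3']
```

`df['col1'].where(~mask, df['col3'])` would give the same column without modifying `df` in place.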
13
votes
2 answers
How to move my pandas dataframe to d3?
I am new to Python and have worked my way through a few books on it. Everything is great, except visualizations. I really dislike matplotlib and Bokeh requires too heavy of a stack.
The workflow I want is:
Data munging analysis using pandas in…

Anton
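For the pandas-to-d3 workflow asked about above, one simple hand-off is to serialize the DataFrame as a JSON array of records, a shape d3 code can typically consume directly. The sketch assumes that is an acceptable interchange format; the column names and values are hypothetical.

```python
import json
import pandas as pd

# Hypothetical data standing in for the result of the pandas analysis
df = pd.DataFrame({'x': [1, 2], 'y': [3.5, 4.5]})

# One JSON object per row, e.g. [{"x": 1, "y": 3.5}, ...]
records_json = df.to_json(orient='records')

# Round-trip through the json module to confirm the structure
records = json.loads(records_json)
```

The string can be written to a file or served over HTTP for the page that hosts the d3 visualization.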
11
votes
2 answers
Which Perl modules are good for data munging?
Nine years ago, when I started parsing HTML and free text with Perl, I read the classic Data Munging with Perl. Does someone know if David is planning to update the book or if there are similar books or web pages where the new parsing modules like…

Pablo Marin-Garcia
9
votes
2 answers
Python alternative to R's rm() function
How do I remove variables in Python to free RAM?
R :
a = 2
rm(a)
Python:
a = 2
How do I clear a single variable or a group of variables?

koneru nikhil
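The closest Python counterpart to R's `rm()` in the question above is the `del` statement, which removes names. The object's memory is reclaimed once nothing else refers to it. A minimal sketch, with the variable `big` added as a hypothetical second name:

```python
import gc

a = 2
big = list(range(1000))  # hypothetical second variable

# Remove one name or several at once
del a, big

# Optionally ask the garbage collector to clean up reference cycles
gc.collect()

# Accessing a deleted name now raises NameError
try:
    a
except NameError:
    name_gone = True
else:
    name_gone = False
```

Whether the freed memory is returned to the operating system depends on the interpreter and allocator.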
6
votes
4 answers
openxlsx::write.xlsx overwriting existing worksheet instead of appending
The openxlsx::write.xlsx function is overwriting the spreadsheet instead of adding another tab.
I tried to follow some guidance from Stack Overflow, but without success.
dt.escrita <- format(Sys.time(), '%Y%m%d%H%M%S')
write.xlsx( tbl.messages
…

Rafael Lima
6
votes
1 answer
Unexpected results of min() and max() methods of Pandas series made of Timestamp objects
I encountered this behaviour when doing basic data munging, like in this example:
In [55]: import pandas as pd
In [56]: import numpy as np
In [57]: rng = pd.date_range('1/1/2000', periods=10, freq='4h')
In [58]: lvls =…

LukaszJ
5
votes
3 answers
scripting with C#?
I have used Python extensively for various ad hoc data munging and ancillary tasks. Since I am learning C#, I figure it would be fun to see if I can rewrite some of these scripts in C#.
Is there an executable available that takes a .cs file and…

voidstar
5
votes
5 answers
melt column by substring of the columns name in pandas (python)
I have a dataframe:
subject A_target_word_gd A_target_word_fd B_target_word_gd B_target_word_fd subject_type
1 1 2 3 4 mild
2 …

Cranjis
5
votes
6 answers
How to do a sort of mixed values in R
I have a data frame that I want to sort by one column, then the next (using tidyverse if possible).
I checked the question linked below, but the solutions did not seem to work.
Order a "mixed" vector (numbers with letters)
Sample code for an…

Jordan
5
votes
2 answers
Data munging in pandas
I have a CSV file with lines that look like:
ID,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#,
I can read it in with
#!/usr/bin/env python
import pandas as pd
import sys
filename = sys.argv[1]
df = pd.read_csv(filename)
Given a…

Simd