
I have a CSV file with lines that look like this:

ID,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#,

I can read it in with

#!/usr/bin/env python

import pandas as pd
import sys

filename = sys.argv[1]
df = pd.read_csv(filename)

Given a particular column, I would like to split the rows by ID and then output the mean and standard deviation for each ID.

My first problem is, how can I remove all the non-numeric parts from the numbers such as "100M" and "0N#" which should be 100 and 0 respectively.

I also tried looping over the relevant headers and using

df[header].replace(regex=True,inplace=True,to_replace=r'\D',value=r'')

as suggested in [Pandas DataFrame: remove unwanted parts from strings in a column](http://stackoverflow.com/questions/13682044/pandas-dataframe-remove-unwanted-parts-from-strings-in-a-column).

However this changes 98.4 into 984.
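A way around this (my sketch, not from the thread) is to keep the decimal point in the character class, so only characters that are neither digits nor dots are removed:

```python
import pandas as pd

# Hypothetical column with values like those in the question
s = pd.Series(["98.4", "100M", "0N#"])

# Strip everything that is neither a digit nor a decimal point,
# so "98.4" survives intact instead of becoming "984"
cleaned = s.str.replace(r"[^\d.]", "", regex=True).astype(float)
print(cleaned.tolist())  # [98.4, 100.0, 0.0]
```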

  • Possible duplicate of [Pandas DataFrame: remove unwanted parts from strings in a column](http://stackoverflow.com/questions/13682044/pandas-dataframe-remove-unwanted-parts-from-strings-in-a-column) –  Nov 13 '15 at 09:47
  • @Evert Added an edit to show how I need to cope with decimal points too. – Simd Nov 13 '15 at 10:05
  • The duplicate I listed suggest to use `rstrip`, with the characters you'd like to remove. So ...`rstrip('MN#')` should work nicely for your input, using the lambda suggested in the duplicate (of course, you can leave out the `lstrip` part). –  Nov 13 '15 at 10:14
  • @Evert The problem is that MN# are not the only non-numeric characters. I suppose I would have to do lstrip to as well? – Simd Nov 13 '15 at 10:16
  • @Evert Also my looping over the headers is not working. `for header in list(df)[9:-3]:` gives me the columns not the headers. What is the right way to do that? – Simd Nov 13 '15 at 10:17
  • @eleanora Can you give a full list of non-numeric characters you'd expect or do you just want to strip any characters that aren't numeric? In either case, you can still use `rstrip`, you just need to pass a string with all the characters to be removed. – SuperBiasedMan Nov 13 '15 at 10:30
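For what it's worth, iterating `list(df)` (or `df.columns`) does yield the header labels; when the file is read with `header=None` those labels are just integers, which may be why they look like column data. A small sketch (hypothetical two-row input):

```python
import io
import pandas as pd

t = "ID,98.4,100M,55M\nID2,97.0,90M,50M"
df = pd.read_csv(io.StringIO(t), header=None)

# Iterating a DataFrame (or list(df)) yields the column *labels*,
# which with header=None are simply the integers 0, 1, 2, ...
print(list(df))  # [0, 1, 2, 3]

# Each label then indexes the corresponding column Series
for header in df.columns[2:]:
    print(df[header].tolist())
```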

2 Answers


Use `str.extract`:

In [356]:
import io
import pandas as pd
t="""ID,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#"""
df = pd.read_csv(io.StringIO(t), header=None)
df

Out[356]:
   0     1     2    3    4    5     6    7    8     9    10    11    12   13  \
0  ID  98.4  100M  55M  65M  75M  100M  75M  65M  100M  98M  100M  100M  92M   

   14   15  
0  0#  0N#  

In [357]:
for col in df.columns[2:]:
    df[col] = df[col].str.extract(r'(\d+)').astype(int)
df

Out[357]:
   0     1    2   3   4   5    6   7   8    9   10   11   12  13  14  15
0  ID  98.4  100  55  65  75  100  75  65  100  98  100  100  92   0   0

If you have float numbers then you can use the following regex:

In [379]:
t="""ID,98.4,100.50M,55.234M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#"""
df = pd.read_csv(io.StringIO(t), header=None)
df

Out[379]:
   0     1        2        3    4    5     6    7    8     9    10    11  \
0  ID  98.4  100.50M  55.234M  65M  75M  100M  75M  65M  100M  98M  100M   

     12   13  14   15  
0  100M  92M  0#  0N#  

In [380]:    
for col in df.columns[2:]:
    df[col] = df[col].str.extract(r'(\d+\.?\d+)').astype(float)
df

Out[380]:
   0     1      2       3   4   5    6   7   8    9   10   11   12  13  14  15
0  ID  98.4  100.5  55.234  65  75  100  75  65  100  98  100  100  92 NaN NaN

So `(\d+\.?\d+)` captures groups containing `\d+` (one or more digits), `\.?` (an optional decimal point), and `\d+` (one or more further digits after the decimal point). Because both `\d+` parts are required, single-digit values like `0#` and `0N#` fail to match, which is why they come out as NaN above.
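A quick check with plain `re` (my illustration, not from the answer) shows why the first pattern returns NaN for single-digit values:

```python
import re

pat = r"(\d+\.?\d+)"

# Two or more digits match fine
print(re.search(pat, "100.50M").group(1))  # 100.50
print(re.search(pat, "92M").group(1))      # 92

# ...but a lone digit does not, because the pattern demands
# \d+ both before and after the optional dot
print(re.search(pat, "0#"))                # None

# Making the trailing digits optional fixes this
pat2 = r"(\d+\.?\d*)"
print(re.search(pat2, "0#").group(1))      # 0
```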

EDIT

OK, I've edited my regex pattern:

In [408]:
t="""Name,97.7,0A,0A,65M,0A,100M,5M,75M,100M,90M,90M,99M,90M,0#,0N#"""
df = pd.read_csv(io.StringIO(t), header=None)
df

Out[408]:
     0     1   2   3    4   5     6   7    8     9    10   11   12   13  14  \
0  Name  97.7  0A  0A  65M  0A  100M  5M  75M  100M  90M  90M  99M  90M  0#   

    15  
0  0N#  

In [409]:    
for col in df.columns[2:]:
    df[col] = df[col].str.extract(r'(\d+\.?\d*)').astype(float)
df

Out[409]:
     0     1   2   3   4   5    6   7   8    9   10  11  12  13  14  15
0  Name  97.7   0   0  65   0  100   5  75  100  90  90  99  90   0   0
  • df = pd.read_csv(io.StringIO(t), header=None) TypeError: initial_value must be unicode or None, not str – Simd Nov 13 '15 at 10:38
  • Ignore the `io.StringIO` bit this is just me reading in your text as a file object in your case `df=pd.read_csv(filename)` is fine – EdChum Nov 13 '15 at 10:39
  • If this is the whole file "Name",97.7,0A,0A,65M,0A,100M,5M,75M,100M,90M,90M,99M,90M,0#,0N#," Your code gives me "0 Name NaN NaN NaN 65 NaN 100 NaN 75 100 90 90 99 90 NaN NaN NaN" – Simd Nov 13 '15 at 10:42
  • The edited version is better and gives `0 Name NaN 0 0 65 0 100 5 75 100 90 90 99 90 0 0 NaN`. This still kills the 97.7 at the start. – Simd Nov 13 '15 at 10:49
  • I don't see that, can you post raw input data and code into your question – EdChum Nov 13 '15 at 10:50
  • It's unnecessary to parse the `97.7` in this case as pandas will set the `dtype` to `float64` which is why I indexed from `2:` onwards – EdChum Nov 13 '15 at 10:55
  • True but I would like to be able to handle 97.7M too . – Simd Nov 13 '15 at 10:58
  • if it was `97.7M` then my code will work, the problem is that calling `.str.extract` will raise an error as this only works for `str` columns and not numeric ones – EdChum Nov 13 '15 at 10:59
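Following up on that last exchange: one defensive option (my sketch, not from the answer) is to apply `.str.extract` only to object-dtype columns, so columns that pandas already parsed as numeric, like the `97.7` one, are left alone and never trigger the error:

```python
import io
import pandas as pd

t = "Name,97.7,0A,65M,0#"
df = pd.read_csv(io.StringIO(t), header=None)

# Column 1 was parsed as float64 by read_csv; clean only the
# object (string) columns, since .str only works on those
for col in df.columns[1:]:
    if df[col].dtype == object:
        df[col] = df[col].str.extract(r"(\d+\.?\d*)", expand=False).astype(float)

print(df.iloc[0].tolist())  # ['Name', 97.7, 0.0, 65.0, 0.0]
```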

My first problem is, how can I remove all the non-numeric parts from the numbers such as "100M" and "0N#" which should be 100 and 0 respectively.

import re
df = pd.read_csv(yourfile, header=None)
df.columns = ['ID'] + list(df.columns)[1:]
df = df.stack().apply(lambda v: re.sub('[^0-9.]', '', v)
                 if isinstance(v, str) else v).astype(float).unstack()
df.groupby('ID').agg(['std', 'mean'])

Here .stack() transforms the dataframe into a Series, .apply() calls the lambda for each value, re.sub() removes any non-numeric characters, .astype() converts to numeric and unstack() transforms the Series back into a dataframe. This works equally well for both integers and floating point numbers.

Given a particular column, I would like to split the rows by ID and then output the mean and standard deviation for each ID.

 # for all columns
 df.groupby('ID').agg(['std', 'mean'])
 # for specific column
 df.groupby('ID')['<colname>'].agg(['std', 'mean'])

[screenshot of the output dataframe]

Here's the data used in the example:

from io import StringIO
s="""
1,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#,
1,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#,
2,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#,
2,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#,
"""
yourfile = StringIO(s)