0

I have a dataframe extracted with Pandas for which one of the colums looks something like this:

Image: A few elements of the column of interest of the Dataframe

What I want to do is to extract the numerical values (floats) in this column, which by itself I could do. The issue comes because I have some cells, like the cell 20 in the image, in which I have more than one number, so I would like to make an average of these values. I think that for that I would first need to recognize the different groups of numerical values in the string (each float number) and then extract them as floats to then operate with them. I don't know how to do this.

Edit: I have found an solution to this using the re.findall command from regex. This is based on the answer of a question in this thread Find all floats or ints in a given string.

for index,value in z.iteritems():
z[index]=statistics.mean([float(h) for h in re.findall(r'(?:\b\d{1,2}\b(?:\.\d*))',value)])

Note that I haven't included match for integers, and only account for values up to 99, just due to the type of data that I have.

However, I get a warning with this approach, due to the loop (there is no warning when I do it only for one element of the series):

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

Although I don't see any issue happening with my data, is this warning important?

Miguel
  • 96
  • 8

2 Answers2

0

I think you can benefit from the Pandas vectorized operations here. Use findall over the original dataframe and apply in sequence the pd.Series to transform from list to columns and pd.to_numeric to convert from string to numeric type (default return dtype is float64). Then calculate the average of the values on each row with .mean(axis=1).

import pandas as pd

d = {0: {0: '2.469 (VLT: emission host)',
  1: '1.942 (VLT: absorption)',
  2: '1.1715 (VLT: absorption)',
  3: '0.42 (NOT: absorption)|0.4245 (GTC)|0.4250 (ESO-VLT UT2: absorption & emission)',
  4: '3.3765 (VLT: absorption)',
  5: '1.86 (Xinglong: absorption)| 1.86 (GMG: absorption)|1.859 (VLT: absorption)',
  6: '<2.4 (NOT: inferred)'}}

df = pd.DataFrame(d)
print(df)

s_mean = df[0].str.findall(r'(?:\b\d{1,2}\b(?:\.\d*))')\
        .apply(pd.Series)\
        .apply(pd.to_numeric)\
        .mean(axis=1)

print(s_mean)

Output from s_mean

0    2.469000
1    1.942000
2    1.171500
3    0.423167
4    3.376500
5    1.859667
6    2.400000
n1colas.m
  • 3,863
  • 4
  • 15
  • 28
  • Your solution does not give a warning as output, but it generates an error later in my code, which does not happen if I use my solution: `ValueError: could not convert string to float: '~6.0 (estimate, photometric)'`. This occurs when using `pd.cut` to bin the dataframe based on the values of this column. The issue is with the cells in which the original string contains the character `~` just before the float, with no space in between. However, it is strange that in this column of the dataframe we can only see the numbers (in the case of the example it shows a `6`) and dtypes is float. – Miguel Feb 16 '21 at 15:25
0

I have found a solution based on what I wrote previously in the Edit of the original post:

It consists on using the re.findall() command with regex, as posted in this thread Find all floats or ints in a given string:

statistics.mean([float(h) for h in re.findall(r'(?:\b\d{1,2}\b(?:\.\d*))',string)])

Then, to loop over the dataframe column, just use the lambda x: method with the pandas apply command (df.apply). For this, I have defined a function (redshift_to_num) executing the operation above, and then apply this function to each element in the dataframe column:

import re
import pandas as pd
import statistics

def redshift_to_num(string):
    measures=[float(h) for h in re.findall(r'(?:\b\d{1,2}\b(?:\.\d*))',string)]
    mean=statistics.mean(measures)
    return mean

df.Redshift=df.Redshift.apply(lambda x: redshift_to_num(x))

Notes:

  • The data of interest in my case is stored in the dataframe column df.Redshift.
  • In the re.findall command I haven't included match for integers, and only account for values up to 99, just due to the type of data that I have.
Miguel
  • 96
  • 8