
I normalized several columns of a pandas DataFrame in a for-loop, under the following conditions:

Columns A and B are normalized to the range [-1, +1].

Column C is normalized to the range [-40, +150].

The results go into a separate DataFrame, called norm_data, which is then saved as a CSV file.

My data comes from a .txt file.

# Import and call the needed libraries
import numpy as np
import pandas as pd

# Min-max normalization: rescale value from [min_value, max_value] to [min_norm, max_norm]

def normalize(value, min_value, max_value, min_norm, max_norm):
    new_value = ((max_norm - min_norm)*((value - min_value)/(max_value - min_value))) + min_norm
    return new_value

#Split data in three different lists A, B and C

df1 = pd.read_csv(r'D:\me4.TXT', header=None)  # raw string keeps the backslash literal
id_set = df1[df1.index % 4 == 0].astype('int').values
A = df1[df1.index % 4 == 1].values
B = df1[df1.index % 4 == 2].values
C = df1[df1.index % 4 == 3].values
data = {'A': A[:,0], 'B': B[:,0], 'C': C[:,0]} # arrays
#df contains all the data
df = pd.DataFrame(data, columns=['A','B','C'], index = id_set[:,0]) 
df2 = pd.DataFrame(data, index= id_set[0:])
print(df)

#--------------------------------
cycles = int(len(df)/480)
print(cycles)

#loop over the columns and normalize each one to its target range
for i in df:
    min_val = df[i].min()
    max_val = df[i].max()
    if i=='C':
        #Applying normalization for C between [-40,+150]
        data['C'] = normalize(df[i].values, min_val, max_val, -40, 150)
    elif i=='A':
        #Applying normalization for A , B between [-1,+1]
        data['A'] = normalize(df[i].values, min_val, max_val, -1, 1)
    else:
        data['B'] = normalize(df[i].values, min_val, max_val, -1, 1)


norm_data = pd.DataFrame(data)
print(norm_data)
norm_data.to_csv('norm.csv')
df2.to_csv('my_file.csv')
print(df2)

The problem: after getting the normalization working with help from @Lucas, I lost my index, which was labeled id_set.

So far I get the output below in my_file.csv, along with this error: TypeError: unsupported format string passed to numpy.ndarray.__format__

id_set         A         B           C
['0']      2.291171  -2.689658  -344.047912
['10']     2.176816  -4.381186  -335.936524
['20']     2.291171  -2.589725  -342.544885
['30']     2.176597  -6.360999     0.000000
['40']     2.577268  -1.993412  -344.326376
['50']     9.844076  -2.690917  -346.125859
['60']     2.061782  -2.889378  -346.378859
['70']     2.348300  -2.789547  -347.980986
['80']     6.973350  -1.893454  -337.884738
['90']     2.520040  -3.087004  -349.209006

The [''] wrappers are unwanted. After normalization, my desired output should look like this:

id_set     A         B           C
000   -0.716746  0.158663  112.403310
010   -0.726023  0.037448  113.289702
020   -0.716746  0.165824  112.567557
030   -0.726040 -0.104426  150.000000
040   -0.693538  0.208556  112.372881
050   -0.104061  0.158573  112.176238
060   -0.735354  0.144351  112.148590
070   -0.712112  0.151505  111.973514
080   -0.336932  0.215719  113.076807
090   -0.698181  0.130189  111.839319
010    0.068357 -0.019388  114.346421
011    0.022007  0.165824  112.381444

Any ideas would be welcome, since this data is important to me.

Mario

1 Answer


If I understand you correctly, my_file.csv / df2 should look like the lower output from your question. In that case I believe you just have a typo in your construction of df2. You want its index to look the same as the index of df, so use:

df2 = pd.DataFrame(data, index = id_set[:,0])

instead of

df2 = pd.DataFrame(data, index= id_set[0:])

(notice the contents of the square brackets). This will make your output file my_file.csv look like this:

,A,B,C
0,2.19117130798,-2.5897247305,-342.54488522400004
10,2.19117130798,-4.3811855641,-335.936524309
20,2.19117130798,-2.5897247305,-342.54488522400004
...

While your output file norm.csv looks like this:

,A,B,C
0,-1.0,0.16582420581574775,145.05394742081884
1,-1.0,0.037447604422215175,145.9298596578588
2,-1.0,0.16582420581574775,145.05394742081884
...

If you want your output file norm.csv to have the same index (0,10,20 instead of 0,1,2...) you need to define norm_data as

norm_data = pd.DataFrame(data, index = id_set[:,0])

instead of

norm_data = pd.DataFrame(data)
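The difference between the two slices can be seen on a small stand-in array. This is only a sketch: the `id_set` and `data` values below are made up, and the `index_label='id_set'` argument is an optional extra (not part of the fix above) that names the index column in the CSV header.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for id_set: one column of ids, shape (3, 1)
id_set = np.array([[0], [10], [20]])
data = {'A': [2.29, 2.18, 2.29], 'B': [-2.69, -4.38, -2.59]}

# id_set[0:] keeps every row AND the column axis, so it is still 2-D
print(id_set[0:].shape)    # (3, 1)
# id_set[:, 0] picks the first column, giving a flat array of labels
print(id_set[:, 0].shape)  # (3,)

# A flat array gives one plain label per row
norm_data = pd.DataFrame(data, index=id_set[:, 0])
print(norm_data.index.tolist())  # [0, 10, 20]

# Optional: name the index column when writing the CSV
buf = io.StringIO()
norm_data.to_csv(buf, index_label='id_set')
header = buf.getvalue().splitlines()[0]
print(header)  # id_set,A,B
```

Because `id_set[0:]` is two-dimensional, each index entry is a one-element array rather than a scalar, which is likely where the `['0']` labels in the question come from.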

Also, I should note that your data contains a couple of NaN/inf entries, which mess up your normalization.

You can replace those using

df = df.replace(np.inf, np.nan)
df = df.fillna(0)

(credit to this question/answer), and do the same for df2. You can also replace the NaN/inf entries with other values using the same functions.
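To illustrate why a single inf entry flattens a whole column, here is a minimal sketch on made-up numbers. The Series `s` is hypothetical, and `normalize` is a compact copy of the function from the question. The last variant (computing min/max on the finite values and leaving the gaps as NaN) is one possible alternative to filling with 0, not something the original code does.

```python
import numpy as np
import pandas as pd

def normalize(value, min_value, max_value, min_norm, max_norm):
    # Same min-max formula as in the question
    return (max_norm - min_norm) * ((value - min_value) / (max_value - min_value)) + min_norm

# Made-up column; the inf and NaN mimic the bad rows in the dataset
s = pd.Series([2.0, 4.0, np.inf, np.nan, 6.0])

# With inf present, max() is inf, so every finite value collapses to min_norm
bad = normalize(s, s.min(), s.max(), -1, 1)
print(bad.tolist())  # [-1.0, -1.0, nan, nan, -1.0]

# Count exact zeros already in the data before using 0 as a fill value
zeros_before = int((s == 0).sum())

# Replace both +inf and -inf, then fill the gaps with 0
cleaned = s.replace([np.inf, -np.inf], np.nan).fillna(0)
print(cleaned.tolist())  # [2.0, 4.0, 0.0, 0.0, 6.0]

# Alternative: keep the gaps as NaN so they cannot shift min/max
finite = s.replace([np.inf, -np.inf], np.nan)
fixed = normalize(finite, finite.min(), finite.max(), -1, 1)
print(fixed.tolist())  # [-1.0, 0.0, nan, nan, 1.0]
```

Note that after `fillna(0)` the filled 0 becomes the new minimum of this column, so the fill value itself changes the scaling. Whether that matters depends on the data.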

Freya W
  • Thanks a million for spotting the typo that messed up my index, and thanks for noticing the missing data in my dataset, which is in the last 5 lines of the txt file. In the end I used `df = df.replace([np.inf, -np.inf], np.nan).astype(np.float64)` just in case, followed by `df = df.fillna(0)`. My only concern: what if the dataset already contains a value like **0.0**? Wouldn't replacing missing data with **0** interfere with the final result? Or is this kind of replacement the only option? Is there an elegant way to replace with another number, such as `0.01234`? – Mario Jan 22 '19 at 16:31
  • I was wondering if you could find a solution for my other question [here](https://stackoverflow.com/questions/54282812/how-can-make-subplots-of-columns-in-pandas-dataframe-in-one-window-inside-of-for) regarding `dataframe` and `subplotting` and make me happy . Happy to find such accurate people here in SO – Mario Jan 22 '19 at 16:42
  • @Mario sure you can, just write `df.fillna(0.01234)`. You'll know best which numbers will least affect your dataset. If the answer was helpful, be sure to choose it as "accepted answer" – Freya W Jan 22 '19 at 22:50
  • I admit your answer and I just was waiting for your comment. Would appreciate it if you take a look my other question since it's highly important with me and I think you can solve it [here](https://stackoverflow.com/questions/54282812/how-can-make-subplots-of-columns-in-pandas-dataframe-in-one-window-inside-of-for) BTW how can I check in my dataset whether I have **0.0** or not? Any ideas? – Mario Jan 22 '19 at 22:54
  • @Mario, as for your other question, I'll have a proper look tomorrow, but you can get rid of the key error by just leaving out the [i].values in the normalize call, so `new_value3 = normalize(df['C'].iloc[j:j+480], min_val, max_val, -40, 150)`, same for A and B – Freya W Jan 22 '19 at 23:10
  • @Mario, the sample file you provided in your other question only has one cycle, not three, which makes it difficult to replicate what you are trying to do. You might just have an indentation error, where the code after `#plotting all columns ['A','B','C'] in-one-window side by side` is still inside the `for i in df:` loop; I don't know if that is intended. Also, could you recheck whether you [marked the answer as accepted](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work)? Without 50 reputation I can't use comments, which would help me answer other questions. – Freya W Jan 22 '19 at 23:25
  • Thanks for your consideration. I hope tomorrow you could help me out by a solution. I just updated dataset with 3 cycles, so that it's possible to increase the (1) to (2) or (3) to check it out for next cycles `for cycle in range(1):`. As you mentioned I removed `[i]` and got rid of **Key errors** but still there would be another error about not defining `df3` and `new_value3` which are weird since after normalizing I defined for `'C'`, `'A'`, `'B'` respectively. The reason I haven't marked the answer is just because I posted it yesterday and someone might have another interesting answer! – Mario Jan 23 '19 at 03:22
  • @Mario I'm still working on getting enough reputation to be able to comment on all posts, if it was helpful, could you mark this as your accepted answer? – Freya W Jan 25 '19 at 09:19
  • ofc It was helpful, I'm so happy I've got in touch with you and I'm wondering if it's possible to stay in touch with you via e-mail? Since you're experienced person and I've got some questions regarding my scripts which are related to **Machine Learning** it would be great idea and I will keep asking here to learn more from skillful people like you. I believe that my questions for you would be source of collecting reputation for you in SO hopefully other people also can take benefit of'em. Happy to be here in SO. I leave mine plz shoot me an e-mail: **clevilll@yahoo.com** – Mario Jan 26 '19 at 19:42