
I have a dataframe with 2 columns (time and pressure).

timestep value
    0    393
    1    389
    2    402
    3    408
    4    413
    5    463
    6    471
    7    488
    8    422
    9    404
    10   370

I first need to find the frequency of each pressure value and rank them (`df['freq_rank']`), which works fine. But when I mask the dataframe by comparing the column against the `count` value and take the interval difference, I get NaN results:

import numpy as np
import pandas as pd

df = pd.read_csv('copy.csv', delimiter=";")
df.columns = ["Timestamp", "Pressure"]

## Timestep as int
df = pd.DataFrame({'timestep':np.arange(3284), 'value': df.Pressure})

## Rank of the frequency of each value in the df
vcs = {v: i for i, v in enumerate(df.value.value_counts().index)}
df['freq_rank'] = df.value.apply(vcs.get)
print(df.freq_rank)


>>Output:
>>0    131
>>1    235
>>2     99
>>3     99
>>4    101
>>5    101
>>6    131
>>7     79
>>8     79



## Find most frequent value
count = df['value'].value_counts().sort_values(ascending=[False]).nlargest(10).index.values[0] 

## Mask the DF by comparing the column against count value & find interval diff.
x = df.loc[df['value'] == count, 'timestep'].diff()
print(x)

>>Output:
>>50        1.0
>>112      62.0
>>215     103.0
>>265      50.0
>>276      11.0
>>277       1.0
>>278       1.0
>>318      40.0
>>366      48.0
>>367       1.0
>>368       1.0
>>372       4.0

df['freq'] = df.value.apply(x.get)
print(df.freq)

>>Output:
>>0    NaN
>>1    NaN
>>2    NaN
>>3    NaN
>>4    NaN
>>5    NaN
>>6    NaN
>>7    NaN
>>8    NaN

I don't understand why `print(x)` returns the right output and `print(df['freq'])` returns NaN.

joasa
    Can you create a [mcve](http://stackoverflow.com/help/mcve) please? See [how to create good reproducible pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Julien Marrec Dec 09 '16 at 12:56
  • What further information do you need? I have also included a piece of my dataframe. – joasa Dec 09 '16 at 13:01

1 Answer


I think your problem is with the last statement, `df['freq'] = df.value.apply(x.get)`.

If you just want to copy `x` into the new column `df['freq']`, you can do:

df['freq'] = x

Then `print(df.freq)` will give you the same result as your `print(x)` statement.


Update: Your problem is with the indices. `df` (your sample) only has index values 0-10, whereas `x` has 50, 112, 215, ... When you assign a Series to a DataFrame column, pandas aligns on the index, so only values whose index labels already exist in `df` are added; every other row gets NaN.
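A minimal sketch of that alignment behaviour, using made-up values rather than the asker's data:

```python
import pandas as pd

# df has index 0, 1, 2; x has index 50, 112 (like the diff result above)
df = pd.DataFrame({"value": [393, 389, 402]})
x = pd.Series([1.0, 62.0], index=[50, 112])

# Assignment aligns on index labels: none of x's labels (50, 112)
# exist in df's index (0, 1, 2), so every row of the new column is NaN.
df["freq"] = x
print(df["freq"].isna().all())  # prints True
```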

wonderkid2
  • I have tried that. Even if I do `df['freq'] = x`, when I try `print(df)` or `print(df.freq)` I still see NaN values – joasa Dec 09 '16 at 13:10
  • What does `print(x)` give you? – wonderkid2 Dec 09 '16 at 14:15
  • You can see that in the question – joasa Dec 09 '16 at 14:19
  • See my answers update. I think the problem is with the index. – wonderkid2 Dec 12 '16 at 09:24
  • You're right, that's the problem. But how can I solve it? – joasa Dec 12 '16 at 09:40
  • Depends on what you want to achieve I guess. You can use the `reset_index` function to make the index of `x` match `df`, but that would assign the values from `x` to just the first 10 values of `df`. You can also remove the `nlargest(10)` function and get all the values, for all the indices of `df`. – wonderkid2 Dec 12 '16 at 10:32
  • What I want to achieve is to create a column in my df, called 'freq', the same way I created the columns 'value', 'timestep' and 'freq_rank', and to see all 4 columns every time I print df. The `nlargest(10)` I need in order to find the most frequent values. – joasa Dec 12 '16 at 10:57
  • You're only calculating the frequency for the 10 largest frequencies, so only 10 rows in your dataframe are going to have values in the `freq` column. If you need the frequency to be calculated for all rows in the dataframe you need to remove the `nlargest(10)` function call. – wonderkid2 Dec 12 '16 at 13:28
  • But if I remove the `nlargest(10)`, then I will not be able to find the most frequent value. I changed it to `nlargest()`, and the result is still NaN – joasa Dec 12 '16 at 15:04
  • `nlargest()` defaults to 5, so removing the argument won't change anything. Remove the function call altogether, then select the top frequencies once all the frequencies have been added to the dataframe: `df['freq'].nlargest(10)`. – wonderkid2 Dec 14 '16 at 12:51
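Putting the comment thread together, one way to get a `freq` column that aligns with `df`'s index is to compute the timestep diff within each value group, so every row gets its own interval. This is a sketch with made-up sample values, not code from the original post:

```python
import pandas as pd

df = pd.DataFrame({
    "timestep": range(6),
    "value": [393, 389, 393, 408, 389, 393],
})

# Interval between successive occurrences of the same pressure value.
# groupby(...).diff() preserves df's original index, so the assignment
# aligns row for row (first occurrence of each value stays NaN).
df["freq"] = df.groupby("value")["timestep"].diff()

# The most frequent values can then be selected afterwards, e.g. top 10:
top = df["value"].value_counts().nlargest(10)
```

This avoids the index mismatch entirely, because the diff result is never restricted to a masked subset before being assigned back.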