0

I want to count the number of data points plotted on my graph so I can keep a running total of graphed data. The problem is that my data table contains NaN values that don't line up across columns: a row may have NaN in one column while another column may or may not have a value in that same row. For example:

# I use num1 as my y-coordinate and num1-num2 for my x-coordinate.
num1 num2 num3 
1    NaN  25 
NaN  7    45
3    8    63
NaN  NaN  23
5    10   42
NaN  4    44

# So in this case, there should be only 2 data points on the graph between num1 and num2. For num1 and num3, there should be 3. There should be 4 data points between num2 and num3.

I believe Matplotlib doesn't plot a row when either of the two columns contains NaN, since the value is null (please correct me if I'm wrong; I can only tell because no dots appear at the 0 coordinate of the x and y axes). At first I thought I could get away with calling .count() on each column and using the smaller of the two counts as my tracker. As my example above shows, that won't work: the real number can be even LESS than that, because one column may have NaN in a row where the other has an actual value. Here is some code I tried:

# colA and colB are both columns within the DataFrame; the smaller of their
# non-NaN counts is used to "count" how many data points are being graphed.
def findAmountOfDataPoints(colA, colB):
    if colA.count() < colB.count():
        print(colA.count())  # colA has fewer non-NaN values, so print its count.
    else:
        print(colB.count())  # colB has fewer (or equally many) non-NaN values, so print its count.
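
To make the shortfall concrete, here is a minimal sketch that rebuilds the example table and compares the smaller per-column count with the number of rows where both columns actually hold a value. The DataFrame construction and the variable names `smaller_count` and `both_present` are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Example table from above, rebuilt as a DataFrame (assumed construction).
df = pd.DataFrame({
    "num1": [1, np.nan, 3, np.nan, 5, np.nan],
    "num2": [np.nan, 7, 8, np.nan, 10, 4],
    "num3": [25, 45, 63, 23, 42, 44],
})

# Smaller of the two per-column counts: only an upper bound on the plotted points.
smaller_count = min(df["num1"].count(), df["num2"].count())

# Rows where num1 and num2 are both present, i.e. the pairs that can be drawn.
both_present = int((df["num1"].notna() & df["num2"].notna()).sum())

print(smaller_count)  # 3
print(both_present)   # 2
```

The two numbers differ because rows 1, 2 and 6 each have a value in only one of the two columns.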

Also, I thought about using .value_counts(), but I'm not sure that's the function I need for this. Any suggestions?

Edit 1: Changed Data Frame names to make example clearer hopefully.

researchnewbie
  • @TrentonMcKinney The problem with that is explained in my example. I could just find the lowest of the two columns, however, there may be a NaN value in the other column like shown in my example that won't be graphed. I'll change my example to show that. – researchnewbie Oct 09 '19 at 05:32
  • @TrentonMcKinney Okay, I updated my example to explain how that won't work. See how num1 has 3 values that are not NaN and num2 has 4 values that are not NaN? Now, when it comes to graphing those two together, only rows 3 and 5 (2 and 4 if you want it to start at 0) will be graphed. Therefore, there are only 2 data points on the graph, and that can't be found using df.count(). – researchnewbie Oct 09 '19 at 05:35
  • @TrentonMcKinney My apologies, I left that to show that I attempted the problem and didn't get the result I wanted. So, does df.dropna() completely disregard the entire row in the data frame from what I'm getting? I still want to keep that row in my data frame because I have a large column x row that I want to still keep for other graphs. – researchnewbie Oct 09 '19 at 05:44

2 Answers

1

If I understood your problem correctly, and assuming that your table is a pandas DataFrame df, the following code should work:

import numpy as np

(~np.isnan(df['num1']) & ~np.isnan(df['num2'])).sum()

How it works:

np.isnan returns True if a cell is NaN. ~np.isnan is the inverse, so it returns True when the cell is not NaN.

The code checks where both the column "num1" AND the column "num2" contain a non-NaN value; in other words, it yields True for those rows where both values exist.

Finally, those good rows are counted with sum: each True counts as 1 and each False as 0, so the result is the number of rows where both values exist.
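
As a possible variation, the same count can be written with pandas' own notna(), wrapped in a small helper so it can be reused for any pair of columns. This is only a sketch: the helper name `count_plotted_points` and the rebuilt example table are assumptions, not part of the original post. It also cross-checks against dropna(subset=...), which returns a new DataFrame by default and leaves df untouched.

```python
import numpy as np
import pandas as pd

def count_plotted_points(df, x_col, y_col):
    """Count rows where both columns are non-NaN (hypothetical helper name)."""
    return int(df[[x_col, y_col]].notna().all(axis=1).sum())

# Example table from the question, rebuilt here so the sketch runs on its own.
df = pd.DataFrame({
    "num1": [1, np.nan, 3, np.nan, 5, np.nan],
    "num2": [np.nan, 7, 8, np.nan, 10, 4],
    "num3": [25, 45, 63, 23, 42, 44],
})

pairs = [("num1", "num2"), ("num1", "num3"), ("num2", "num3")]
counts = {pair: count_plotted_points(df, *pair) for pair in pairs}
print(counts)
# {('num1', 'num2'): 2, ('num1', 'num3'): 3, ('num2', 'num3'): 4}

# Cross-check: dropna(subset=...) returns a filtered copy; df keeps all 6 rows.
dropna_count = len(df.dropna(subset=["num1", "num2"]))
print(dropna_count, len(df))  # 2 6
```

Because neither approach modifies df, the full table stays available for other graphs.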

Giallo
  • Man, that's a great way to check. Theoretically it makes sense too. So does np.isnan() go through each cell every time? – researchnewbie Oct 09 '19 at 05:38
  • In a way, yes. np.isnan() checks the input array (in this case a DataFrame column) and returns a boolean array of the same shape, with True for the cells that were NaN in the input array and False for the cells that weren't. If you use ~np.isnan it will be the opposite. – Giallo Oct 09 '19 at 05:50
0

The way I understood it, what is needed is the number of combinations of points that are not NaN. Using a function I found, I came up with this:

import pandas as pd
import numpy as np

def choose(n, k):
    """
    A fast way to calculate binomial coefficients by Andrew Dalke (contrib).
    https://stackoverflow.com/questions/3025162/statistics-combinations-in-python
    """
    if 0 <= k <= n:
        ntok = 1
        ktok = 1
        for t in range(1, min(k, n - k) + 1):
            ntok *= n
            ktok *= t
            n -= 1
        return ntok // ktok
    else:
        return 0


data = {'num1': [1, np.nan,3,np.nan,5,np.nan],
        'num2': [np.nan,7,8,np.nan,10,4],
        'num3': [25,45,63,23,42,44]
        }

df = pd.DataFrame(data)

df['notnulls'] = df.notnull().sum(axis=1)

df['plotted'] = df.apply(lambda row: choose(int(row.notnulls), 2), axis=1)
print(df)
print("Total data points: ", df['plotted'].sum())

With this result:

   num1  num2  num3  notnulls  plotted
0   1.0   NaN    25         2        1
1   NaN   7.0    45         2        1
2   3.0   8.0    63         3        3
3   NaN   NaN    23         1        0
4   5.0  10.0    42         3        3
5   NaN   4.0    44         2        1
Total data points:  9
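
The total of 9 covers every pair of columns at once. As a rough cross-check, the sketch below breaks that total down per column pair and recomputes the row-wise total with the standard library's math.comb (Python 3.8+) in place of the choose function. The names `value_cols`, `pair_counts` and `row_total` are assumptions introduced for illustration.

```python
from itertools import combinations
from math import comb  # available from Python 3.8

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "num1": [1, np.nan, 3, np.nan, 5, np.nan],
    "num2": [np.nan, 7, 8, np.nan, 10, 4],
    "num3": [25, 45, 63, 23, 42, 44],
})
value_cols = ["num1", "num2", "num3"]

# Per-pair count: rows where both columns of the pair are non-NaN.
pair_counts = {
    (a, b): int((df[a].notna() & df[b].notna()).sum())
    for a, b in combinations(value_cols, 2)
}

# Row-wise total: for each row, the number of pairs among its non-NaN values.
notnulls = df[value_cols].notna().sum(axis=1)
row_total = int(notnulls.map(lambda n: comb(int(n), 2)).sum())

print(pair_counts)
# {('num1', 'num2'): 2, ('num1', 'num3'): 3, ('num2', 'num3'): 4}
print(sum(pair_counts.values()), row_total)  # 9 9
```

The per-pair counts add up to the same 9, so a single graph of two columns corresponds to one entry of `pair_counts` rather than to the overall total.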
Oleg