I was wondering whether we can count how often each of the digits 0-9 occurs as the leftmost digit of the index of a pandas dataframe:

             A          B           C
0   -56.343656        NaN -418.540483
10  -87.577880 -16.061497         NaN
20         NaN -15.337254         NaN
40  -67.462841        NaN -431.924830
50  -63.377158 -28.260790         NaN
60         NaN -22.996095         NaN
130  11.569845        NaN -307.034737
180  11.398947  -1.793530         NaN

I've extracted the indexes of the rows that contain NaN, including leading zeros:

000
010
020
040
050
060
130
180

and tried to store the extracted indexes in a csv file. Then I tried to put them in a data frame again, based on the principle that the leftmost digit can be [0-7], the 2nd leftmost part can be [0-59] and the last part can be [0-9999], and to store that in a csv file so I can further process just the 'Section' column, which represents the leftmost digit.

My script is as follows:

import numpy as np
import pandas as pd


df = pd.read_csv(r'D:\SOF.TXT', header=None)
id_set = df[df.index % 4 == 0].astype('int').values
A = df[df.index % 4 == 1].values
B = df[df.index % 4 == 2].values
C = df[df.index % 4 == 3].values

data = {'A': A[:,0], 'B': B[:,0], 'C': C[:,0]}
#main_data contains all the data
main_data = pd.DataFrame(data, columns=['A','B','C'], index = id_set[:,0])  


main_data[np.isinf(main_data)] = np.nan    # convert inf to nan
main_data_nan = main_data[main_data.isnull().any(axis=1)]   # extract sub data frame
print(main_data_nan)
# zero-pad the index to 3 digits
new_index = [str(x).zfill(3) for x in main_data_nan.index]
main_data_nan.index = new_index
# write all rows that contain NaN values to a csv file
main_data_nan.to_csv('nan_data.csv', na_rep='NaN')   # export
# write only column C of those rows, together with the index (id_set), to a csv file
main_data_nan['C'].to_csv('nan_datatemp.csv', na_rep='NaN')
# print every index (id_set) of a row that has NaN values
for i in range(len(main_data_nan)):
    print (main_data_nan.index[i])


dff = pd.read_csv(r"D:\nan_datatemp.csv")
cycle, section, cell = [], [], []
for i in range(9999):
    for j in range(8):
        for k in range(60):
            cycle.append(i)
            section.append(j)
            cell.append(k)
dfff = {'Section':section, 'Cell':cell, 'Cycle':cycle}
dffff = pd.DataFrame(dfff, columns=['Section','Cell', 'Cycle'])
dffff.to_csv('exit_id_det.csv', encoding='utf-8', index=False)

I'm not sure whether the answer here could solve my problem by applying (df==X).sum() to the leftmost digit of the index, like:

(df==0).sum()
(df==1).sum()
(df==2).sum()
(df==3).sum()
(df==4).sum()
(df==5).sum()
(df==6).sum()
(df==7).sum()

and even using main_data_nan.isnull().sum().sum() to compute the percentage frequency of occurrence. My desired results should look like:

index ----> frequency ---> percentage
000 ----> 0: 6 times ----> 0: 75% in total
010 ----> 1: 2 times ----> 1: 25% in total
020 ----> 2: 0 times ----> 2: 0% in total
040 ----> 3: 0 times ----> 3: 0% in total
050 ----> 4: 0 times ----> 4: 0% in total
060 ----> 5: 0 times ----> 5: 0% in total
130 ----> 6: 0 times ----> 6: 0% in total
180 ----> 7: 0 times ----> 7: 0% in total

Here below is my dataset sample: dataset sample DL link

Mario
  • Could you clarify your question and include an example of what you want exactly? I'm a bit confused. – Joe Patten Jan 17 '19 at 04:02
  • @JoePatten I'm interested in **frequency occurrence** of leftmost digit of extracted `index` from those columns that have NaN in panda dataframe eg. **0**00 ,**0**10 ,**0**20 ,**0**40 ,**0**50,**0**60,**1**30,,**1**80 **0**: 6 times 75% in total and **1**: 2 times 25% in total – Mario Jan 17 '19 at 11:36
  • So I'm assuming that the indexes < 1000, so I won't be seeing an index like 2010? – Joe Patten Jan 17 '19 at 11:43
  • @JoePatten My case **indexes format** could be maximum [0-7] [0-59] [0-9999] it means I could have : `0` 29 3333 , `7` 59 9999 but I'm interested in `leftmost` digit ones but if you mean **numbers of indexes** would be < 1000? it's is not right perhaps I have 2000 NaN values in dataset which their indexes total numbers would be 2000 – Mario Jan 17 '19 at 12:14
  • Are you trying to extract the left most digit for each of those sections (in this case section, cell and cycle)? For example if you have the number `006888` then `section=0`, `cell=0`, `cycle=8`? – Joe Patten Jan 17 '19 at 12:23
  • exactly but in your example would be `section=0` `cell=0` `cycle=6888` as I mentioned left most digit could be [0-7] as Section, 2nd leftmost digit could be [0-59] as cell and finally the last leftmost digit could be [0-9999] as cycle, your assumption was right for this index `068` actually as long as they are single number like 0 6 8 they'll be shown as `068` I mean we don't have this `0 06 08`. To sum it up 1st leftmost digit is always single digit, 2nd leftmost digit could be double digit after passed 9 and turn over to 10, 11, ..,59. The last digit could be increased to 4 digit too. – Mario Jan 17 '19 at 14:02

1 Answer

Make a new column named index from the index, then condense your dataframe so that you keep only the rows (observations) that contain null values. Then count the left most digit for each section.

import collections

df['index'] = df.index
null_data = df[df.isnull().any(axis=1)]
cycle_left = collections.Counter(null_data['index']//100%10)
cell_left = collections.Counter(null_data['index']//10000%10)
section_left = collections.Counter(null_data['index']//100000%10)

The output will be a dictionary where the keys are the left most digits, and the values are the frequencies for each digit.

Now, let's assume that you have null values at indexes 427888, 101, 6123456.

Example Output:

Counter({1: 1, 8: 1, 4: 1})     #cycle_left
Counter({2: 2, 0: 1})           #cell_left
Counter({0: 1, 4: 1, 1: 1})     #section_left
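
To check these counts yourself, here is a minimal self-contained sketch. The frame, its columns A and B, and the extra complete row at index 5 are made up for illustration; only the three indexes with null values come from the example above.

```python
import collections

import numpy as np
import pandas as pd

# Hypothetical frame: null values at indexes 427888, 101 and 6123456;
# the row at index 5 is complete and should be filtered out.
df = pd.DataFrame(
    {"A": [np.nan, 1.0, np.nan, 2.0], "B": [3.0, np.nan, 4.0, 5.0]},
    index=[427888, 101, 6123456, 5],
)

df["index"] = df.index
null_data = df[df.isnull().any(axis=1)]

# Same integer arithmetic as above: floor-divide, then keep the last digit.
cycle_left = collections.Counter(null_data["index"] // 100 % 10)
cell_left = collections.Counter(null_data["index"] // 10000 % 10)
section_left = collections.Counter(null_data["index"] // 100000 % 10)

print(cycle_left)
print(cell_left)
print(section_left)
```

The order in which the keys are printed may differ between Python versions, but the counts should match the example output.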

You could also use .value_counts() in the last step to get a series containing frequencies (the index would be the left most digit and the value would be the frequency for that digit).

df['index'] = df.index
null_data = df[df.isnull().any(axis=1)]
cycle_left = (null_data['index']//100%10).value_counts()
cell_left = (null_data['index']//10000%10).value_counts()
section_left = (null_data['index']//100000%10).value_counts()

Example Output:

#cycle_left
4    1
1    1
8    1
Name: index, dtype: int64

#cell_left
2    2
0    1
Name: index, dtype: int64

#section_left
1    1
4    1
0    1
Name: index, dtype: int64
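
If you also want the percentages from your expected output, one possible sketch is below. It assumes the index is zero-padded to a fixed width of three characters, as in your sample, so the first character is the section digit; the names `padded`, `leftmost`, `freq` and `summary` are made up here, and the frame is a stand-in that only reuses your sample index values.

```python
import numpy as np
import pandas as pd

# Index values from the question's sample; every one of those rows has a NaN.
idx = [0, 10, 20, 40, 50, 60, 130, 180]
main_data_nan = pd.DataFrame({"A": np.nan}, index=idx)  # stand-in data

# Zero-pad to three characters as in the question, then take the first character.
padded = pd.Series([str(i).zfill(3) for i in main_data_nan.index])
leftmost = padded.str[0].astype(int)

# Count each digit 0-7, keeping digits that never occur with a count of 0.
freq = leftmost.value_counts().reindex(range(8), fill_value=0)
summary = pd.DataFrame({"frequency": freq, "percentage": 100 * freq / freq.sum()})
print(summary)
```

For this sample, digit 0 gets a frequency of 6 (75%) and digit 1 a frequency of 2 (25%). With longer indexes the padding width would have to match your real index format.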
Joe Patten
  • Hi Joe, by changing to `df['index'] = int(df.index[i])` if you mean I've just got below result regarding my dataframe which is right but what about another one also 1 repeated 2 times! 0 6 Name: index, dtype: int64 but it doesn't work for cell and cycle not at all! it would be nice if you follow and apply my dataset to see what I see then we might achieve the goal quickly. thanks man! – Mario Jan 17 '19 at 14:46
  • Can you put your dataset into a better format? Post the dictionary of the dataset if you can (by using `df.to_dict()`). – Joe Patten Jan 17 '19 at 20:19
  • I'm afraid it's unnecessary otherwise I can find answer [here](https://stackoverflow.com/questions/26716616/convert-a-pandas-dataframe-to-a-dictionary) – Mario Jan 18 '19 at 00:02
  • man would u do me a favor and have a look to this [question](https://stackoverflow.com/questions/54489201/how-can-make-a-dataset-of-elements-of-matrices-in-dataframe) You might have an idea? – Mario Feb 03 '19 at 16:40