2

So I have a large dataframe, using pandas.

When I do max(df['A']) it reports a max of 9999 when it should be 396450 by observation.

import numpy as numpy
import pandas as pd

f = open("20170901.as-rel2.txt", 'r')
#read file into array, ignore first 6 lines
lines = loadtxt("20170901.as-rel2.txt", dtype='str', comments="#", delimiter="|", unpack=False)
#ignore col 4
lines=lines[:, :3]
#convert to dataframe
df = pd.DataFrame(lines, columns=['A', 'B', 'C'])

After finding the max I have to count each node(col 'A') and say how many times it is repeated.

Here is a sample of the file:

df=
                 A       B   C
    0            2   45714   0
    1            2   52685  -1
    2            3     293   0
    3            3   23248  -1
    4            3  133296   0
    5            3  265301  -1
    6            5   28599  -1
    7            5   52352   0
    8            5  262879  -1
    9            5  265048  -1
    10           5  265316  -1
    11          10   46392   0
    .....
    384338  396238   62605  -1
    384339  396371    3785  -1
    384340  396434   35039  -1
    384341  396450    2495  -1
    384342  396450    5078  -1

    Expect:
    [1, 0
    2, 2
    3, 4
    4, 0
    5, 5
    10, 1
    ....]

I was going to run a for loop of i <= maxvalue (the maxvalue exceeds the number of rows). and use counter. What is the the most effective method?

TrebledJ
  • 8,713
  • 7
  • 26
  • 48

2 Answers2

2

Using Categorical with value_counts

df.A=pd.Categorical(df.A,categories=np.arange(1,max(df.A)+1))
df.A.value_counts().sort_index()
Out[312]: 
1    0
2    2
3    4
4    0
5    5
6    0
7    0
8    0
9    0
Name: A, dtype: int64
BENY
  • 317,841
  • 20
  • 164
  • 234
2

np.bincount

pd.Series(np.bincount(df.A))

0     0
1     0
2     2
3     4
4     0
5     5
6     0
7     0
8     0
9     0
10    1
dtype: int64
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • That helped thank you! Now I have something a bit more of a challenge. On a similar note. What if in col A, there was a multiple entries in one line, for ex 396450_396434,396371,396238. I still need to count the occurrence of each number. Any suggestions? read it in another way? – Jim Hubbard Nov 12 '18 at 01:29