0
f = open("routeviews-rv2-20181110-1200.pfx2as", 'r')
#read file into array, ignore first 6 lines
lines = loadtxt("routeviews-rv2-20181110-1200.pfx2as", dtype='str', 
delimiter="\t", unpack=False)
#convert to dataframe
df = pd.DataFrame(lines,columns=['IPPrefix', 'PrefixLength', 'AS'])
series = df['AS'].astype(str).str.replace('_', ',').str.split(',')
arr = numpy.array(list(chain.from_iterable(series)))
ASes= pd.Series(numpy.bincount(arr))

ValueError: invalid literal for int() with base 10: '31133_65500,65501'

I want to count each time an item appears in col AS. However some lines have multiple entries that need to be counted.

Refer to: Python Find max in dataframe column to loop to find all values

Txt file: http://data.caida.org/datasets/routing/routeviews-prefix2as/2018/11/

But that cannot count line 67820 below.

    Out[94]: df=
              A                 B       C
0             1.0.0.0           24  13335
1             1.0.4.0           22  56203
2             1.0.4.0           24  56203
3             1.0.5.0           24  56203
              ...          ...    ...
67820    1.173.142.0            24  31133_65500,65501
              ...          ...    ...
778719  223.255.252.0           24  58519
778720  223.255.254.0           24  55415

The _ is not a typo, that is how it appears in the file.

Desired output.
1335     1
...     ..
31133    1
...     ..
55415    1 
...     ..
56203    3
...     ..
58159    1
...     ..
65500    1
65501    1
...     ..

1 Answers1

1

replace + split + chain

You can replace _ with ,, split and then chain before using np.bincount:

from itertools import chain

series = df['A'].astype(str).str.replace('_', ',').str.split(',')
arr = np.array(list(chain.from_iterable(series))).astype(int)

print(pd.Series(np.bincount(arr)))

0     0
1     0
2     2
3     4
4     1
5     6
6     1
7     0
8     0
9     0
10    1
dtype: int64
jpp
  • 159,742
  • 34
  • 281
  • 339
  • .astype(int) have error OverflowError: Python int too large to convert to C long .astype('int64') gave ValueError: invalid literal for int() with base 10: '' any suggesitons? – Jim Hubbard Nov 17 '18 at 17:44
  • @JimHubbard, Works fine for me, with the data you've provided (which, admittedly, doesn't include types). You can try providing a [mcve], see [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – jpp Nov 17 '18 at 17:44
  • I know, that was a sample. My file is Length: 778721..with numbers similar in magnitude – Jim Hubbard Nov 17 '18 at 17:46
  • @JimHubbard, Yup, it's tough to help without representative data. If you can find out the row that's causing problems, it'll give me a chance of being able to debug. – jpp Nov 17 '18 at 17:47
  • sorry about that. I provided the actual df above. I did not consider length in my previous example. Note it is now col C, sorry for the confusion. – Jim Hubbard Nov 17 '18 at 18:04
  • I still see no reason why my code shouldn't work given your data. It would be helpful if you could debug and isolate the *single row* causing you problems. – jpp Nov 17 '18 at 18:16
  • I have been playing with it all day. I can not isolate it I tried debugging and got lost in the layers. I attached a link in the question with the text file(routeviews-rv2-20181110-1200). Note the updated code. – Jim Hubbard Nov 19 '18 at 02:33
  • any new suggestions? – Jim Hubbard Nov 20 '18 at 16:24