Panda value_counts index output looks weird

Question

df = pd.read_csv('./data/flights_2015.csv', low_memory=False)
print('Dataframe dimensions:', df.shape)

Dataframe dimensions: (5819079, 31)

I tried to count how many flights there for each airport in the entire dataset using

count_flights = df['ORIGIN_AIRPORT'].value_counts()

Its output looks like this:

ATL      346836
ORD      285884
DFW      239551
DEN      196055
LAX      194673
          ...  
13541        11
10165         9
14222         9
13502         6
11503         4

Most of counts look correct, but why did I get 13541, 10165, 14222, 13502, 11503 these numbers in the index column? flights_2015.csv does not even have that kind of number on 'ORIGIN_AIRPORT' column.

What happened here?

Please provide a [mre]. Maybe, if it's a publicly available dataset, you could post a link? Otherwise, see [How to make good reproducible pandas examples](/q/20109391/4518341). — wjandrea, Apr 01 '23 at 03:37
Is [this](https://www.kaggle.com/datasets/usdot/flight-delays) the dataset? It does in fact have all those numbers in the `ORIGIN_AIRPORT` column. I'm not sure why you think it doesn't. — wjandrea, Apr 01 '23 at 03:52
there must be those # in your dataset. To check, run `df.loc[df['ORIGIN_AIRPORT']==13541]`, this will show you those data — PTQuoc, Apr 01 '23 at 04:25
You can imagine you have the code for all US airports and a number for other country airports (return flight)? — Corralien, Apr 01 '23 at 05:55

insanity · Answer 1 · 2023-04-01T16:06:10.657

It seems that the dataset being referred here is https://www.kaggle.com/datasets/usdot/flight-delays

As df.shape is (5819079, 31) (same as mentioned in question)

import pandas as pd
import os

df = pd.DataFrame()
try: 
     df = pd.read_csv(os.path.join("./flights.csv"), sep=',')
except IOError as e:
     print (e)


df_1 = df.groupby(['AIRLINE','ORIGIN_AIRPORT'])['FLIGHT_NUMBER'].agg(lambda x: pd.to_numeric(x, errors='coerce').count()).reset_index()
df_1.head()

    AIRLINE ORIGIN_AIRPORT  FLIGHT_NUMBER
0   AA  10140   174
1   AA  10257   80
2   AA  10299   32
3   AA  10397   917
4   AA  10423   786

OR If you need to sum

df_1 = df.groupby(['AIRLINE','ORIGIN_AIRPORT'])['FLIGHT_NUMBER'].agg(lambda x: pd.to_numeric(x, errors='coerce').sum()).reset_index()

AIRLINE ORIGIN_AIRPORT  FLIGHT_NUMBER
0   AA  10140   294725
1   AA  10257   154277
2   AA  10299   23914
3   AA  10397   1154031
4   AA  10423   1001708


df_1 = df.groupby(['ORIGIN_AIRPORT'])['FLIGHT_NUMBER'].agg(lambda x: pd.to_numeric(x, errors='coerce').sum()).reset_index()

df_1.head()

ORIGIN_AIRPORT  FLIGHT_NUMBER
0   10135   1144242
1   10136   624218
2   10140   3382579
3   10141   493757
4   10146   433521

Panda value_counts index output looks weird

1 Answers1