-2
df = pd.read_csv('./data/flights_2015.csv', low_memory=False)
print('Dataframe dimensions:', df.shape)
Dataframe dimensions: (5819079, 31)

I tried to count how many flights there for each airport in the entire dataset using

count_flights = df['ORIGIN_AIRPORT'].value_counts()

Its output looks like this:

ATL      346836
ORD      285884
DFW      239551
DEN      196055
LAX      194673
          ...  
13541        11
10165         9
14222         9
13502         6
11503         4

Most of counts look correct, but why did I get 13541, 10165, 14222, 13502, 11503 these numbers in the index column? flights_2015.csv does not even have that kind of number on 'ORIGIN_AIRPORT' column.

What happened here?

mario119
  • 343
  • 3
  • 14
  • Please provide a [mre]. Maybe, if it's a publicly available dataset, you could post a link? Otherwise, see [How to make good reproducible pandas examples](/q/20109391/4518341). – wjandrea Apr 01 '23 at 03:37
  • 1
    Is [this](https://www.kaggle.com/datasets/usdot/flight-delays) the dataset? It does in fact have all those numbers in the `ORIGIN_AIRPORT` column. I'm not sure why you think it doesn't. – wjandrea Apr 01 '23 at 03:52
  • there must be those # in your dataset. To check, run `df.loc[df['ORIGIN_AIRPORT']==13541]`, this will show you those data – PTQuoc Apr 01 '23 at 04:25
  • You can imagine you have the code for all US airports and a number for other country airports (return flight)? – Corralien Apr 01 '23 at 05:55

1 Answers1

0

It seems that the dataset being referred here is https://www.kaggle.com/datasets/usdot/flight-delays

As df.shape is (5819079, 31) (same as mentioned in question)

import pandas as pd
import os

df = pd.DataFrame()
try: 
     df = pd.read_csv(os.path.join("./flights.csv"), sep=',')
except IOError as e:
     print (e)


df_1 = df.groupby(['AIRLINE','ORIGIN_AIRPORT'])['FLIGHT_NUMBER'].agg(lambda x: pd.to_numeric(x, errors='coerce').count()).reset_index()
df_1.head()

    AIRLINE ORIGIN_AIRPORT  FLIGHT_NUMBER
0   AA  10140   174
1   AA  10257   80
2   AA  10299   32
3   AA  10397   917
4   AA  10423   786

OR If you need to sum

df_1 = df.groupby(['AIRLINE','ORIGIN_AIRPORT'])['FLIGHT_NUMBER'].agg(lambda x: pd.to_numeric(x, errors='coerce').sum()).reset_index()

AIRLINE ORIGIN_AIRPORT  FLIGHT_NUMBER
0   AA  10140   294725
1   AA  10257   154277
2   AA  10299   23914
3   AA  10397   1154031
4   AA  10423   1001708


df_1 = df.groupby(['ORIGIN_AIRPORT'])['FLIGHT_NUMBER'].agg(lambda x: pd.to_numeric(x, errors='coerce').sum()).reset_index()

df_1.head()

ORIGIN_AIRPORT  FLIGHT_NUMBER
0   10135   1144242
1   10136   624218
2   10140   3382579
3   10141   493757
4   10146   433521
insanity
  • 26
  • 4