Filtering Excel Document Data In Jupyter Notebook Using Pandas

Question

I have a code Filtering Data, that I wan't displayed from an Excel Document using Pandas in Jupyter Notebook. It is for a UK RAF Historic Aircraft Display Team, Year 2009 Appearance Schedule.

Here is my Python Code :-

import pandas as pd

xls = pd.ExcelFile(r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')

data = pd.read_excel(xls, sheet_name="Sheet1")

pd.options.display.max_rows = 1000

df = pd.DataFrame(data, columns= ['Venue','A/C','DISPLAY/','Date','BID'])

df[(df['Venue'].str.contains('[a-zA-Z]') & (df['DISPLAY/'].str.contains('DISPLAY') & df['A/C'].str.contains("DHS|DAK|HS|SPIT")) & (df['A/C'] != 'LHS') & (df['A/C'] != 'LANC'))]

I am unsure what to type, to filter the Data, for when the Numerical Value in the BID Column, is the same in the BID column in the next row. And also in addition, only when one of the Aircraft in the A/C Column, where both Numerical Values in the BID Column below and above are the same, is DAK, and excluding that principle, only if in a row for the A/C Column shows DHS Could someone please tell me, what I should add to my Python Code to enable this, it would be much appreciated if someone could.

Also for example with the filtered data for example, I would like :-

Output:

145     SCARBOROUGH     DAK     DISPLAY     2008-05-25 00:00:00     610
150     SCARBOROUGH     SPIT    DISPLAY     2008-05-25 00:00:00     610

Changed to showing the following, i.e. merging the two lines together :-

Output:

SCARBOROUGH     DS  DISPLAY     2008-05-25 00:00:00     610

And

Output:

173     TARRANT RUSHDEN     HS  DISPLAY     NaN     132
174     TARRANT RUSHDEN     DAK     DISPLAY     NaN     132

Changed to showing :-

Output:

TARRANT RUSHDEN     DHS     DISPLAY     NaN     132

I mean changed to showing, for all those occurrences,

Not just for those two Venues.

Here is an Sample Of My Output Data :-

Venue   A/C     DISPLAY/    Date    BID
25  SHUTTLEWORTH    DAK     DISPLAY     NaN     529
55  KEMBLE  DAK     DISPLAY     NaN     461
69  NORTHWICH   SPIT    DISPLAY     2008-05-10 00:00:00     514
72  POCKLINGTON     SPIT    DISPLAY     2009-05-10 00:00:00     821
75  BERLIN  DAK     DISPLAY     2008-05-12 00:00:00     587
78  MILDENHALL  SPIT    DISPLAY     2009-05-15 00:00:00     920
93  DUXFORD     HS  DISPLAY     NaN     611
103     CRANWELL    HS  DISPLAY     2008-05-20 00:00:00     44
145     SCARBOROUGH     DAK     DISPLAY     2008-05-25 00:00:00     610
150     SCARBOROUGH     SPIT    DISPLAY     2008-05-25 00:00:00     610
151     CORBRIDGE   SPIT    DISPLAY     NaN     353
167     BRIDGEND-CNX    SPIT    DISPLAY     2008-05-31 00:00:00     527
173     TARRANT RUSHDEN     HS  DISPLAY     NaN     132
174     TARRANT RUSHDEN     DAK     DISPLAY     NaN     132
179     NORTHOLT    SPIT    DISPLAY     2009-06-05 00:00:00     870
214     BRIZE NORTON    HS  DISPLAY     NaN     939
218     ROPLEY  HS  DISPLAY     2008-06-13 00:00:00     355
223     THWAITES    HS  DISPLAY     NaN     364
231     ROPLEY  HS  DISPLAY     NaN     355
240     COSFORD     HS  DISPLAY     2008-06-14 00:00:00     667
241     QUORN   HS  DISPLAY     NaN     314
244     COSFORD     DAK     DISPLAY     2008-06-14 00:00:00     NaN
260     REDHILL     SPIT    DISPLAY     NaN     686
269     KEMBLE  DAK     DISPLAY     NaN     316
270     KEMBLE  HS  DISPLAY     NaN     316
280     KEMBLE  SPIT    DISPLAY     2008-06-21 00:00:00     316
285     KEMBLE  DAK     DISPLAY     2008-06-21 00:00:00     316

Here is the Website Link, to the .xls i.e. Excel Document File :-

http://web.archive.org/web/20090804234934/http://www.raf.mod.uk/bbmf/rafcms/mediafiles/F0ED6EA8_1143_EC82_2E4534A1036AA506.xls

You will obviously need to change the following in my Python Code, to whatever you call the .xls File. And the path, of where you save it, on your Computer :-

xls = pd.ExcelFile(r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')

I have changed the end bit of the Code to :-

selected = df.loc[df['A/C'] == 'DS', 'DH', 'DHS']
groupby_venue_date = selected.groupby(['Venue', 'BID', 'DISPLAY/'])
aircraft = groupby_venue_date['A/C std'].apply(''.join).rename('Aircraft-combined')
print(aircraft.shape)
pd.DataFrame(aircraft)

But get a :- IndexingError: Too many indexers message, when I run the Code, what does that mean ? And what has caused the Error Bill ?

This is the Code I am currently running as of 2nd January 2020 :-

import pandas as pd

xls = pd.ExcelFile(r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')

data = pd.read_excel(xls, sheet_name="Sheet1")

pd.options.display.max_rows = 1000

df = pd.DataFrame(data, columns= ['Venue','A/C','DISPLAY/','Date','BID'])

#df[(df['Venue'].str.contains('[a-zA-Z]') & (df['DISPLAY/'].str.contains('DISPLAY') & df['A/C'].str.contains("DHS|DAK|HS|SPIT")) & (df['A/C'] != 'LHS') & (df['A/C'] != 'LANC'))] 
df["Date"].fillna("No Date", inplace = True)

df['A/C'].unique().tolist()

rename_map = {
'DAK': 'D',
'SPIT': 'S',
'LANC': 'L',
'HURRI': 'H',
'PARA': 'P'
}
df['A/C std'] = df['A/C'].replace(rename_map)
print(df['A/C std'].unique().tolist())

#selected = df.loc[df['A/C'] == 'DS', 'DH', 'DHS']
selected = df.loc[df['DISPLAY/'] == 'DISPLAY']

groupby_venue_date = selected.groupby(['Venue', 'BID', 'Date', 'DISPLAY/']) 
aircraft = groupby_venue_date['A/C std'].apply(''.join).rename('Aircraft-combined')
print(aircraft.shape)
pd.DataFrame(aircraft)

Sounds like you want to remove some duplicates which are identical except in the `'A/C'` column. Is that right? But what is the logic for the replacement values `'DS'` and `'DHS'` which appear in that column after the merge? — Bill, Jan 01 '20 at 20:48
Also, could you provide a sample of the input data? Either part of the csv file or part of `df` maybe. Then we can run your script to see what is happening. — Bill, Jan 01 '20 at 20:58
Your nearly right Bill, I actually want to keep the duplicates which are identical, except in the 'A/C' Column. Answering your other point, DS stands for Dakota and Spitfire. And DHS stands for Dakota Spitfire and Hurricane. — Edward Winch, Jan 01 '20 at 20:59
Shall I provide the Website Link, to the xls File, so it can be downloaded ? — Edward Winch, Jan 01 '20 at 21:01
Yes. Good idea. You can edit the question and include the link there. — Bill, Jan 01 '20 at 21:01
How about if we group all the selected entries with the same venue and date and combine all the aircraft codes as a list in the 'A/C' column? Would that suffice? — Bill, Jan 01 '20 at 21:03
I will get the Website Url for you. Then Could you run my Python Code in Jupyter Notebook ? And if my aim will take too long to achieve, we then go with your suggestion ? Eddie — Edward Winch, Jan 01 '20 at 21:08
Hi Bill, Here is the Website Link, to the .XLS File, i.e. Excel Document File :- http://web.archive.org/web/20090804234934/http://www.raf.mod.uk/bbmf/rafcms/mediafiles/F0ED6EA8_1143_EC82_2E4534A1036AA506.xls — Edward Winch, Jan 01 '20 at 21:16
And Bill, I want to lose the other Rows. In terms of the Duplicate Rows, one of the Values must be 'DAK' in the 'A/C' Column. — Edward Winch, Jan 01 '20 at 22:20
With respect to the index error, try this instead `selected = df.loc[df['A/C'].isin(['DS', 'DH', 'DHS'])]`. See [this answer](https://stackoverflow.com/a/12098586/1609514). — Bill, Jan 02 '20 at 21:24

Bill · Answer 1 · 2020-01-02T17:35:24.267

0

I'm not sure I understand exactly what you want to do but I'll try to help by providing some techniques that might help you figure it out.

For example, getting a list of the unique values for a column:

df['A/C'].unique().tolist()

[nan, 'L', 'S', 'H', 'LHS', 'LANC', 'DAK', 'SPIT', 'HS', 'HURRI', 'PARA', 'LSSD', 'LSS', 'SS', 'LH', 'DH', 'DHS', 'SSSHH']

Part of the problem appears to be dealing with these short-hand entries which are combinations of different aircraft. E.g. you said 'DHS' stands for Dakota, Spitfire, and Hurricane. It might be better to deal these non-standard values first before trying to merge the rows. One way is to replace all non-standard values using a dictionary.

For example

rename_map = {
    'DAK': 'D',
    'SPIT': 'S',
    'LANC': 'L',
    'HURRI': 'H',
    'PARA': 'P'
}
df['A/C std'] = df['A/C'].replace(rename_map)
print(df['A/C std'].unique().tolist())

[nan, 'L', 'S', 'H', 'LHS', 'D', 'HS', 'P', 'LSSD', 'LSS', 'SS', 'LH', 'DH', 'DHS', 'SSSHH']

You can then do whatever it is you want. For example, select a sub-set of the data:

selected = df.loc[df['DISPLAY/'] == 'DISPLAY']
assert selected.shape == (202, 6)

And then group rows by selected columns and joining the aircraft codes using the string-join method:

groupby_venue_date = selected.groupby(['Venue', 'Date'])
aircraft = groupby_venue_date['A/C std'].apply(''.join).rename('Aircraft-combined')
assert aircraft.index.duplicated().sum() == 0
print(aircraft.shape)
print(aircraft.head())

(89,)
Venue     Date      
AUDLEM    2008-07-26      S
AYLSHAM   2008-08-31    LHS
BEAULIEU  2008-05-25      H
BELTRING  2008-07-26      L
BENSON    2008-08-27    LHS
Name: Aircraft-combined, dtype: object

Some of the values have been joined:

print(aircraft.unique().tolist())
['S', 'LHS', 'H', 'L', 'D', 'HS', 'HSD', 'SLH', 'DHS', 'SD', 'SSSHH', 'LH', 'DS', 'DH', 'HSL']

UPDATE

You can do other operations on these codes by making functions and using the apply method.

For example sorting the string, or removing duplicated characters (which also happens to sort them).

def sorted_string(s):
    return ''.join(sorted(s))

def remove_duplicate_chars(s):
    return ''.join(set(s))

aircraft = aircraft.apply(remove_duplicate_chars)
print(aircraft.unique().tolist())

['S', 'LHS', 'H', 'L', 'D', 'HS', 'DHS', 'DS', 'LH', 'DH']

edited Jan 02 '20 at 17:35

answered Jan 01 '20 at 22:12

Bill

10,323
10
62
85

Thankyou so much Bill, for all your help this evening. Just wondering what do the following lines of Code mean ? :- assert selected.shape == (202, 6) + assert aircraft.index.duplicated().sum() == 0 ? – Edward Winch Jan 01 '20 at 23:14
And what would I need to type, so that all, the DHS, DS, DH, are displayed ? Is it possible to have the Data displayed changed to the latest, with the original Font, instead of Boolean Type Text ? – Edward Winch Jan 01 '20 at 23:19
You can ignore those. Just checks to make sure the number of rows and columns is correct and to demonstrate that after the groupby there are no duplicates left. You don't need them. – Bill Jan 01 '20 at 23:20
I.e. showing, like when you run the original Python Code, in Jupyter Notebook, only with the current changes ? – Edward Winch Jan 01 '20 at 23:25
I think you're referring to the `print(aircraft.head())` statement. `aircraft` is `pd.Series` not a `DataFrame`. That is why it displays like that. To see the 'pretty' version in a Jupyter notebook use `pd.DataFrame(aircraft)` – Bill Jan 01 '20 at 23:28
Once I know how to have all DHS, DS, and DH Rows shown, I will be able to mix and match accordingly, with the L i.e. Lancaster Equivalents. – Edward Winch Jan 01 '20 at 23:32
I see Bill, yes thats what I meant )) – Edward Winch Jan 01 '20 at 23:32
I see one problem with my solution. The data for `'TARRANT RUSHDEN'` is missing from `aircraft`. I think the groupby is ignoring it because of the `NaN` values in the date column. You might have to replace the NaN values e.g. using `fillna()` (if you are sure that the records are on the same date). – Bill Jan 01 '20 at 23:37
I have updated my Original Post Bill, if you could take a look for me ? – Edward Winch Jan 01 '20 at 23:57
Hi Bill, I have sorted out the fillna() Code, I will edit my Post accordingly tomorrow to show this. Once I run the Code, what do I then type Codewise, to keep only rows, when in the Aircraft Combined Column, there is a character D, and must contain at least two characters i.e. string length 2 or more but not DL ? – Edward Winch Jan 02 '20 at 03:12
Hi Bill, I have updated the Python Code, which is shown in my edited post, including the Line of Code for fillna(). If you could take a look that would be great. – Edward Winch Jan 02 '20 at 11:07
And also, how do I change in the Aircraft Combined Column, all values showing HSD to showing DSH and all SD to showing DS ? It doesn't look as good when it is showing backwards. – Edward Winch Jan 02 '20 at 11:18
I extended the answer to show how to do that but I think we're trying to do too much in one post here. In future, try to break down the separate issues and post as separate questions with shorter code snippets. – Bill Jan 02 '20 at 17:38
Thanks for the Info Bill, What Code should be typed,to keep only rows, when in the Aircraft Combined Column, there is a character D, and must contain at least two characters i.e. string length 2 or more but not DL ? Then do the sort ? – Edward Winch Jan 02 '20 at 20:00
When I add your Code to the other Code and run in Jupyter Notebook, the Characters are still the wrong way round, and some that were correct, are now the wrong way round. Is there code that can be typed, to say when the Characters ,either combination SD or DS they should always be shown as DS, and characters in any combination of DSH, should always show as DSH in the Aircraft-Combined Column etc ? Could I move this to chat ? As you say it is a very long Thread now ? Eddie – Edward Winch Jan 02 '20 at 20:00
First check to see if you can find answers to these additional questions on stackoverflow. For example [filter string based on its length](https://stackoverflow.com/questions/19937362/python-pandas-filter-string-data-based-on-its-string-length). Then if not, post them as new questions showing the new code. – Bill Jan 02 '20 at 21:21
Ok Bill will do. – Edward Winch Jan 02 '20 at 22:07
B.t.w. if you found my answer helpful, please upvote it. – Bill Jan 03 '20 at 19:04
Hi Bill, Could you check out, the following two Threads of mine, on the Python Forum ? I need help finishing of my Python Codes, if that is okay with you, to help me ? Here are the Thread links :- https://python-forum.io/Thread-Obtaining-Correct-Date-In-Pandas-DataFrame and :-. https://python-forum.io/Thread-Filtering-Excel-Document-Data-Based-On-Numerical-Values In that Thread, if you could look at the 12th Post, on page 2 in particular. I hope you can help me. Best Regards Eddie Winch :-) – Edward Winch Jan 24 '20 at 20:31

Filtering Excel Document Data In Jupyter Notebook Using Pandas

1 Answers1