How to create DataFrame output with some conditions?

Question

I have 2 list of string containing pitch value:

pitch_detected = ['A#3 / Bb3', 'B3', 'C4', 'C#4 / Db4', 'D4', 'D#4 / Eb4', 'E4', 'F4']
pitch_dataset = ['G#3 / Ab3', 'A#3 / Bb3', 'B3', 'C4', 'C#4 / Db4', 'D4', 'D#4 / Eb4', 'E4', 'F4', 'F#4 / Gb4', 'G4']

I need to print them at HTML page later. Right now, I use this code, and I face no problem.

  df = pd.DataFrame(data = [pitch_dataset, pitch_class])
  df.index = ['Pitch from dataset', 'Pitch detected']
  df = df.T
  table_result = df.to_html()

This is the current HTML output created from dataframe df.

Altough the output is looking nicely, I need to present it with some conditions. I think, this can be achieved by manipulating the DataFrame first before printing to HTML using to_html(). Here are the conditions:

If pitch X is present in both pitch_dataset and pitch_detected, the pitch X will be shown side-by-side.
If pitch X is present in pitch_dataset but not present in pitch_detected, the row in pitch_detected will be filled by -.
If pitch X is not present in pitch_dataset but present in pitch_detected, the pitch X will be shown in pitch_detected column, the extra rows in pitch_dataset will be filled by -.

Is there any way to achieve this condition with Python?

This is the sample desired output I want, generated by Microsoft Excel, with comments on condition. Condition 1 and 2 is the same as the output table given above, while condition 3 is manually made for the purpose of giving examples, with the pitch_detected list would be:

pitch_detected = ['A#3 / Bb3', 'B3', 'C4', 'C#4 / Db4', 'D4', 'D#4 / Eb4', 'E4', 'F4', 'A4', 'B4']

At the real output, I require no highlighting.

Edit: because I have received notification of a duplicate question here, I need to address that the problem I face is different from the suspected duplicate question.

Does this answer your question? [conditional column output for pandas dataframe](https://stackoverflow.com/questions/30321932/conditional-column-output-for-pandas-dataframe) — nyedidikeke, Apr 17 '21 at 10:56
Unfortunately not, it is different from my problem., where I need to do some formatting — Dionisius Pratama, Apr 17 '21 at 11:04
Would the detected pitch in the Excel example be represented as `pitch_detected = ['A#3 / Bb3', 'B3', 'C4', 'C#4 / Db4', 'D4', 'D#4 / Eb4', 'E4', 'F4', 'A4', 'B4]`? I have a solution for conditions 1 & 2, waiting for precisions concerning 3. — Guillaume Ansanay-Alex, Apr 17 '21 at 12:44
I understand from your comments on the answers that you're not interested in keeping the order? What happens if there should be repetitions of the same pitches? — Guillaume Ansanay-Alex, Apr 17 '21 at 13:12
@GuillaumeAnsanay-Alex for my case, there would be no case of same pitch repetition. Before the data presented with pitch letters (C, D, E, F, G, A, B), they are originally a list of MIDI numbers. The list has 250 values, and then I extract only unique values from that list. It's true that there will be repetition in the original MIDI number list, but when they are converted into pitch letters, there would be no repetition of a same pitch letters. — Dionisius Pratama, Apr 17 '21 at 13:17
@GuillaumeAnsanay-Alex, if you would like to know how the unique value of the list containing 250 MIDI numbers are converted into pitch letters, you can read this post [here](https://www.inspiredacoustics.com/en/MIDI_note_numbers_and_center_frequencies). — Dionisius Pratama, Apr 17 '21 at 13:22
See my edit on simplified coding. You are free to choose whatever version of the codings. — SeaBean, Apr 17 '21 at 17:36

score 1 · Answer 1 · answered Apr 17 '21 at 12:58

One solution is to use Python datastructures since your conditions are not well suited for pandas table/vector logic.

    intersection = list(set(pitch_detected) & set(pitch_dataset))
    fillvalue = '-'
    intersected_data = [(val, val) for val in intersection]
    dataset_data = [(val, fillvalue) for val in pitch_dataset if val not in intersection]
    detected_data = [(fillvalue, val) for val in pitch_detected if val not in intersection]
    data = intersected_data + dataset_data + detected_data
    df = pd.DataFrame(data, columns=['Pitch from dataset', 'Pitch detected'])

First, you build the intersection of the data as described in this answer.

Then you build three lists:

list of all intersected items
list of items only present in pitch_datset
list of items only present in pitch_detected

Finally, you join all lists and build your dataframe.

My result looks like this:

   Pitch from dataset Pitch detected
0                  F4             F4
1           D#4 / Eb4      D#4 / Eb4
2           A#3 / Bb3      A#3 / Bb3
3           C#4 / Db4      C#4 / Db4
4                  D4             D4
5                  E4             E4
6                  B3             B3
7                  C4             C4
8           G#3 / Ab3              -
9           F#4 / Gb4              -
10                 G4              -
11                  -             A4
12                  -             B4

Thank you! Your solution worked perfectly. Anyway, if I may add something, because the pitch octave (written 3 and 4 on the example) is kinda important, I added ```intersected_data.sort(), dataset_data.sort(), detected_data.sort()``` to your code. — Dionisius Pratama, Apr 17 '21 at 13:08

SeaBean · Accepted Answer · 2021-04-17T17:34:14.523

You can merge df with itself and extract the 2 columns with aligned Pitch X (side by side). Then replace NaN values with -, as follows:

df2 = df.merge(df, left_on='Pitch detected', right_on='Pitch from dataset', how='outer', indicator=True)[['Pitch from dataset_y', 'Pitch detected_x']]

df3 = df2.dropna(how='all').fillna('-').rename(columns={'Pitch from dataset_y': 'Pitch from dataset', 'Pitch detected_x': 'Pitch detected'})




print(df3)

   Pitch from dataset Pitch detected
0           A#3 / Bb3      A#3 / Bb3
1                  B3             B3
2                  C4             C4
3           C#4 / Db4      C#4 / Db4
4                  D4             D4
5           D#4 / Eb4      D#4 / Eb4
6                  E4             E4
7                  F4             F4
8                   -             A4
9                   -             B4
11          G#3 / Ab3              -
12          F#4 / Gb4              -
13                 G4              -

Here, let me breakdown the steps for your reference:

With just the merge step, we get:

df.merge(df, left_on='Pitch detected', right_on='Pitch from dataset', how='outer', indicator=True)



   Pitch from dataset_x Pitch detected_x Pitch from dataset_y Pitch detected_y      _merge
0             G#3 / Ab3        A#3 / Bb3            A#3 / Bb3               B3        both
1             A#3 / Bb3               B3                   B3               C4        both
2                    B3               C4                   C4        C#4 / Db4        both
3                    C4        C#4 / Db4            C#4 / Db4               D4        both
4             C#4 / Db4               D4                   D4        D#4 / Eb4        both
5                    D4        D#4 / Eb4            D#4 / Eb4               E4        both
6             D#4 / Eb4               E4                   E4               F4        both
7                    E4               F4                   F4               A4        both
8                    F4               A4                  NaN              NaN   left_only
9             F#4 / Gb4               B4                  NaN              NaN   left_only
10                   G4             None                  NaN              NaN   left_only
11                  NaN              NaN            G#3 / Ab3        A#3 / Bb3  right_only
12                  NaN              NaN            F#4 / Gb4               B4  right_only
13                  NaN              NaN                   G4             None  right_only

The middle 2 columns already have what we want.

The other steps are just to extract and reformat the middle 2 columns to your desired format, detailed bellow:

df2 just extracts the middle middle 2 columns from the merge result

Then, we drop those rows (of middle 2 columns) with all None / NaN values.

Replace NaN values with - and then rename the columns without the _x, _y.

Edit

The codes can be further simplified and at the same time the output more resembling to the desired output, as follows:

(df[['Pitch from dataset']].merge(
    df[['Pitch detected']], 
    left_on='Pitch from dataset', 
    right_on='Pitch detected', 
    how='outer')
).dropna(how='all').fillna('-')

Note that we use double square brackets [[...]] here for the 2 columns during .merge(). We are, in effect, merging 2 dataframes each with one column only. Thus, save us subsequent effort on renaming merged columns.

Result:

   Pitch from dataset Pitch detected
0           G#3 / Ab3              -
1           A#3 / Bb3      A#3 / Bb3
2                  B3             B3
3                  C4             C4
4           C#4 / Db4      C#4 / Db4
5                  D4             D4
6           D#4 / Eb4      D#4 / Eb4
7                  E4             E4
8                  F4             F4
9           F#4 / Gb4              -
10                 G4              -
11                  -             A4
12                  -             B4

@DionisiusPratama Yes, `df.merge()` is a handy function to explore and use. With its `indicator=True` option, you can easily see from which dataframe the merged rows are from. Just make good use of it. — SeaBean, Apr 17 '21 at 14:32

Corralien · Answer 3 · 2021-04-17T15:33:27.757

1

Another method:

pitch_detected = ['A#3 / Bb3', 'B3', 'C4', 'C#4 / Db4', 'D4', 'D#4 / Eb4', 'E4', 'F4', 'A4', 'B4']
pitch_dataset = ['G#3 / Ab3', 'A#3 / Bb3', 'B3', 'C4', 'C#4 / Db4', 'D4', 'D#4 / Eb4', 'E4', 'F4', 'F#4 / Gb4', 'G4']

sr1 = pd.Series(pitch_dataset, name="Pitch from dataset").sort_values()
sr2 = pd.Series(pitch_detected, name="Pitch detected").sort_values()

df = pd.concat([sr1.loc[sr1.isin(sr2)].reset_index(drop=True),
                sr2.loc[sr2.isin(sr1)].reset_index(drop=True)], axis="columns")

df = df.append(sr1.loc[~sr1.isin(sr2)].reset_index(drop=True).to_frame())
df = df.append(sr2.loc[~sr2.isin(sr1)].reset_index(drop=True).to_frame())

df = df.reset_index(drop=True).fillna("-")  # <- replace NaN by "-"

Output:

>>> df 
   Pitch from dataset Pitch detected
0           A#3 / Bb3      A#3 / Bb3
1                  B3             B3
2                  C4             C4
3           C#4 / Db4      C#4 / Db4
4                  D4             D4
5           D#4 / Eb4      D#4 / Eb4
6                  E4             E4
7                  F4             F4
8           G#3 / Ab3              -
9           F#4 / Gb4              -
10                 G4              -
11                  -             A4
12                  -             B4

edited Apr 17 '21 at 15:33

answered Apr 17 '21 at 13:10

Corralien

109,409
8
28
52

Where should I use ```fillna('-')``` to replace ```NaN``` values? – Dionisius Pratama Apr 17 '21 at 13:40
I tried adding ```df.fillna('-')``` after ```df.reset_index```, but it still returns ```NaN``` values. – Dionisius Pratama Apr 17 '21 at 13:58
1

This solution is also nice! But it needs the matching items in the 2 lists being in same sequence (this is quite possible as the sample lists are already not kept in sorted sequences). Otherwise, the `pd.concat()` result won't show the matching items side by side. – SeaBean Apr 17 '21 at 15:22
@SeaBean, you are absolutely right! I have tried to shuffle the two lists and my solution is no longer consistent. I have to sort by values the two series. – Corralien Apr 17 '21 at 15:32
Good! Frankly speaking, I like your solution too! Please don't mind, I just used to review other people's solutions too so that I can learn from others too! :-) – SeaBean Apr 17 '21 at 16:16
Thank you! @Corralien This worked nicely too now. – Dionisius Pratama Apr 18 '21 at 08:01

How to create DataFrame output with some conditions?

3 Answers3

Edit