I have a non normalised.db file and I need to create a dataframe df_exams using Pandas

Question

I have a non normalised.db file and I need to create a dataframe df_exams from the column 'Exams' of the non-normalised db file. Now the column exams of the non-normalised db file looks like this:

The original non-normalised file has columns of Student ID and Exams like this:

Student ID	Exams
1	exam7 (2017), exam9 (2018), exam3 (2018),...
2	exam2(2017), exam2(2017), exam8 (2018),...
3	exam7 (2017), exam9 (2018), exam3 (2018),...

And I need it like

Student ID	Exam	Year
1	exam7	2017
1	exam9	2018
1	exam3	2018

and so on. I am fairly new to python and appreciate the help.

I had written this code:

df_exams[['Exams','Year']]= df_exams.Exams.str.extract('(.)\s\((.\d+)', expand=True)

This does not produce the desired output.

C.K. · Answer 1 · 2022-12-19T00:12:04.690

0

With your data here:

import pandas as pd
test_data =[
    [1,"exam7 (2017), exam9 (2018), exam3 (2018)"],
    [2,"exam2(2017), exam2(2017), exam8 (2018)"],
    [3,"exam7 (2017), exam9 (2018), exam3 (2018)"]
]
df = pd.DataFrame(test_data,columns=["Student ID","Exams"])

You can do it like this:

result = df.Exams.str.extractall(
    r"([^,]*?)\s*\((\d+)\)"
).reset_index(
    level=["match"],drop=True
).reset_index().rename(
    columns={
        0:"Exam",
        1:"Year",
        "index":"Student ID"
    }
)

And the result is:

    Student ID  Exam    Year
0   0   exam7   2017
1   0   exam9   2018
2   0   exam3   2018
3   1   exam2   2017
4   1   exam2   2017
5   1   exam8   2018
6   2   exam7   2017
7   2   exam9   2018
8   2   exam3   2018

edited Dec 19 '22 at 00:12

answered Dec 18 '22 at 23:48

C.K.

1,409
10
20

sorry, but this is not giving the desired output. It just returns the same thing – Abhipriyo Banerjee Dec 19 '22 at 00:02
@AbhipriyoBanerjee I added my output in the answer. I don't know if you really tried my code. Your code didn't work as you want because you use `str.extract` instead of `str.extractall` and your regex is not very suitable for that. Take a try with my code. – C.K. Dec 19 '22 at 00:13
Hi, I tried your entire code. But it didn't work for me. Might be I implemented it wrong somehow because I didn't get the same output as yours. – Abhipriyo Banerjee Dec 19 '22 at 01:36

score 0 · Accepted Answer · answered Dec 18 '22 at 23:53

I recreated your dataframe using

import pandas as pd

my_df = pd.DataFrame({'Student ID': [1, 2, 3], "Exams": ["exam7 (2017), exam9 (2018), exam3 (2018)"]*3})

then I split your Exams column

my_df["Exams"] =my_df["Exams"].str.split(",")

this puts list of exams in Exams column.

Then I explode the column to make multiple rows:

my_df = my_df.explode("Exams")

which gives:

   Student ID          Exams
0           1   exam7 (2017)
0           1   exam9 (2018)
0           1   exam3 (2018)
1           2   exam7 (2017)
1           2   exam9 (2018)
1           2   exam3 (2018)
2           3   exam7 (2017)
2           3   exam9 (2018)
2           3   exam3 (2018)

Now I extract the values required using regex:

my_df["Year"] = my_df["Exams"].str.extract(r"\((\d{4})\)")
my_df["Exam"] = my_df["Exams"].str.extract(r"(exam\d{1,2})")

Which gives me a dataframe like this:

   Student ID          Exams  Year   Exam
0           1   exam7 (2017)  2017  exam7
0           1   exam9 (2018)  2018  exam9
0           1   exam3 (2018)  2018  exam3
1           2   exam7 (2017)  2017  exam7
1           2   exam9 (2018)  2018  exam9
1           2   exam3 (2018)  2018  exam3
2           3   exam7 (2017)  2017  exam7
2           3   exam9 (2018)  2018  exam9
2           3   exam3 (2018)  2018  exam3

You can filter the columns you want like:

my_cols = my_df[["Stzudent ID", "Exam", "Year"]]

and your dataframe will look like what you have posted here.

Thanks a lot. You saved me. I was stuck in this for three days almost. — Abhipriyo Banerjee, Dec 19 '22 at 01:35
Hi, I had one doubt more.. if I had an additional column named scores having data like 68,70,52 for each cell value in exams exam7 (2017), exam9 (2018), exam3 (2018), would it be possible to enlist the scores for each exam and year as done previously? — Abhipriyo Banerjee, Dec 19 '22 at 01:50
Hi Abhipriyo, you can indeed do it, instead of exploding only "exams" like I have done here, you can set index to your Student ID and explode other columns like here https://stackoverflow.com/questions/45846765/efficient-way-to-unnest-explode-multiple-list-columns-in-a-pandas-dataframe . — Bijay Regmi, Dec 19 '22 at 09:40

I have a non normalised.db file and I need to create a dataframe df_exams using Pandas

2 Answers2