How to filter out and extract data from a dataframe which contains multiple lines per row?

Question

I have a dataframe that looks something like this

   Detail
  --------------
0  Animal: Lion,
    Weight: 600 kg,
    Class: Mammal

1  Animal: Monkey,
    Weight: 10 kg,

2  Animal: Snake,
    Class: Reptile

3  Animal: Frog,
    Class: Amphibian,
    Weight: 1 kg

4  Animal: Hawk,
    Class: Bird

I need to read each row to see if a particular string is present in each line, then strip the line and add it to a list. If the value is not present, I need it to be appended with a null value. Then, I want to export this list as a csv file.

My desired output here is

   Animal
  --------------
0  Animal: Lion,
1  Animal: Monkey,
2  Animal: Snake,
3  Animal: Frog,
4  Animal: Hawk,


   Weight
  --------------
0  Weight: 600 kg,
1  Weight: 10 kg,
2  Weight: NaN
3  Weight: 1 kg,
4  Weight: NaN

   Class
  --------------
0  Class: Mammal,
1  Class: NaN
2  Class: Reptile
3  Class: Amphibian
4  Class: NaN

I need these 3 dataframes to be exported as a csv file.

I have written something like this, but don't think it is the right approach

cols= ['Animal','Weight','Class']


for entry in cols:
    values=[]
    for content in df['Detail']:
        if(df['Detail'].str.contains(entry)):
            var=content.str.split('\n')
        else:
            var= np.nan
        values.append(var)
    values_df = pd.DataFrame(values)
    values_df.to_csv('%s.csv' %entry,header=entry,index=False)

Have you tried- splitting at comma(',') and then extracting those values into separate columns ? — Satya, Jun 06 '22 at 10:20
Yea, I did but I didn't but didn't get the result I wanted though. — Marvin, Jun 06 '22 at 10:46

John Luther · Answer 1 · 2022-06-06T10:35:37.090

0

Please check this. Splitting a pandas dataframe column by delimiter I haven't run it, but it should be something like:

df[['Animal','Weight', 'Class']] = df['Detail'].str.split(',',expand=True)

If you want then to convert the column "Animal" to a list. you can do it like so:

arr = df["Animal"].to_numpy()

edited Jun 06 '22 at 10:35

answered Jun 06 '22 at 10:24

John Luther

1
1

score 0 · Accepted Answer · answered Jun 06 '22 at 11:00

In a way, your trial is correct, the only part you have missed is filtering the correct column values(as Animal, Weight, Class will not be in order)
#splitted at comma and removed extra space if any in all values

df['Detail'] = [[y.strip() for y in x.split(',')] for x in df['Detail'].values]
vals = []
for idx, row in df.iterrows():
    temp = {'Animal':np.nan,'Weight':np.nan,'Class':np.nan}
    for x in row['details']:
        for key in temp.keys():
            if x.startswith(key):
                temp[key]= x
    vals.append(temp)
final_df = pd.DataFrame(vals)
#print('final_df==', final_df)

Now you final_df will have 3 columns, and you can easily create separate dfs out of it by

final_df[['columnName']].to_csv('filaname.csv', index=False)

Let me know if it helps or in case you need more clarity on code.

Thanks, this works perfectly. But now I am working with a dataset which has 20 or more so lines. So, I tried to put all the column names to a list, then assigned a dictionairy with all values as NaN. I tried something like this temp={} for i in col_names: temp[i] = np.nan This change resulted in each column of final_df having the same values. Would really like to know, why this happened and if there is any soln for it. But still nevertheless, thanks this answer has helped me out so much. — Marvin, Jun 06 '22 at 12:41

How to filter out and extract data from a dataframe which contains multiple lines per row?

2 Answers2