I hope some more experienced python'ers are able to help me towards a more streamline approach for my current code.
WHAT I GOT/WANT:
I am having a table looking like this:
study id; rack position; box number; freeze-thaw cycles; new rack position; new box number; project code
24; A1; 10001; 1; A2; 11040; 1,2
25; B1; 10002; 0; A4; 11045; 1
26; C2; 10003; 0; A5; 13420; 2
which I want to parse into the following format (id = study id, count = freeze-thaw cycles - and where multiple project codes are separated and put in separate rows):
id; field; count; value
24; rack position; 1; A1
24; box number; 1; 10001
24; new rack position; 1; A2
24; new box number; 1; 11040
24; project code; 1; 1
24; project code; 1; 2
25; rack position; 0; B1
25; box number; 0; 10002
25; new rack position; 0; A2
25; new box number; 0; 11040
25; project code; 0; 1
26; and so on...
HOW FAR I GOT:
# import pandas
import pandas as pd
# reading in the data
df = pd.read_table('data.csv', delimiter=';')
# rename "study id" and "freeze-thaw cycles"
df = df.rename(columns={'study id': 'id', 'freeze-thaw cycles': 'count'})
# splitting "project code"
df = df.join(df['project code'].str.split(',', 1, expand=True).rename(columns={0:'Project code1', 1:'Project code2'}))
# remove "project code"
df = df.drop('project code', 1)
# Split the dataframe based on Project code1 and Project code2
df1 = df[['box number', 'rack position', 'id', 'count', 'new box number',
'new Rack Position', 'Project code1']]
df2 = df[['box number', 'rack position', 'id', 'count', 'new box number',
'new rack position', 'Project code2']]
# rename Project code1 and Project code2 to Project code
df1 = df1.rename(columns={'Project code1': 'Project code'})
df2 = df2.rename(columns={'Project code2': 'Project code'})
# concatenate the dataframes based on "Project code"
df = pd.concat([df1, df2], axis=0) #axis=0 for columns, axis=1 for rows
# convert the data frame into the long format
df = pd.melt(df, id_vars=['id', 'count'], var_name='field', value_name='value')
ISSUE:
The splitting of "project code" results in a number of rows with the value of 0, like this:
id; field; count; value
32; project code; 1; 0
33; project code; 1; 0
34; project code; 1; 0
35; project code; 1; 0
How do I fix this (preferably integrated into the code)? Is there a more streamline approach to this (my code seems pretty fragmented)?
Thank you!
Cheers, Birgitte