1

I hope some more experienced python'ers are able to help me towards a more streamline approach for my current code.

WHAT I GOT/WANT:
I am having a table looking like this:

study id; rack position; box number; freeze-thaw cycles; new rack position; new box number; project code 
24; A1; 10001; 1; A2; 11040; 1,2 
25; B1; 10002; 0; A4; 11045; 1 
26; C2; 10003; 0; A5; 13420; 2

which I want to parse into the following format (id = study id, count = freeze-thaw cycles - and where multiple project codes are separated and put in separate rows):

id; field; count; value
24; rack position; 1; A1
24; box number; 1; 10001
24; new rack position; 1; A2
24; new box number; 1; 11040
24; project code; 1; 1
24; project code; 1; 2
25; rack position; 0; B1
25; box number; 0; 10002
25; new rack position; 0; A2
25; new box number; 0; 11040
25; project code; 0; 1
26; and so on...

HOW FAR I GOT:

# import pandas
import pandas as pd

# reading in the data
df = pd.read_table('data.csv', delimiter=';')

# rename "study id" and "freeze-thaw cycles"
df = df.rename(columns={'study id': 'id', 'freeze-thaw cycles': 'count'})

# splitting "project code"
df = df.join(df['project code'].str.split(',', 1, expand=True).rename(columns={0:'Project code1', 1:'Project code2'}))

# remove "project code" 
df = df.drop('project code', 1)

# Split the dataframe based on Project code1 and Project code2
df1 = df[['box number', 'rack position', 'id', 'count', 'new box number',
          'new Rack Position', 'Project code1']]
df2 = df[['box number', 'rack position', 'id', 'count', 'new box number',
          'new rack position', 'Project code2']]

# rename Project code1 and Project code2 to Project code
df1 = df1.rename(columns={'Project code1': 'Project code'})
df2 = df2.rename(columns={'Project code2': 'Project code'})

# concatenate the dataframes based on "Project code"
df = pd.concat([df1, df2], axis=0) #axis=0 for columns, axis=1 for rows

# convert the data frame into the long format
df = pd.melt(df, id_vars=['id', 'count'], var_name='field', value_name='value')

ISSUE:
The splitting of "project code" results in a number of rows with the value of 0, like this:

id; field; count; value
32; project code; 1; 0
33; project code; 1; 0
34; project code; 1; 0
35; project code; 1; 0

How do I fix this (preferably integrated into the code)? Is there a more streamline approach to this (my code seems pretty fragmented)?

Thank you!

Cheers, Birgitte

Biogitte
  • 37
  • 7
  • Possible duplicate of [Deleting DataFrame row in Pandas based on column value](https://stackoverflow.com/questions/18172851/deleting-dataframe-row-in-pandas-based-on-column-value) –  Aug 10 '18 at 11:07
  • If you have working code and you want to know how to make it better, consider whether [CodeReview.SE] would be appropriate. –  Aug 10 '18 at 11:07
  • @JETM Not sure this is working code though. There's an issue without fix open. – Mast Aug 10 '18 at 11:14

0 Answers0