
I have a list of strings formatted as:

testing_set = ["001,P01", "002,P01,P02", "003,P01,P02,P09", "004,P01,P03"]

I used re to split each string on the commas:

[in] test_set1 = [ re.split(r',', line, maxsplit=5) for line in testing_set]

[out] ["001","P01"]

How can I create a DataFrame where the index is the transaction_id ("001", "002", "003", "004") and the P-values for each line are listed in a product_id column?

DJK
zsad512

1 Answer


This can be done as follows:

import re
import pandas as pd

testing_set = ["001,P01", "002,P01,P02", "003,P01,P02,P09", "004,P01,P03"]

# maxsplit=1 splits only at the first comma: [transaction_id, "P01,P02,..."]
test_set1 = [re.split(r',', line, maxsplit=1) for line in testing_set]

df = pd.DataFrame(test_set1, columns=['transaction_id', 'product_id'])
df.set_index(['transaction_id'], inplace=True)
# turn the remaining comma-separated string into a list of product ids
df['product_id'] = df['product_id'].apply(lambda row: row.split(','))

This gives you a DataFrame like this:

                     product_id
transaction_id                 
001                       [P01]
002                  [P01, P02]
003             [P01, P02, P09]
004                  [P01, P03]
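
If the products are wanted as separate strings in their own columns rather than as a list in one cell, one possible variant is sketched below. It is not part of the original answer: it uses plain str.split instead of re, and the column names product_1, product_2, ... are hypothetical labels chosen for illustration.

```python
import pandas as pd

testing_set = ["001,P01", "002,P01,P02", "003,P01,P02,P09", "004,P01,P03"]

# Split each line once: transaction id on the left, remaining products on the right.
rows = [line.split(",", 1) for line in testing_set]
df = pd.DataFrame(rows, columns=["transaction_id", "product_id"])
df = df.set_index("transaction_id")

# expand=True spreads the comma-separated products over one column per position;
# rows with fewer products are padded with missing values.
wide = df["product_id"].str.split(",", expand=True)

# Hypothetical column labels, not taken from the question.
wide.columns = [f"product_{i + 1}" for i in range(wide.shape[1])]

print(wide)
```

The index keeps the name transaction_id because set_index carries the column name over.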
DJK
  • How can I further split it so that each P-value is a separate string, but still on the same line, so that 002 would have two product_id strings instead of one? Also, how can I label the index as "transaction_id"? – zsad512 Jul 29 '17 at 19:03
  • There is a typo in the code (df.set_index(['transaction_id'],inplace=True])): it has an extra ]. But the code worked, thank you! Now I have to create a matrix based on this DataFrame, with 1s if a product is in a particular basket and 0s otherwise (for columns "P1-P10"). Do you know how I could do that? – zsad512 Jul 29 '17 at 19:47
  • That's really another question entirely. I would look into [one-hot encoding](http://pbpython.com/categorical-encoding.html) – DJK Jul 29 '17 at 19:54
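
For the 0/1 basket matrix raised in the comments, a minimal sketch of one-hot encoding with pandas might look like the following. It assumes the product labels run from P01 to P10 in the same zero-padded format as the sample data (the comment writes "P1-P10", so the exact labels are a guess), and it works from the comma-separated string rather than the list column.

```python
import pandas as pd

testing_set = ["001,P01", "002,P01,P02", "003,P01,P02,P09", "004,P01,P03"]

# Split each line once: transaction id on the left, remaining products on the right.
rows = [line.split(",", 1) for line in testing_set]
df = pd.DataFrame(rows, columns=["transaction_id", "product_id"])
df = df.set_index("transaction_id")

# str.get_dummies splits on the separator and builds one 0/1 column
# for every distinct product that occurs in the data.
matrix = df["product_id"].str.get_dummies(sep=",")

# Assumed label format: P01 ... P10. Products that never occur get an all-zero column.
all_products = [f"P{i:02d}" for i in range(1, 11)]
matrix = matrix.reindex(columns=all_products, fill_value=0)

print(matrix)
```

The reindex step is only needed when the matrix must contain a fixed set of columns; without it, the result has columns only for products that appear in at least one basket.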