I've been searching Stack Overflow for an answer to this question, and while the solutions presented here and here, for example, make logical sense, I can't for the life of me implement them.

I am very new to Python, and while I know that I can do this in Excel super quick, I want to learn how to do it in Python so I'm not relying on Excel in the future.

Here's my current code (these are spread across different cells for my benefit when learning what I've written so I apologise if they're a little jarring to read):

## SECOND STEP: IMPORT CSVs INTO DATA FRAMES 
# import module
import pandas as pd
  
# read dataset
df1 = pd.read_csv("./csvs/Data1.csv")
df2 = pd.read_csv("./csvs/Data2.csv")

## FOURTH STEP - MERGE DATA FRAMES INTO 1 DATA SET
# Merge df1 and df2 with the merge function, using the common column 'Title'
# We use a left join as df2 contains the additional information we need in df1
df3 = pd.merge(df1, df2, on='Title', how="left")


## FIFTH STEP - SPLIT COLUMN 'Genres'
pd.concat([df3[[0]], df3['Genres'].str.split(',', expand=True)], axis=1)

The merged data from the 4th Step would look something like this (Basic Table Example with relative Column Headers):

[Image: example table of the merged data]

I'm sure that what I'm doing wrong is fixable, but I'd really appreciate some help figuring out why it isn't working.

H Sa
Aemonar

1 Answer

import numpy as np
import pandas as pd

# generate test data
df = pd.DataFrame(
    {
        'A': np.random.choice(100, 3),
        'B': ['a,b,c', 'x,y', 'q,r'],
        'C': [1,2,5]
    })
print(df)
print('------')

# concatenate columns A and C with the pieces of the split column B
# and store the output in a new DataFrame
df2 = pd.concat([df[['A','C']], df['B'].str.split(',', expand=True)], axis=1)
print(df2)

Output:

    A      B  C
0  28  a,b,c  1
1   4    x,y  2
2   7    q,r  5
------
    A  C  0  1     2
0  28  1  a  b     c
1   4  2  x  y  None
2   7  5  q  r  None
S2L
  • I've not seen code like that before (newbie!) so just to make sure I understand it, can I break it down? `df = pd.DataFrame( { 'A': np.random.choice(100, 3), 'B': ['a,b,c', 'x,y', 'q,r'] })` is essentially a shorthand way of creating a dataframe from two different sources (like what I have in the 2nd Step)? If so, that makes your amendment and updated code `df2 = pd.concat([df['A'], df['B'].str.split(',', expand=True)], axis=1)` make sense to me. – Aemonar Aug 15 '21 at 05:16
  • I tried to amend my code to incorporate yours and it failed, so I mustn't be getting it. `# import dataset df = pd.DataFrame( { 'A': pd.read_csv("./csvs/Data1.csv"), 'B': pd.read_csv("./csvs/Data2.csv") })` – Aemonar Aug 15 '21 at 05:22
  • I included the full example in case you (or someone later) want to try it out. You only need to modify the pd.concat line. I will try to post more explanation soon. – S2L Aug 15 '21 at 05:26
  • Note that the first item in the list passed to pd.concat is a DataFrame containing all columns except the one you want to split. – S2L Aug 15 '21 at 05:30
  • Ah, I may not have been clear. DF3 in my code is the final DataFrame that I want to use, and I want to split the Genres column in DF3 into separate columns that are then added back into DF3. Looking at your code and what you've said, I can fully see that my code is not going to do what I wanted it to do, so is there a way to split the column in the existing DF3 and then add the entries as separate columns (like Genres1, Genres2, etc.) back into it? – Aemonar Aug 15 '21 at 05:47
  • You can assign the output of concat back to df3. Just remember that pd.concat returns a new DataFrame. – S2L Aug 15 '21 at 05:48
  • I went with the new DF and your suggestion totally worked, thank you! – Aemonar Aug 15 '21 at 06:09
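
To get named columns such as `Genres1`, `Genres2` back into `df3`, as asked in the comments, one option is to rename the split columns before concatenating and then assign the result back to `df3`. This is only a sketch, assuming `df3` has a comma-separated `Genres` column; the sample data is hypothetical:

```python
import pandas as pd

# Hypothetical merged frame standing in for df3
df3 = pd.DataFrame({
    'Title': ['Film A', 'Film B'],
    'Genres': ['Action,Comedy,Drama', 'Horror'],
})

# Split into one column per genre; the new columns are labelled 0, 1, 2, ...
genres = df3['Genres'].str.split(',', expand=True)

# Rename 0, 1, 2 to Genres1, Genres2, Genres3
genres.columns = [f'Genres{i + 1}' for i in genres.columns]

# pd.concat returns a new DataFrame, so assign it back to df3
df3 = pd.concat([df3.drop(columns='Genres'), genres], axis=1)
print(df3)
```

Rows with fewer genres than the longest row get missing values in the trailing `Genres` columns.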