I have a CSV file with two columns: one with the name of a person and the other with the words that person defined. The problem is that the second column often holds several words separated by punctuation marks. I need to split these words so that each row holds one person and a single word. The input looks like this:
name,word
Oliver,"water,surf,windsurf"
Tom,"football, striker, ball"
Anna,"mountain;wind;sun"
Sara,"basketball; nba; ball"
Mark,"informatic/web3.0/e-learning"
Christian,"doctor - medicine"
Sergi,"runner . athletics"
This is a sample of the CSV data. As you can see, the words are separated by different punctuation marks (the real file has a few more), sometimes with a space around the separator and sometimes without. The result I would like to achieve is:
name,word
Oliver,water
Oliver,surf
Oliver,windsurf
Tom,football
Tom,striker
Tom,ball
Anna,mountain
Anna,wind
Anna,sun
Sara,basketball
Sara,nba
Sara,ball
Mark,informatic
Mark,web3.0
Mark,e-learning
Christian,doctor
Christian,medicine
Sergi,runner
Sergi,athletics
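For anyone who wants to reproduce this, here is a minimal sketch of loading the sample. The real file path is not shown here, so `SAMPLE_CSV` and `io.StringIO` stand in for it; those names are only for illustration.

```python
import io

import pandas as pd

# Stand-in for the real CSV file, copied from the sample rows above.
SAMPLE_CSV = """name,word
Oliver,"water,surf,windsurf"
Tom,"football, striker, ball"
Anna,"mountain;wind;sun"
Sara,"basketball; nba; ball"
Mark,"informatic/web3.0/e-learning"
Christian,"doctor - medicine"
Sergi,"runner . athletics"
"""

# The quoted fields keep their inner commas, so each row still has two columns.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
print(df)
```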
I have loaded the file into a pandas DataFrame, and that is where I need to split the data. What I have tried is:
def splitter(df):
    # Split on one separator variant at a time, exploding after each split
    df['word'] = df['word'].str.split(",")
    df = df.explode("word")
    df['word'] = df['word'].str.split(", ")
    df = df.explode("word")
    df['word'] = df['word'].str.split(" , ")
    df = df.explode("word")
    df['word'] = df['word'].str.split("- ")
    df = df.explode("word")
    df['word'] = df['word'].str.split(" -")
    df = df.explode("word")
    # Raw string so the escaped dot does not trigger an invalid-escape warning
    df['word'] = df['word'].str.split(r"\. ")
    df = df.explode("word")
    df['word'] = df['word'].str.split(";")
    df = df.explode("word")
    df['word'] = df['word'].str.split("; ")
    df = df.explode("word")
    df['word'] = df['word'].str.split(" ;")
    df = df.explode("word")
    df['word'] = df['word'].str.split(" ; ")
    df = df.explode("word")
    df['word'] = df['word'].str.split("/ ")
    df = df.explode("word")
    return df
The result is almost what I want, but some words come out with a leading space that should not be there:
name,word
Oliver,water
Oliver,surf
Oliver,windsurf
Tom,football
Tom, striker
Tom, ball
Anna,mountain
Anna,wind
Anna,sun
Sara,basketball
Sara, nba
Sara, ball
Mark,informatic
Mark,web3.0
Mark,e-learning
Christian,doctor
Christian, medicine
Sergi,runner
Sergi, athletics
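A quick check on a single value suggests where the spaces come from: splitting on "," first leaves the space attached to the following word, so the later split on ", " has nothing left to match. A minimal sketch (the one-row Series is made up for illustration):

```python
import pandas as pd

# One value from the sample, wrapped in a Series for the test.
s = pd.Series(["football, striker, ball"])

# Same order as in splitter(): bare comma first, then comma plus space.
first = s.str.split(",").explode()
second = first.str.split(", ").explode()

print(second.tolist())  # ['football', ' striker', ' ball']
```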
How can I fix this and improve the code above so that every word is split on all the separators and comes out without extra spaces?
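One direction that might work is a single regular-expression split followed by a strip, instead of one split per separator. The sketch below has not been run against the full file. `split_words` and `SEPARATORS` are names made up here, and the DataFrame is rebuilt by hand from the sample rows. The pattern assumes that `,`, `;` and `/` always separate words, while `-` and `.` only do so when they have whitespace on both sides, so that `web3.0` and `e-learning` stay intact.

```python
import re

import pandas as pd

# Assumption: , ; / always separate words (with optional spaces around them);
# - and . separate words only when surrounded by whitespace.
SEPARATORS = re.compile(r"\s*[,;/]\s*|\s+[-.]\s+")


def split_words(df):
    """Return a copy of df with one word per row in the 'word' column."""
    out = df.copy()
    # Turn each cell into a list of words; missing values are left alone.
    out["word"] = out["word"].str.strip().map(SEPARATORS.split, na_action="ignore")
    # One row per list element, repeating the name.
    out = out.explode("word")
    # Remove any leftover whitespace and drop empty strings.
    out["word"] = out["word"].str.strip()
    out = out[out["word"] != ""]
    return out.reset_index(drop=True)


# Sample rows from the question, rebuilt by hand.
df = pd.DataFrame(
    {
        "name": ["Oliver", "Tom", "Anna", "Sara", "Mark", "Christian", "Sergi"],
        "word": [
            "water,surf,windsurf",
            "football, striker, ball",
            "mountain;wind;sun",
            "basketball; nba; ball",
            "informatic/web3.0/e-learning",
            "doctor - medicine",
            "runner . athletics",
        ],
    }
)

result = split_words(df)
print(result.to_csv(index=False))
```

Any other punctuation marks in the real file would have to be added to the pattern, and the whitespace rule for `-` and `.` may need loosening if some rows use them with a space on only one side.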