How to split CSV content by punctuation marks

Question

I have a csv file with two columns, one with the name of a person and the other with words defined by the person, the problem is that in this column there are many words that are separated by punctuation marks. I need to separate these words so that each person only has one word per column, that is:

name,word
Oliver,"water,surf,windsurf"
Tom,"football, striker, ball"
Anna,"mountain;wind;sun"
Sara,"basketball; nba; ball"
Mark,"informatic/web3.0/e-learning"
Christian,"doctor - medicine"
Sergi,"runner . athletics"

These are an example of the CSV data. As you can see, there are data separated by different punctuation marks (there are still some more) where they are separated by a space and others that are not. The result I would like to achieve is:

name,word
Oliver,water
Oliver,surf
Oliver,windsurf
Tom,football
Tom,stricker
Tom,ball
Anna,mountain
Anna,wind
Anna,sun
Sara,basketball
Sara,nba
Sara,ball
Mark,informatic
Mark,web3.0
Mark,e-learning
Christian,doctor
Christian,medicine
Sergi,runner
Sergi,athletics

I have opened the file using pandas where I have created a dataframe with the data and this is where I have to separate the data. What I have tried is:

def splitter(df):

    df['word'] = df['word'].str.split(",")
    df = df.explode("word")

    df['word'] = df['word'].str.split(", ")
    df = df.explode("word")

    df['word'] = df['word'].str.split(" , ")
    df = df.explode("word")

    df['word'] = df['word'].str.split("- ")
    df = df.explode("word")

    df['word'] = df['word'].str.split(" -")
    df = df.explode("word")

    df['word'] = df['word'].str.split("\. ")  
    df = df.explode("word")

    df['word'] = df['word'].str.split(";")
    df = df.explode("word")

    df['word'] = df['word'].str.split("; ")
    df = df.explode("word")

    df['word'] = df['word'].str.split(" ;")
    df = df.explode("word")

    df['word'] = df['word'].str.split(" ; ")
    df = df.explode("word")

    df['word'] = df['word'].str.split("/ ")
    df = df.explode("word")

return df

The result I get is the one I want but with some spaces and they don't have to appear:

name,word
Oliver,water
Oliver,surf
Oliver,windsurf
Tom,football
Tom, stricker
Tom, ball
Anna,mountain
Anna,wind
Anna,sun
Sara,basketball
Sara, nba
Sara, ball
Mark,informatic
Mark,web3.0
Mark,e-learning
Christian,doctor
Christian, medicine
Sergi,runner
Sergi, athletics

How could I solve this problem and improve the code I have put in, since I do not know how to modify it so that everything works correctly?

I tried it together with `regex` but it didn't work. For this reason I tried with `split()`. — drakitus98, Jan 20 '21 at 21:18
did you try to use `strip()` after all your code ? - ie, `df.str.strip()` . — furas, Jan 20 '21 at 21:19

furas · Answer 1 · 2021-01-22T14:29:41.590

Simply

df['word'] = df['word'].str.strip()

and it should remove all spaces, tabs and new lines from both sides of text.

BTW:

Probably you could even use split(";") without split("; "), split(" ;"), etc. because strip() will remove these spaces.

If you want to use variants like split(";"), split("; "), split(" ;"), split(" ; ") then you should start with the longest split(" ; ") and later use shorter split("; "), split(" ;") and at the end the shortest split(";") - and this way maybe you could remove spaces.

You could even try to use only one split('[;,-./]') instead all of split()

df = df['word'].str.split('[;,-./]').explode().str.strip()

Eventually you could use | as OR

EDIT:

Minimal working example with data directly in code - so everyone can test it.

import pandas as pd
import io

text = '''name,word
Oliver,"water,surf,windsurf"
Tom,"football, striker, ball"
Anna,"mountain;wind;sun"
Sara,"basketball; nba; ball"
Mark,"informatic/web3.0/e-learning"
Christian,"doctor - medicine"
Sergi,"runner . athletics"'''

# text to dataframe
df = pd.read_csv(io.StringIO(text))

df['word'] = df['word'].str.split('[;,/]|\. |- | -')
df = df.explode('word')
df['word'] = df['word'].str.strip()

# dataframe to text
output = io.StringIO()
df.to_csv(output, index=False)
output.seek(0)
text = output.read() 

print(text)

Result:

name,word
Oliver,water
Oliver,surf
Oliver,windsurf
Tom,football
Tom,striker
Tom,ball
Anna,mountain
Anna,wind
Anna,sun
Sara,basketball
Sara,nba
Sara,ball
Mark,informatic
Mark,web3.0
Mark,e-learning
Christian,doctor
Christian,medicine
Sergi,runner
Sergi,athletics

EDIT:

The same without strip().

I use ' ?' to get optional space after chars ;,/ and before char .

I also use ' - ' before '- ' and ' -' to find the longest version.

df['word'] = df['word'].str.split('[;,/] ?| ?\. | - |- | -')
df = df.explode('word')

EDIT:

Example which use replacements to keep (data, science) as one string without spliting.

import pandas as pd
import io

text = '''name,word
Oliver,"water,surf,windsurf"
Tom,"football, striker, ball"
Anna,"mountain;wind;sun"
Sara,"basketball; nba; ball; (date1, time1)"
Mark,"informatic/web3.0/e-learning"
Christian,"doctor - medicine - (date2, time2) - date3, time3"
Sergi,"runner . athletics"'''

# text to dataframe
df = pd.read_csv(io.StringIO(text))


# Find all `(...)`
found = df['word'].str.findall(r'\(.*?\)')
print(found)

# Flatten it
found = sum(found, [])
print(found)

# Create dict to put pattern in place of `(...)`.
# Because later I will use `regex=True` so I have to use `\(...\)` instead of `(...)`
patterns = {f'\({value[1:-1]}\)':f'XXX{i}' for i, value in enumerate(found)}
print(patterns)

df['word'] = df['word'].replace(patterns, regex=True)

# --- nromal spliting ---


df['word'] = df['word'].str.split('[;,/]|\. |- | -')
df = df.explode('word')
df['word'] = df['word'].str.strip()

# Create dict to put later `(...)` in place of pattern.
patterns_back = {f'XXX{i}':value for i, value in enumerate(found)}
print(patterns_back)

df['word'] = df['word'].replace(patterns_back, regex=True)

# dataframe to text
output = io.StringIO()
df.to_csv(output, index=False)
output.seek(0)
text = output.read() 

print(text)

Result:

0                  []
1                  []
2                  []
3    [(date1, time1)]
4                  []
5    [(date2, time2)]
6                  []
Name: word, dtype: object

['(date1, time1)', '(date2, time2)']

{'\\(date1, time1\\)': 'XXX0', '\\(date2, time2\\)': 'XXX1'}

{'XXX0': '(date1, time1)', 'XXX1': '(date2, time2)'}

name,word
Oliver,water
Oliver,surf
Oliver,windsurf
Tom,football
Tom,striker
Tom,ball
Anna,mountain
Anna,wind
Anna,sun
Sara,basketball
Sara,nba
Sara,ball
Sara,"(date1, time1)"
Mark,informatic
Mark,web3.0
Mark,e-learning
Christian,doctor
Christian,medicine
Christian,"(date2, time2)"
Christian,date3
Christian,time3
Sergi,runner
Sergi,athletics

Thank you very much, this explanation has been of great help to me! — drakitus98, Jan 21 '21 at 10:21
I have a question that I cannot solve. There are some words that are in parentheses such as: (data, science). The problem is that it also separates them and I don't want that. I have tried to put some kind of condition to solve it but there is no way to do it, how could I solve this problem? — drakitus98, Jan 22 '21 at 11:50
I found similar questions which use very complex regex and they split only on space - so it would need many changes: [Splitting on spaces, except between certain characters](https://stackoverflow.com/questions/9644784/splitting-on-spaces-except-between-certain-characters). [Split on spaces not inside parentheses in Python](https://stackoverflow.com/questions/42070323/split-on-spaces-not-inside-parentheses-in-python). — furas, Jan 22 '21 at 13:58
Other idea is: first find all substrings with parentheses and remeber them in dictionary with unique keys - ie. `XXX1` for first substring, `XXX2` for second substrings, etc. - next put these keys in original text in place of substrings. Next use normal `split()`. And finally you have to put original substrings back in place of `XXX1`, `XXX2`, etc. — furas, Jan 22 '21 at 14:00

score 0 · Answer 2 · answered Jan 20 '21 at 22:32

I do not know much about pandas, but perhaps the following code is helpful for you.

import re

# [name,word]
data = [["Oliver", "water,surf,windsurf"],
        ["Tom", "football, striker, ball"],
        ["Anna", "mountain;wind;sun"],
        ["Sara", "basketball; nba; ball"],
        ["Mark", "informatic/web3.0/e-learning"],
        ["Christian", "doctor - medicine"],
        ["Sergi", "runner . athletics"]]

result = []

for item in data:
    words = re.split(r'\s*;\s*|\s*,\s*|/|\s+-\s+|\s+.\s+', item[1])
    result.extend([(item[0], w) for w in words])

You can split the words with the re-module. Then you get the result in a list of tuples.

How to split CSV content by punctuation marks

2 Answers2