
So, I have a dataframe which looks like this:

[screenshot of the dataframe]

I want to separate the values in 'Filename' column into strings based on "-" and "." and also remove the extension name. Then I want to separate the values in 'Path' column into strings based on "\" and ":". How do I do this?

  • Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve), and revise your question accordingly. – yatu May 29 '20 at 13:46
  • Question has nothing to do with `machine-learning` or `nlp` – kindly do not spam irrelevant tags (removed). – desertnaut May 29 '20 at 13:55

1 Answer


It's not entirely clear what you're looking for here. But here's my best interpretation.

Setup:

import os
import pandas as pd

df = pd.DataFrame({
    "Filename": ["doc-hi.txt", "oh-my-god.txt"],
    # raw strings so the backslashes are kept literally
    "Path": [r"C:\asdf\asdf\asdf\kd.txt", r"C:\asdcsc.docx"]
})

Separate strings

# "separate the values in 'Filename' column into strings based on '-' and '.' and also remove the extension name"
df["Filename_split"] = df["Filename"].apply(lambda _: os.path.splitext(_)[0]).str.split(r'\.|-')

# "separate the values in 'Path' column into strings based on '\' and ':'"
df["Path_split"] = df["Path"].str.split(r'\\|:')

Intermediate Output

    Filename        Path                        Filename_split  Path_split
0   doc-hi.txt      C:\asdf\asdf\asdf\kd.txt    [doc, hi]       [C, , asdf, asdf, asdf, kd.txt]
1   oh-my-god.txt   C:\asdcsc.docx              [oh, my, god]   [C, , asdcsc.docx]

Combining tokens back together

To combine the lists of strings back into single space-separated strings, use str.join:

df['Filename_split'].str.join(' ')
df['Path_split'].str.join(' ')
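
For example, you can assign the joined strings back to new columns (the `_joined` column names below are just for illustration):

df["Filename_joined"] = df["Filename_split"].str.join(" ")
df["Path_joined"] = df["Path_split"].str.join(" ")
# Filename_joined now holds plain strings, e.g. "doc hi" and "oh my god"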
Ian
  • Thank you! Although your answer is helpful, I don't want the result to be inside a list, but just plain space-separated strings, so that I can vectorize the column using a TFIDF vectorizer. –  May 29 '20 at 14:06
  • Strange that you can't use lists as input to the TFIDF Vectorizer - since what I've given you is the tokenized (1-gram) version of the string. – Ian May 29 '20 at 14:17
  • https://stackoverflow.com/questions/48671270/use-sklearn-tfidfvectorizer-with-already-tokenized-inputs – Ian May 29 '20 at 14:17
  • In any case, I've added the code to join the strings back together – Ian May 29 '20 at 14:19
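
Regarding the last comments: scikit-learn's TfidfVectorizer can also consume already-tokenized lists if you pass it a callable analyzer. A minimal sketch, assuming the Filename_split column built above:

from sklearn.feature_extraction.text import TfidfVectorizer

# A callable analyzer bypasses sklearn's own preprocessing and tokenization,
# so each row's token list is used as-is.
vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
X = vectorizer.fit_transform(df["Filename_split"])
print(vectorizer.get_feature_names_out())  # requires scikit-learn >= 1.0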