
So, I have a dataframe which looks like this:

[screenshot of the dataframe]

I want to separate the values in 'Filename' column into strings based on "-" and "." and also remove the extension name. Then I want to separate the values in 'Path' column into strings based on "\" and ":". How do I do this?

  • Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve), and revise your question accordingly. – yatu May 29 '20 at 13:46
  • Question has nothing to do with `machine-learning` or `nlp` – kindly do not spam irrelevant tags (removed). – desertnaut May 29 '20 at 13:55

1 Answer


It's not entirely clear what you're looking for here. But here's my best interpretation.

Setup:

import os
import pandas as pd

df = pd.DataFrame({
    "Filename": ["doc-hi.txt", "oh-my-god.txt"],
    # raw strings so the backslashes are kept literally
    "Path": [r"C:\asdf\asdf\asdf\kd.txt", r"C:\asdcsc.docx"]
})

Separate strings

# "separate the values in 'Filename' column into strings based on '-' and '.' and also remove the extension name"
df["Filename_split"] = df["Filename"].apply(lambda _: os.path.splitext(_)[0]).str.split(r'\.|-')

# "separate the values in 'Path' column into strings based on '\' and ':'"
df["Path_split"] = df["Path"].str.split(r'\\|:')

Intermediate Output

    Filename        Path                        Filename_split  Path_split
0   doc-hi.txt      C:\asdf\asdf\asdf\kd.txt    [doc, hi]       [C, , asdf, asdf, asdf, kd.txt]
1   oh-my-god.txt   C:\asdcsc.docx              [oh, my, god]   [C, , asdcsc.docx]

Combining tokens back together

To combine the lists of strings back into single space-separated strings, use str.join:

df['Filename_split'].str.join(' ')
df['Path_split'].str.join(' ')
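
For example, you can assign the joined strings back to new columns (the `_joined` column names below are just for illustration):

df["Filename_joined"] = df["Filename_split"].str.join(" ")
df["Path_joined"] = df["Path_split"].str.join(" ")
# Filename_joined now holds plain strings, e.g. "doc hi" and "oh my god"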
Ian
  • Thank you! Although your answer is helpful, I don't want the result to be inside a list, but just plain space-separated strings, so that I can vectorize the column using a TFIDF vectorizer. –  May 29 '20 at 14:06
  • Strange that you can't use lists as input to the TFIDF Vectorizer - since what I've given you is the tokenized (1-gram) version of the string. – Ian May 29 '20 at 14:17
  • https://stackoverflow.com/questions/48671270/use-sklearn-tfidfvectorizer-with-already-tokenized-inputs – Ian May 29 '20 at 14:17
  • In any case, I've added the code to join the strings back together – Ian May 29 '20 at 14:19
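
Regarding the last comments: scikit-learn's TfidfVectorizer can also consume already-tokenized lists if you pass it a callable analyzer. A minimal sketch, assuming the Filename_split column built above:

from sklearn.feature_extraction.text import TfidfVectorizer

# A callable analyzer bypasses sklearn's own preprocessing and tokenization,
# so each row's token list is used as-is.
vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
X = vectorizer.fit_transform(df["Filename_split"])
print(vectorizer.get_feature_names_out())  # requires scikit-learn >= 1.0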