I have a CSV file with a msg column that contains the following text:

muchloveandhugs
dudeseriously
onemorepersonforthewin
havefreebiewoohoothankgod
thisismybestcategory
yupbabe
didfreebee
heykidforget
hecomplainsaboutit

I know that nltk.corpus.words contains a large list of English words. My problem is how to apply it to the df['msg'] column so that the strings are split into separate words, such as:

df['msg']
much love and hugs
dude seriously
one more person for the win
Questions
  • The problem is broad and not well defined. For example, is `someone` one word or `some one`? You should share your existing code so there's somewhere to start with. – jpp Oct 15 '18 at 14:44
  • This is a complicated problem and prone to error since it relies heavily on probability. I found [this link](http://nbviewer.jupyter.org/url/norvig.com/ipython/How%20to%20Do%20Things%20with%20Words.ipynb#(5)-Task:-Word-Segmentation) that suggests an approach. Personally, I'd be tempted to just ask Google; it will split up such strings and offer a "do you mean" link. – kindall Oct 15 '18 at 14:48

1 Answer

Going by a related question about splitting strings with no spaces into words, and not knowing exactly what your data looks like, you could try the wordninja package:

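import pandas as pd
import wordninja  # third-party package: pip install wordninja

filename = 'mycsv.csv'  # Put your filename here

df = pd.read_csv(filename)
for wordstring in df['msg']:
    # wordninja.split returns the words it finds as a list of strings
    split = wordninja.split(wordstring)
    # Do something with split, e.g. ' '.join(split) to get spaced text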
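
If you would rather use a plain word list, as the question suggests with nltk.corpus.words, the idea can be sketched with a small dynamic-programming segmenter. This is only an illustration under assumptions: VOCAB is a hypothetical toy vocabulary standing in for a real word list, segment is a made-up helper name, and preferring the fewest words is one possible tie-break rather than the probabilistic approach linked in the comments.

```python
from functools import lru_cache

# Hypothetical toy vocabulary (an assumption for this sketch). In practice it
# could be built from a larger word list such as nltk.corpus.words, lowercased.
VOCAB = {
    "much", "love", "and", "hugs", "dude", "seriously",
    "one", "more", "person", "for", "the", "win",
}


def segment(text, vocab=VOCAB):
    """Split text into vocabulary words, preferring the fewest words.

    Returns a list of words, or None if the text cannot be fully covered
    by words from the vocabulary.
    """
    text = text.strip().lower()

    @lru_cache(maxsize=None)
    def best(start):
        # Best segmentation of text[start:], as a tuple of words.
        if start == len(text):
            return ()
        candidates = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in vocab:
                rest = best(end)
                if rest is not None:
                    candidates.append((word,) + rest)
        # Fewest words wins; None means no valid split from this position.
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result is not None else None
```

With a pandas column this could be applied as df['msg'].apply(segment), joining each resulting list with ' '.join where a split was found. Preferring the fewest words means a string like someone stays one word when both someone and some/one are in the vocabulary; with a real word list you will likely need frequency information to get sensible splits.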
Stephen C