
I was given a long .txt file that, when read, returns one long string: a large corpus of words separated by \n, as shown:

\na+\nabound\nabounds\nabundance\nabundant\naccessable\naccessible\nacclaim\nacclaimed\nacclamation\naccolade\naccolades\naccommodative\naccomodative\naccomplish\naccomplished\naccomplishment...\nworld-famous\nworth\nworth-while\nworthiness\nworthwhile\nworthy\nwow\nwowed\nwowing\nwows\nyay\nyouthful\nzeal\nzenith\nzest\nzippy\n

I need to split this string into a list of these words, but none of the methods I usually use for .csv files work. I have tried strip(), replace(), split(), and splitlines(), and nothing breaks this into a list of words. I would be grateful for any assistance.

punctuation_chars = ["'", '"', ",", ".", "!", ":", ";", '#', '[', ']', '@']
punctuation_chars2 = ["'", '"', ",", ".", "!", ":", ";", '#', '[', ']', '@', '\n']
# list of positive words to use
positive_words = []
wrd_list = []
new_list = []
with open("positive_words.txt", 'r', encoding="utf-16") as pos_f:
    for lin in pos_f:
        if lin[0] != ';' and lin[0] != '\n':
            positive_words.append(lin.strip())

    pos_wrds = positive_words[0]
    pos_wrds.strip()
    print(pos_wrds)
    for p in punctuation_chars:
        pos_wrds = pos_wrds.replace(p, "")
    print(pos_wrds)

wrd_list = pos_wrds.splitlines()
new_list = wrd_list[-1].splitlines

I would like to see a Python list with each word as a separate element: ['a+', 'abound', 'abounds', 'abundance', 'abundant', ...]

Prune
chaza68
  • You mean `pos_f.read().split('\n')` doesn't work...? – gmds Apr 08 '19 at 22:38
  • `\n` is first char in your text so you can cut it from text - `text[0]` - and use with `split(text[0])`. Maybe it is not the same char as `'\n'`. Some time ago Windows was using `"\r\n"`, Linux was using `"\n"` and Mac was using `"\r"`. You could also check code of this char `ord(text[0])` and compare with `ord("\n")` – furas Apr 08 '19 at 22:39
  • `alist = list(open("my_file.txt"))` – Joran Beasley Apr 08 '19 at 22:42
  • if you see `\n` in text then it is not `"new line"` but normal text `"\\n"` - try `split("\\n")` – furas Apr 08 '19 at 22:43
  • Right: you've tried to split on the `newline` character, but your file apparently contains the individual characters `\\` and 'n'. – Prune Apr 08 '19 at 22:46
  • Thank you thank you all! I really appreciate this guidance. What worked was split("\\n"). I kept trying to replace it as regular txt with a space but that didn't work. I have almost exclusively worked with numbers and data for my entire life so working with .txt strings is challenging for me and why I am studying it now. – chaza68 Apr 08 '19 at 23:33
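
The checks suggested in the comments above can be sketched as follows. This is a minimal illustration, not the asker's actual data: the short raw string `text` is a hypothetical stand-in for the file contents, assuming the file really holds a backslash followed by `n` rather than a newline character.

```python
# Hypothetical sample standing in for the file contents. A raw string keeps
# each "\n" as two characters (backslash + "n"), the situation assumed here.
text = r"\na+\nabound\nabounds\nzippy\n"

print(repr(text[:6]))            # repr() makes a literal backslash visible
print(ord(text[0]), ord("\n"))   # code of the first char versus a real newline
print("\n" in text)              # is a real newline present?
print("\\n" in text)             # is the two-character sequence present?

# Split on the two-character sequence and drop the empty strings
# produced by the leading and trailing separators.
words = [w for w in text.split("\\n") if w]
print(words)
```

If `ord(text[0])` is 92 instead of 10, the separator is a literal backslash, and `split("\\n")` is the call to use.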

2 Answers


splitlines works pretty well:

In [1]: text = "\na+\nabound\nabounds\nabundance\nabundant\naccessable\naccessible\nacclaim\nacclaimed\nacclamation\naccolade\naccolades\naccommodative\naccomodative\naccomplish\naccomplished\naccomplishment...\nworld-famous\nworth\nworth-while\nworthiness\nworthwhile\nworthy\nwow\nwowed\nwowing\nwows\nyay\nyouthful\nzeal\nzenith\nzest\nzippy\n"

In [2]: text.splitlines()
Out[2]:
['',
 'a+',
 'abound',
 'abounds',
 'abundance',
 'abundant',
 'accessable',
 'accessible',
 'acclaim',
 'acclaimed',
 'acclamation',
 'accolade',
 'accolades',
 'accommodative',
 'accomodative',
 'accomplish',
 'accomplished',
 'accomplishment...',
 'world-famous',
 'worth',
 'worth-while',
 'worthiness',
 'worthwhile',
 'worthy',
 'wow',
 'wowed',
 'wowing',
 'wows',
 'yay',
 'youthful',
 'zeal',
 'zenith',
 'zest',
 'zippy']
Prashanti
  • I don't understand why this is not working for me. I have tried this in my code and it didn't result in a list like your response above. – chaza68 Apr 08 '19 at 23:34
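
A possible explanation for the difference, sketched under the assumption (supported by the comments on the question) that the file holds a literal backslash followed by `n`: a string typed in the interpreter has its `\n` escapes converted to real newlines, while text read from such a file does not. The variable names `typed` and `from_file` are illustrative only.

```python
# A normal string literal: Python turns each "\n" into one newline character.
typed = "\na+\nabound\n"
print(typed.splitlines())       # splits into separate words

# A raw string keeps backslash + "n" as two characters, which is what
# the text read from the file is assumed to look like.
from_file = r"\na+\nabound\n"
print(from_file.splitlines())   # one element: nothing to split on
print(from_file.split("\\n"))   # splits on the two-character sequence
```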

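str.splitlines() splits a string on real line-break characters, which is how the lines of a text file are normally delimited.

A text file in Python is an ordered sequence of lines, each normally terminated by a "\n" newline character. Your file is different: it contains the two literal characters backslash and "n" rather than real newlines, so the whole corpus is read as a single line. That is why positive_words.append(lin.split('\\n')) works: in Python source you must escape the backslash ('\\n') for it to be treated as a literal backslash followed by "n" and not as the newline character "\n".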

'''
print('\na+\nabound\nabounds\nabundance\nabundant\naccessable\naccessible\nacclaim\nacclaimed\nacclamation\naccolade\naccolades\naccommodative\naccomodative\naccomplish\naccomplished\naccomplishment...\nworld-famous\nworth\nworth-while\nworthiness\nworthwhile\nworthy\nwow\nwowed\nwowing\nwows\nyay\nyouthful\nzeal\nzenith\nzest\nzippy\n')
'''

# punctuation_chars = ["'", '"', ",", ".", "!", ":", ";", '#', '[',']','@']
# punctuation_chars2=["'", '"', ",", ".", "!",":",";",'#','[',']','@','\n']

# list of positive words to use
positive_words = []
wrd_list = []
new_list = []
with open("positive_words.txt", 'r', encoding="utf-8") as pos_f:
    for lin in pos_f:
        positive_words.append(lin.split('\\n'))

    pos_wrds = positive_words[0]

print(pos_wrds)

#    for p in punctuation_chars:
#        pos_wrds = pos_wrds.replace(p,"----")
#    print(pos_wrds)

# wrd_list = pos_wrds.splitlines(0)
# new_list = wrd_list[-1].splitlines()

The last six lines of your code need to be modified: once pos_wrds is a list, calling string methods such as replace() and splitlines() on it raises an AttributeError.
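
As a rough sketch of that change, the string method can be applied to each element of the list rather than to the list itself. The contents of `pos_wrds` below are a hypothetical split result, not the real file.

```python
# Hypothetical result of splitting the file contents on "\\n".
pos_wrds = ["", "a+", "abound", "accomplishment...", ""]

# pos_wrds.replace(".", "") would raise AttributeError: lists have no replace().
# Apply the string method to each element instead, and skip empty strings.
cleaned = [w.replace(".", "") for w in pos_wrds if w]
print(cleaned)
```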

You need to test for punctuation and non-alphanumeric characters explicitly, because your file has punctuation in one element "accomplishment..." and "a+" in another.

Test each item in the pos_wrds list separately as a string. Also, your punctuation list includes "\n", which is a control character rather than a punctuation character.

If you really need to test for punctuation, use Python's string module, whose string.punctuation constant holds the ASCII punctuation characters.

See "Best way to strip punctuation from a string in Python" for more information on the string module. It is awesomely powerful!
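
Two ways of handling punctuation with the standard library, shown as a sketch on a hypothetical four-word sample. Which one fits depends on whether entries such as "a+" and "world-famous" should keep their symbols, which the question does not settle.

```python
import string

# Hypothetical sample of words taken from the list in the question.
words = ["a+", "abound", "accomplishment...", "world-famous"]

# Option 1: remove every ASCII punctuation character, wherever it occurs.
# Note that this also removes "+" and "-".
table = str.maketrans("", "", string.punctuation)
stripped_all = [w.translate(table) for w in words]
print(stripped_all)

# Option 2: trim only trailing dots, which keeps "a+" and "world-famous" intact.
trimmed = [w.rstrip(".") for w in words]
print(trimmed)
```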

Rich Lysakowski PhD