How do I parse a text file in Python to create a sorted list with duplicates removed?

Question

Here are four lines of a sample text file:

The Star Schema is the simplest style of data mart schema

The star schema consists of one or more fact tables referencing any number of dimension tables

Pay attention to bogus schema

Cheers

The Python code should create a list sorted in alphabetical order, as shown below, with duplicate words removed and capitalized words sorted first.

The final output should look like this:

["Cheers", "Pay", "Schema", "Star", "The", "any", "bogus", ...]

Is `schema != Schema`? – Padraic Cunningham Mar 21 '15 at 21:48

matino · Answer 1 · 2015-03-21T21:39:43.690

1

You can use sorted(s.split()) to sort the words of a string the way you want:

>>> s = 'The Star Schema is the simplest style of data mart schema'
>>> sorted(s.split())
['Schema', 'Star', 'The', 'data', 'is', 'mart', 'of', 'schema', 'simplest', 'style', 'the']

To remove the duplicates you can use a set. A set is unordered, however, so you need to convert it back into a list, which sorted does implicitly:

sorted(set(s.split()))

That should be the final answer.
Reading the string from the file should be straightforward.
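The answer leaves the file-reading step out, so here is a minimal sketch of how it might be put together. The function names and the sample text are made up for illustration, and the sketch assumes the file is small enough to read in one go.

```python
def sorted_unique_words(text):
    # split() with no argument splits on any whitespace, including newlines
    return sorted(set(text.split()))


def sorted_unique_words_from_file(path):
    # 'path' is a hypothetical file name supplied by the caller
    with open(path) as f:
        return sorted_unique_words(f.read())


# Illustrative sample based on lines from the question
sample = (
    "The Star Schema is the simplest style of data mart schema\n"
    "Pay attention to bogus schema\n"
    "Cheers\n"
)
print(sorted_unique_words(sample))
```

Capitalized words come first because uppercase letters sort before lowercase letters in Python's default string comparison.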

edited Mar 21 '15 at 21:39

answered Mar 21 '15 at 21:26

matino



`sorted(set(s.split(' ')))` should be enough. – Aristide Mar 21 '15 at 21:31

You don't need to specify the `" "`. You would also want to call `lower` on the string before splitting if Schema == schema. – Padraic Cunningham Mar 21 '15 at 21:38
You could also use `str.rstrip(string.punctuation)` to remove the punctuation from the words. – Padraic Cunningham Mar 21 '15 at 21:41
I just learned that sorted() works instead of sort(). Thank you for your candid response. I will give it a try. – Samuel Mar 22 '15 at 00:16
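The two suggestions in the comments above (lowercasing and stripping punctuation) could be combined roughly as follows. This is only a sketch: the function name and the `ignore_case` flag are invented here, and it uses `strip` rather than the `rstrip` mentioned in the comment so that leading punctuation is removed as well.

```python
import string


def normalized_words(text, ignore_case=False):
    # Lowercase first only if 'Schema' and 'schema' should count as one word
    if ignore_case:
        text = text.lower()
    # Strip punctuation from both ends of each whitespace-separated word
    stripped = (word.strip(string.punctuation) for word in text.split())
    # The set removes duplicates; words that were pure punctuation are dropped
    return sorted({word for word in stripped if word})


print(normalized_words("Pay attention to bogus schema, Schema."))
print(normalized_words("Pay attention to bogus schema, Schema.", ignore_case=True))
```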

score 0 · Accepted Answer · answered Mar 21 '15 at 21:30


Steps you should follow:

read the file into a list of lines
parse the lines into words, and add them to your list
remove duplicates
sort

This code should do it:

import re  # regular expressions, used here to pick out the words

all_words = []  # list that collects every word in the file

with open('my_file.txt') as f:  # the with block closes the file automatically
    for line in f:
        # \w+ matches runs of letters, digits and underscores, so punctuation
        # is skipped and blank lines contribute no empty strings
        words = re.findall(r'\w+', line)
        all_words += words

# looping is done now, and all lines have been read

all_words = set(all_words)  # remove duplicates
all_words = sorted(all_words)  # sort (capitalized words will come first)

answered Mar 21 '15 at 21:30

Harold Ship


Excellent response. It worked like magic. I wish I could think like you do. Please watch out for more questions; I may count on you for a solution like this. However, how can I solve this without using the "import re" library? How do I make the best use of append() to append all words to the list? – Samuel Mar 22 '15 at 00:10
Instead of "all_words+=words" in your code above, why does this not work: all_words = all_words.append(words)? To reiterate my question, how do I use the append() method to add all the words to the list as I read and parse each line? Just learning from the master for educational purposes. Thanks. – Samuel Mar 22 '15 at 01:05
@Samuel `all_words.append(words)` would add the *list* `words` as a member of `all_words`. What you want is to add *each element* of `words` into `all_words`. You can do this with `all_words.extend(words)`. There's a good example right here: [http://stackoverflow.com/questions/252703/python-append-vs-extend](http://stackoverflow.com/questions/252703/python-append-vs-extend) – Harold Ship Mar 22 '15 at 05:22
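To make the difference between append and extend concrete, here is a small illustration with made-up values. It also shows why assigning the result of append back to the list loses the list: `list.append` modifies the list in place and returns `None`.

```python
words = ["Star", "Schema"]

appended = ["The"]
appended.append(words)   # adds the list itself as a single element
print(appended)          # ['The', ['Star', 'Schema']]

extended = ["The"]
extended.extend(words)   # adds each element of words individually
print(extended)          # ['The', 'Star', 'Schema']

# append returns None, so "all_words = all_words.append(...)"
# would replace the list with None
result = ["The"].append("Star")
print(result)            # None
```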
Harold, pardon me for writing back to you. It did work, but here is what I was actually looking for. My sincere apologies, I should have been explicit from the beginning. Open a text file for reading, create an empty list, and populate the list as I read the entire file line by line. Split the first line into words and put them into the empty list, leaving out any duplicates. Then read the second line: if a word already exists in the list, skip it since we don't want duplicates, and if it is new, append it to the list. Finally, sort and print. – Samuel Mar 23 '15 at 17:59
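The line-by-line approach described in this comment, using only append() and no `re` import, might look something like the sketch below. The function name and sample lines are hypothetical. Note that without `re`, punctuation attached to a word stays part of the word, and the `not in` check scans the whole list each time, which is fine for small files.

```python
def unique_sorted_words(lines):
    all_words = []  # start with an empty list
    for line in lines:
        # split() with no argument splits on whitespace and ignores blank lines
        for word in line.split():
            if word not in all_words:   # skip words already collected
                all_words.append(word)  # append one word at a time
    all_words.sort()  # sorts in place; capitalized words come first
    return all_words


# Illustrative input; any iterable of lines works
sample_lines = [
    "The Star Schema is the simplest",
    "The star schema",
    "Cheers",
]
print(unique_sorted_words(sample_lines))
```

An open file object is itself an iterable of lines, so it could be passed in directly, for example inside `with open('my_file.txt') as f:` using the file name from the answer above.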
