Searching a file for words from a list

Question

I am trying to search for words in a file. Those words are stored in a separate list. The words that are found are stored in another list and that list is returned in the end.

The code looks like:

def scanEducation(file):
    education = []
    qualities = ["python", "java", "sql", "mysql", "sqlite", "c#", "c++", "c", "javascript", "pascal",
                 "html", "css", "jquery", "linux", "windows"]
    with open("C:\Users\Vadim\Desktop\Python\New_cvs\\" + file, 'r') as file1:
        for line in file1:
            for word in line.split():
                matching = [s for s in qualities if word.lower() in s]
                if matching is not None:
                    education.append(matching)
    return education

First, it returns a list with a bunch of empty "seats". Does that mean my comparison isn't working?

The result (scans 4 files):

"C:\Program Files (x86)\Python2\python.exe" C:/Users/Vadim/PycharmProjects/TestFiles/ReadTXT.py
[[], [], [], [], [], [], [], [], [], ['java', 'javascript']]
[[], [], [], [], [], [], [], [], [], ['pascal']]
[[], [], [], [], [], [], [], [], [], ['linux']]
[[], [], [], [], [], [], [], [], [], [], ['c#']]

Process finished with exit code 0

The input file contains:

Name: Some Name
Phone: 1234567890
email: some@email.com
python,excel,linux

Second issue: each file contains 3 different skills, but the function finds only 1 or 2. Is this also a bad comparison, or do I have a different error here?

I would expect the result to be a list of just the found skills, without the empty places, and to include all the skills in the file, not just some of them.

Edit: The function does find all the skills when I do word.split(', '), but I would like it to be more universal. What would be a good way to find those skills if I don't know exactly what will separate them?

Try splitting around commas instead of spaces. e.g., line.split() --> line.split(",") — Checkmate, Sep 25 '16 at 07:50
Also, instead of the line 'if matching is not None:', use 'if len(matching) != 0:' — Checkmate, Sep 25 '16 at 07:51
Thanks, It does show only the found skills without the empty ones. but still finds only 1 or 2 not all the 3 — Kiper, Sep 25 '16 at 07:53
To answer your edit question, look into regex. There's some super cool stuff you can do with regex splitting — Checkmate, Sep 25 '16 at 07:58

score 1 · Accepted Answer · edited May 23 '17 at 12:15


You get empty lists because an empty list is not None: a list comprehension always returns a list, so the check matching is not None is always true. What you probably want is to change the condition to the following:

if matching:
    # do your stuff
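
A quick check of the point above, as a minimal sketch with a made-up three-item list: a comprehension with no hits still produces a list, so comparing it with None never filters anything, while the truthiness test skips empty lists.

```python
qualities = ["python", "java", "linux"]  # shortened list, for illustration only

# A comprehension with no hits still yields a list object, never None.
matching = [s for s in qualities if s == "excel"]
print(matching is not None)  # True, so the original check always passes
print(bool(matching))        # False, so "if matching:" skips the empty list

hits = [s for s in qualities if s == "java"]
print(bool(hits))            # True
```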

It seems that you're checking whether each word is a substring of the strings in the qualities list, which might not be what you want: "java" is a substring of both "java" and "javascript", which is why both show up in your output. If you want to keep the words on a line that appear in the qualities list, change your list comprehension to:

words = line.split()
match = [word for word in words if word.lower() in qualities]
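
To make the difference concrete, here is a small sketch comparing the substring test from the question with an exact comparison. The shortened list and the sample token are assumptions for illustration; the token is what line.split() yields for the sample line, since that line contains no spaces.

```python
qualities = ["python", "java", "javascript", "c", "c#", "c++"]

word = "java"
# Substring test against each entry: "java" is inside "javascript" too.
substring_hits = [s for s in qualities if word in s]
# Exact comparison: only the identical entry matches.
exact_hits = [s for s in qualities if word == s]

print(substring_hits)  # ['java', 'javascript']
print(exact_hits)      # ['java']

# The comma-separated line has no spaces, so it stays one token.
token = "python,excel,linux".split()[0]
print([s for s in qualities if token in s])  # []
```

The last line also suggests why some skills go missing: an unsplit token such as "python,excel,linux" is not a substring of any single entry.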

If you want to split on both commas and spaces, look into regular expressions. See the question "Split Strings with Multiple Delimiters?".
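
One possible way to do that, sketched under a few assumptions: the helper name find_skills, the shortened skill set, and the sample lines are all hypothetical. The pattern [,\s]+ splits on runs of commas and whitespace only, so tokens such as "c#" and "c++" keep their punctuation.

```python
import re

QUALITIES = {"python", "java", "sql", "c#", "c++", "c", "linux"}  # shortened list for the example

def find_skills(lines):
    """Return the known skills found in an iterable of text lines (hypothetical helper)."""
    found = []
    for line in lines:
        # Split on runs of commas and/or whitespace; "#" and "+" stay inside tokens.
        for token in re.split(r"[,\s]+", line.strip().lower()):
            if token in QUALITIES and token not in found:
                found.append(token)
    return found

sample = ["Name: Some Name\n", "python,excel,linux\n", "C++, c# java\n"]
print(find_skills(sample))  # ['python', 'linux', 'c++', 'c#', 'java']
```

Because the function only iterates over lines, an open file object can be passed in place of the sample list.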

edited May 23 '17 at 12:15 by Community · answered Sep 25 '16 at 07:56 by krato

Thanks! Your code returned the first skill in the file but not the rest. – Kiper Sep 25 '16 at 08:14
@Kiper I used `line.split()`, which by default splits the line on whitespace. If your input file uses commas, use `split(',')`. Maybe you have to look at regex if you have a variety of separators. – krato Sep 25 '16 at 08:17
Thanks a lot. If I wanted to combine a regex in there, should it go inside line.split(here?) or should it be separate? – Kiper Sep 25 '16 at 08:20
You might not use `str.split()` at all; maybe use something from the `re` module. See http://stackoverflow.com/questions/1059559/python-split-strings-with-multiple-delimiters. In your case, maybe this would work: `words = re.split(r'\W+', line)`. – krato Sep 25 '16 at 08:22

Checkmate · Answer 2 · 2016-09-25T08:26:03.137


The code should be written as follows (if I understand the desired output format correctly):

def scanEducation(file):
    education = []
    qualities = ["python", "java", "sql", "mysql", "sqlite", "c#", "c++", "c", "javascript", "pascal",
                 "html", "css", "jquery", "linux", "windows"]
    # Raw string so the backslashes in the Windows path are taken literally.
    with open(r"C:\Users\Vadim\Desktop\Python\New_cvs" + "\\" + file, 'r') as file1:
        for line in file1:
            matching = []
            # strip() removes the trailing newline before splitting on commas
            for word in line.strip().split(","):
                word = word.strip().lower()
                if word in qualities:
                    matching.append(word)
            if matching:  # skip lines with no known skills
                education.append(matching)
    return education

edited Sep 25 '16 at 08:26 · answered Sep 25 '16 at 07:56 by Checkmate

Is the "matching =.." line correct? I got errors using it. – Kiper Sep 25 '16 at 08:10
This is what I get for not testing my code XD. This should work, sorry about that! – Checkmate Sep 25 '16 at 08:14
Thanks. All the answers here return only the first skill of the file, but not the other 2. Am I doing something wrong? – Kiper Sep 25 '16 at 08:18
It worked for me when I ran this code. Make sure you're opening the right file. Also, put the qualities array declaration into one line. – Checkmate Sep 25 '16 at 08:22
... I forgot, you might want to strip() the line to get rid of a newline character... That might be your problem – Checkmate Sep 25 '16 at 08:26

score 1 · Answer 3 · edited Sep 25 '16 at 08:59

First of all, you are getting a bunch of "empty seats" because your condition is not defined correctly. If matching is an empty list, it is not None. That is: [] is not None evaluates to True. This is why you are getting all these "empty seats".

Second, the condition in your list comprehension is also not what you want. Unless I've misunderstood your goal here, the condition you are looking for is this:

[s for s in qualities if word.lower() == s]

This checks the list of qualities and returns a non-empty list only if the word is one of the qualities. However, since the length of this list will always be either 1 (if there's a match) or 0 (if there isn't), we can replace it with a boolean by using Python's built-in any() function:

if any(s == word.lower() for s in qualities):
    education.append(word)

I hope this helps. Please don't hesitate to ask follow-up questions, or tell me if I've misunderstood your goals.

For your convenience, here is the modified source I used to check it myself:

def scanEducation(file):
    education = []
    qualities = ["python", "java", "sql", "mysql", "sqlite", "c#", "c++", "c", "javascript", "pascal",
             "html", "css", "jquery", "linux", "windows"]
    with open(file, 'r') as file1:
        for line in file1:
            for word in line.split():
                if any(s == word.lower() for s in qualities):
                    education.append(word)
    return education

Thanks, using your code it gave me the first skill in every file but without the 2 other. — Kiper, Sep 25 '16 at 08:08

score 1 · Answer 4 · answered Sep 25 '16 at 08:29

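You can also use a regular expression like this:

import re

def scan_education(file_name):
    qualities_list = ["python", "java", "sql", "mysql", "sqlite", "c#", "c++", "c", "javascript", "pascal",
                      "html", "css", "jquery", "linux", "windows"]
    # Longest names first so "c++" and "c#" are tried before "c"; re.escape handles "+" and "#".
    alternatives = '|'.join(re.escape(q) for q in sorted(qualities_list, key=len, reverse=True))
    # \b does not match after "+" or "#", so use lookarounds for the word edges instead.
    qualities = re.compile(r'(?<!\w)(?:%s)(?!\w)' % alternatives)
    education = []
    with open(file_name, 'r') as f:
        for line in f:
            education += qualities.findall(line.lower())
    return list(set(education))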
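
One thing to watch with a pattern of this shape is how \b behaves next to non-word characters such as "+" and "#". The two-alternative patterns below are simplified stand-ins for illustration, not the full skill list: \b needs a word character on exactly one side, so it cannot match between "+" and ",", whereas a lookaround pair only requires that no word character touches the match.

```python
import re

text = "c++,java"

# With \b, "c++" cannot end at the comma, so the engine falls back to plain "c".
with_boundaries = re.findall(r'\b(?:c\+\+|c)\b', text)
print(with_boundaries)   # ['c']

# Lookarounds only forbid an adjacent word character, so "c++" matches whole.
with_lookarounds = re.findall(r'(?<!\w)(?:c\+\+|c)(?!\w)', text)
print(with_lookarounds)  # ['c++']
```

Listing the longer alternative first matters here: with "c" first, the lookaround version would also stop at "c".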

score 1 · Answer 5 · answered Sep 25 '16 at 08:32

Here's a short example of using sets, plus a little character filtering with a generator expression, to find the common words between a text file (or, as a fallback here, a plain text string) and a list that you provide. This is faster and imho clearer than trying to use a loop.

import string

try:
    with open('myfile.txt') as f:
        text = f.read()
except IOError:  # fall back to sample text if the file cannot be read
    text = "harry met sally; the boys went to the park.  my friend is purple?"

my_words = set(("harry", "george", "phil", "green", "purple", "blue"))

text = ''.join(x for x in text if x in string.ascii_letters or x in string.whitespace)

text = set(text.split()) # split on any whitespace

common_words = my_words & text # my_words.intersection(text) also does the same

print(common_words)
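
The same set idea can be applied to the skills from the question. This is only a sketch: the shortened skill set and the inline text, copied from the sample input file, are assumptions. Unlike the letter filter above, it tokenizes on commas and whitespace so that names such as "c#" survive intact.

```python
import re

qualities = {"python", "java", "sql", "c#", "c++", "c", "linux", "windows"}
text = "Name: Some Name\nPhone: 1234567890\nemail: some@email.com\npython,excel,linux\n"

# Tokenize on commas and whitespace only, then intersect with the known skills.
tokens = set(re.split(r"[,\s]+", text.lower()))
found = qualities & tokens

print(sorted(found))  # ['linux', 'python']
```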
