keep words present in a given vector and remove others

Question

I have a list of, say, 10,000 strings (A). I also have a vector of words (V).

I want to modify each string in A so that it keeps only the words that are present in V and drops all the others.

For example, say the first element of A is "one two three check test" and V is the vector ["one", "test", "nine"]. The modified first element of A should then be "one test". The process needs to be repeated for every string in A; V stays the same for every comparison.

I am doing something like the following (it may have some bugs, but I just want to give an idea of how I am approaching the problem).

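import nltk

modified = []  # one filtered string per element of A

for i in range(len(A)):

    a = []

    text = nltk.word_tokenize(A[i])

    # separate index j, so the outer loop variable i is not overwritten
    for j in range(len(text)):
        if text[j] in V:
            a.append(text[j])

    a = " ".join(a)

    modified.append(a)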

The approach above is very slow and inefficient. How can I do this quickly and efficiently?

For each element you could use something like `' '.join(filter(lambda x: x in stopwords, element.split()))` if you don't have to worry about capitalization or anything like that. — miradulo, Feb 17 '16 at 12:15
Why not make both "vectors" a set and do set difference? [Here](http://stackoverflow.com/questions/19130512/stopword-removal-with-nltk) is a relevant question. — Michael Foukarakis, Feb 17 '16 at 12:29

poko · Answer 1 · 2016-02-17T12:24:20.477

0

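For a single item such as A[0] (note that a set intersection keeps each word only once and does not preserve the word order):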

' '.join(set(A[0].split(' ')).intersection(V))
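
To make the trade-off concrete, here is a small sketch with made-up data (the input string is an assumption chosen to contain a repeated word). It compares the set intersection with a plain membership filter:

```python
# Illustrative data; the names A and V follow the question.
A = ["one two one test check"]
V = ["one", "test", "nine"]
V_set = set(V)

# Set intersection: each kept word appears once, in arbitrary order.
as_set = set(A[0].split(' ')).intersection(V)

# Membership filter: keeps the original order and repeated words.
as_list = [w for w in A[0].split(' ') if w in V_set]

print(sorted(as_set))     # ['one', 'test']
print(' '.join(as_list))  # one one test
```

Which of the two is right depends on whether repeated words and ordering matter for the output.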

edited Feb 17 '16 at 12:24

answered Feb 17 '16 at 12:17


Tony Babarino · Answer 2 · 2016-02-17T12:43:05.637

Here is my attempt:

>>> A = ["aba reer sdasd bab", "adb bab ergekj aba erger"]
>>> V = ["aba","bab"]
>>> map((lambda z: ' '.join(z)), map((lambda x: filter(lambda y: y in V, x.split())), A))
['aba bab', 'bab aba']

The complexity is pretty bad, but to improve it you would have to give us more details, such as how long V is compared to the elements of A and whether you want the words to stay in their original order after the selection. It could be done faster using sets, but the words wouldn't be in their original order.
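
One possible middle ground, sketched here under the assumption of Python 3 (where map and filter return iterators rather than lists): use a set only for the membership test and still walk each string in order, so the original word order and repeated words survive. The function name keep_words is hypothetical, not something from the thread.

```python
def keep_words(strings, vocabulary):
    """Return each string reduced to the words found in vocabulary."""
    vocab = set(vocabulary)  # built once; lookups are O(1) on average
    return [' '.join(w for w in s.split() if w in vocab) for s in strings]


A = ["aba reer sdasd bab", "adb bab ergekj aba erger"]
V = ["aba", "bab"]
print(keep_words(A, V))  # ['aba bab', 'bab aba']
```

The output matches the map/filter version above; only the lookup structure changes.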

Hugues Fontenelle · Answer 3 · 2016-02-19T14:02:54.757

Learn about:

for loops. Python is not C; you usually don't need the "i" variable (http://www.tutorialspoint.com/python/python_loop_control.htm)
sets. Useful for intersections (https://docs.python.org/2/library/sets.html)
the fact that assigning to the loop variable does not modify the list in place (the assignment only rebinds the name, and the strings themselves are immutable); therefore you need to initialize a new list and append elements to it.

A = ["one two three check test", "one nine six seven", "one two six seven"]  
A_modified = list()  
V = ["one", "test", "nine"] 
V_set = set(V)  
for line in A:  
    text = set(line.split()) # or use NLTK, here I just wanted something that runs on all installs  
    A_modified.append(list(text.intersection(V_set)))

Note that line = list(text.intersection(V_set)) will NOT work: assigning to line only rebinds the loop variable and leaves the elements of A unchanged.
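
A short sketch of that behaviour, using illustrative data and a copy named B that is not part of the answer. Assigning to the loop variable has no effect on the list, while assigning through an index (for example via enumerate) does replace the element:

```python
A = ["one two three check test", "one nine six seven"]
V_set = set(["one", "test", "nine"])

# Rebinding the loop variable leaves the list untouched.
for line in A:
    line = [w for w in line.split() if w in V_set]
unchanged = list(A)

# Assigning through the index replaces each element of the list.
B = list(A)
for i, line in enumerate(B):
    B[i] = [w for w in line.split() if w in V_set]

print(unchanged)  # still the two original strings
print(B)          # [['one', 'test'], ['one', 'nine']]
```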

Edit:

Scope creep :-) Your original question wasn't specific enough, but if you want to keep the order as well as the non-unique elements, I'd do it with a list comprehension:

A_modified = list()  # start from an empty list again
for line in A:
    A_modified.append([word for word in line.split() if word in V_set])
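
For a rough feel of why the lookup structure matters, here is a timing sketch with synthetic data (the vocabulary, the test line and the helper names are all invented for illustration). A membership test against a list scans the list, while a set uses a hash lookup. Actual timings depend on the machine, so none are quoted here.

```python
import timeit

# Synthetic data, invented for illustration only.
V = ["word%d" % n for n in range(500)]
V_set = set(V)
line = " ".join("word%d" % n for n in range(0, 1000, 4))


def filter_with_list():
    return [w for w in line.split() if w in V]      # scans V for each word


def filter_with_set():
    return [w for w in line.split() if w in V_set]  # hash lookup per word


print("list:", timeit.timeit(filter_with_list, number=20))
print("set: ", timeit.timeit(filter_with_set, number=20))
```

Both functions return the same words in the same order; only the cost per lookup differs.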

Hi, this is working, but it changes the order of the words in the output. Also, if a word appears multiple times in the input string, the output shows it only once. Is there any workaround for this? — user3664020, Feb 18 '16 at 13:26

score 0 · Answer 4 · answered Feb 18 '16 at 17:08


Sets seem to be the appropriate data structure here:

A = ["aba reer sdasd bab", "adb bab ergekj aba erger", "aba", "bab" ]
V = ["aba","bab"]

vset = set(V)
for i in A:
    print(tuple(set(i.split()).intersection(vset)))

boardrider
