
I have a list of, say, 10,000 strings (A). I also have a vector of words (V).

I want to modify each string in A so that it keeps only the words that are present in V and drops the rest.

For example, suppose the first element of A is "one two three check test" and V is ["one", "test", "nine"]. The modified version of that element should then be "one test". The same process needs to be repeated for every string in A, and V stays the same for every comparison.

I am doing something like the following (it may have some bugs, but I just want to give an idea of how I am approaching the problem).

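import nltk

modified = []  # filtered strings, in the same order as A
for i in range(len(A)):
    kept = []
    text = nltk.word_tokenize(A[i])
    # use a separate index so the outer loop variable i is not overwritten
    for j in range(len(text)):
        if text[j] in V:  # V is a list, so this scans V for every word
            kept.append(text[j])
    modified.append(" ".join(kept))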

This approach is very slow and inefficient. How can I do this quickly and efficiently?

user3664020
  • For each element you could use something like `' '.join(filter(lambda x: x in stopwords, element.split()))` if you don't have to worry about capitalization or anything like that. – miradulo Feb 17 '16 at 12:15
  • Why not make both "vectors" a set and do set difference? [Here](http://stackoverflow.com/questions/19130512/stopword-removal-with-nltk) is a relevant question. – Michael Foukarakis Feb 17 '16 at 12:29

4 Answers


For a single item such as A[0] (going through a set drops duplicate words and does not preserve word order):

' '.join(set(A[0].split(' ')).intersection(V))
poko

Here is my attempt:

>>> A = ["aba reer sdasd bab", "adb bab ergekj aba erger"]
>>> V = ["aba","bab"]
>>> list(map(lambda z: ' '.join(z), map(lambda x: filter(lambda y: y in V, x.split()), A)))
['aba bab', 'bab aba']

The complexity is pretty bad, but to improve it you would have to give us more details, such as how large V is compared to the elements of A and whether you want the words to stay in their original order after the selection. Intersecting sets would be faster, but the words would no longer be in their original order.
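The order problem comes from turning each string into a set, not from using a set for V. A minimal sketch, assuming plain whitespace splitting is acceptable instead of nltk.word_tokenize (the function name keep_known_words is made up for illustration): V is converted to a set once for fast lookups, while each string is still walked word by word, so order and repeated words survive.

```python
def keep_known_words(strings, vocabulary):
    """Return each string reduced to the words that appear in vocabulary."""
    # Build the lookup set once; membership tests are O(1) on average.
    vocab = frozenset(vocabulary)
    # Walk the words in their original order, so order and duplicates are kept.
    return [" ".join(w for w in s.split() if w in vocab) for s in strings]


A = ["one two three check test", "test one nine one"]
V = ["one", "test", "nine"]
print(keep_known_words(A, V))  # ['one test', 'test one nine one']
```

If punctuation or capitalization matters, the tokenizer and a normalization step would need to be swapped in where s.split() is used.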

Tony Babarino

Learn about sets:

A = ["one two three check test", "one nine six seven", "one two six seven"]  
A_modified = list()  
V = ["one", "test", "nine"] 
V_set = set(V)  
for line in A:  
    text = set(line.split()) # or use NLTK, here I just wanted something that runs on all installs  
    A_modified.append(list(text.intersection(V_set))) 

Note that assigning line = list(text.intersection(V_set)) inside the loop will NOT update A: it only rebinds the loop variable, so the results have to be collected in a new list.
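To illustrate the note about assigning to line, here is a small sketch. It filters with a generator expression rather than a set intersection purely so the results are easy to compare; the name untouched is introduced only for this example.

```python
A = ["one two three check test", "one nine six seven"]
V_set = {"one", "test", "nine"}

# Rebinding the loop variable only changes what the name `line` points to;
# the list A is left exactly as it was.
for line in A:
    line = " ".join(w for w in line.split() if w in V_set)
untouched = list(A)

# Assigning through the index replaces the element stored in A.
for i, line in enumerate(A):
    A[i] = " ".join(w for w in line.split() if w in V_set)

print(untouched)  # ['one two three check test', 'one nine six seven']
print(A)          # ['one test', 'one nine']
```

Collecting into a new list, as the answer does with A_modified, avoids mutating A while it is being iterated.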

Edit:

Scope creep :-) Your original question wasn't specific enough, but if you want to keep the order as well as the non-unique elements, I'd do it with a list comprehension:

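A_modified = []  # start from an empty list again
for line in A:
    # look words up in V_set (defined above) so each test is a fast set lookup
    A_modified.append([word for word in line.split() if word in V_set])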
Hugues Fontenelle
  • Hi, this is working, but it changes the order of the words in the output. Also, if a word appears multiple times in the input string, the output shows it only once. Is there a workaround for this? – user3664020 Feb 18 '16 at 13:26

Sets seem to be the appropriate data structures here:

A = ["aba reer sdasd bab", "adb bab ergekj aba erger", "aba", "bab" ]
V = ["aba","bab"]

vset = set(V)
for i in A:
    print(tuple(set(i.split()).intersection(vset)))
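To make the cost difference between list and set lookups concrete, here is a rough timing sketch. The data is synthetic and the sizes are arbitrary assumptions, not taken from the question; the function names are made up for illustration. Both functions keep word order and duplicates and differ only in whether V is searched as a list or as a set.

```python
import timeit

# Synthetic data: 300 strings of 20 words each and a vocabulary of about 1,700 words.
words = ["w%d" % n for n in range(5000)]
A = [" ".join(words[i:i + 20]) for i in range(0, 600, 2)]
V = words[::3]
V_set = set(V)


def filter_with_list():
    # `w in V` scans the list, so the cost grows with len(V) for every word.
    return [" ".join(w for w in s.split() if w in V) for s in A]


def filter_with_set():
    # `w in V_set` is a hash lookup, roughly constant time per word.
    return [" ".join(w for w in s.split() if w in V_set) for s in A]


# Both variants must produce identical output.
assert filter_with_list() == filter_with_set()

print("list lookup:", timeit.timeit(filter_with_list, number=1))
print("set lookup: ", timeit.timeit(filter_with_set, number=1))
```

The absolute numbers depend on the machine and the data, but the gap generally widens as V grows, because the list version does work proportional to len(V) for each word.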
boardrider