
Hi, I have a pandas dataframe and a text file that look a little like this:

df:
+----------------------------------+
|           Description            |
+----------------------------------+
| hello this is a great test $5435 |
| this is an432 entry              |
| ...                              |
| entry number 43535               |
+----------------------------------+

txt:
word1
word2
word3
...
wordn

The actual descriptions are not important.

I want to go through each row in the df, split the description on ' ', and for each word keep it if it appears in the text file, otherwise drop it.

Example:

Suppose my text file looks like this

hello
this
is
a
test

and a description looks like this

"hello this is a great test $5435"

then the output would be "hello this is a test", because "great" and "$5435" are not in the text file.

I can write something like this:

def clean_string(rows):
    cleaned = []
    for row in rows:
        # keep only the words that appear in the word list
        cleansed_string = []
        for word in row.split():
            if word in text:
                cleansed_string.append(word)
        cleaned.append(' '.join(cleansed_string))
    return cleaned
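
For reference, `text` here is meant to be the set of words from the text file, and I would call it roughly like this:

with open('file.txt') as f:
    text = set(f.read().split())

df['Description'] = clean_string(df['Description'])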

But is there a better way to achieve this?

user9940344

1 Answer


Use:

with open('file.txt', encoding="utf8") as f:
    L = f.read().split('\n')

print (L)
['hello', 'this', 'is', 'a', 'test']

f = lambda x: ' '.join(y for y in x.split() if y in set(L))
df['Description'] = df['Description'].apply(f)

To improve performance, build the set once (so each membership test is a fast set lookup instead of rebuilding the set for every word) and use a list comprehension:

s = set(L)
df['Description'] = [' '.join(y for y in x.split() if y in s) for x in df['Description']]

print (df)
            Description
0  hello this is a test
1               this is
2                      
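
For completeness, here is a self-contained version of the snippet above with the question's sample data inlined, so it can be run directly and reproduces the printed output:

import pandas as pd

df = pd.DataFrame({'Description': ['hello this is a great test $5435',
                                   'this is an432 entry',
                                   'entry number 43535']})

s = {'hello', 'this', 'is', 'a', 'test'}   # contents of the example text file
df['Description'] = [' '.join(y for y in x.split() if y in s) for x in df['Description']]
print (df)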
jezrael
  • Thanks, I am getting the following error `UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 180050: character maps to ` I suppose this is due to one of the words in text file that can't be recognized, how can I make it recognize the character or else just ignore it? I tried adding `encoding="utf8"` but got the following `'utf-8' codec can't decode byte 0x82 in position 7012: invalid start byte`. – user9940344 Oct 07 '19 at 11:09
  • @user9940344 - Can you check [this](https://stackoverflow.com/a/48556203) ? – jezrael Oct 07 '19 at 11:28
  • Yes, that worked. The method works, it is just very, very slow (I have 250k descriptions and 500k words in my text file). Are there any other solutions? – user9940344 Oct 07 '19 at 11:31
  • @user9940344 - The list comprehension should be faster; I edited the answer. – jezrael Oct 07 '19 at 11:37
  • It seems faster, which is good, but the output is a df with the Description column where all the values are blank? – user9940344 Oct 07 '19 at 11:42
  • @user9940344 - It works nicely for me with the sample data; can you check whether the words in the text file actually match the split values in the column? – jezrael Oct 07 '19 at 11:43
  • Thank you, this works great (it was a casing issue, which I fixed). Wonderful solution. One last question: is there a way to track progress when running this on a lot of data, for instance printing progress to the console? I have used tqdm with pandas operations but I am not sure how to use it with list comprehensions. By the way, I will accept this answer when it lets me! – user9940344 Oct 07 '19 at 11:47
  • @user9940344 - I think you can use `df['Description'] = [' '.join(y for y in x.split() if y in s) for x in tqdm(df['Description'])]` or, if that does not work, `df['Description'] = [' '.join(y for y in x.split() if y in s) for x in tqdm(df['Description'].tolist())]` – jezrael Oct 07 '19 at 12:14
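
Pulling the follow-ups from this comment thread together (the decoding errors, the casing issue, and progress reporting), a combined version might look like the sketch below. It assumes `df` and `file.txt` from above; `errors='ignore'` simply drops undecodable bytes, which may or may not be the right fix for this particular file, and lowercasing both sides is just one way to make the matching case-insensitive.

from tqdm import tqdm

# read the word list, skipping any bytes that cannot be decoded as UTF-8
with open('file.txt', encoding='utf8', errors='ignore') as f:
    s = {line.strip().lower() for line in f if line.strip()}

# lowercase each word before the membership test so matching is case-insensitive;
# tqdm wraps the iterable to print a progress bar over the rows
df['Description'] = [' '.join(y for y in x.split() if y.lower() in s)
                     for x in tqdm(df['Description'])]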