I wrote a function to remove an image link from text data (strings stored in a pandas DataFrame column):

image_link_1 = 'â\x80¦IMAGEâ\x80¦' 
image_link_2 = 'IMAGE'

def remove_image(text):
    remove_im = ''.join([i for i in text if i not in image_link_1 and image_link_2])
    return remove_im

df['title_and_abstract'] = df['title_and_abstract'].apply(lambda x: remove_image(x))

The problem is that the function removes the first letter of some strings. In particular, it seems to remove only capital letters. Weird.

Here's an example:

'This is an example string. Here is the IMAGE.'

after the function is used:

'his is an example string. Here is the .'

I really don't get why this function does that.

Thank you in advance!

Epimetheus
  • Does this answer your question? [Why does `a == b or c or d` always evaluate to True?](https://stackoverflow.com/questions/20002503/why-does-a-b-or-c-or-d-always-evaluate-to-true) – quamrana Dec 15 '20 at 10:45
  • Did you try leaving a space between the ""? `remove_im = ' '.join([i for i in text if i not in image_link_1 and image_link_2])` – Ali Ülkü Dec 15 '20 at 10:45
  • Unrelated note: you can (and probably should) replace `lambda x: remove_image(x)` with just `remove_image`. – Roy Cohen Dec 15 '20 at 10:57
  • @AliÜlkü Yes, I did. – Epimetheus Dec 15 '20 at 11:08
  • @RoyCohen Thanks for your reply. I would be interested in a short explanation if you can spare some time :) – Epimetheus Dec 15 '20 at 11:13
  • @Epimetheus When writing `remove_image` (without the parentheses) you're referring to a function object. That object, when called, runs the body of the function. When writing `lambda x: remove_image(x)` you're referring to an anonymous function object. That object, when called, runs the body of the lambda expression, which calls the function. So both options, when called, will run the function. – Roy Cohen Dec 15 '20 at 11:29
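
A quick way to see the equivalence described in the comment above. This is only a sketch: the body of `remove_image` here is a stand-in for illustration (not the function from the question), `titles` is made-up sample data, and the built-in `map` is used in place of the pandas `apply` call so that the snippet runs without a DataFrame.

```python
def remove_image(text):
    # Stand-in body for illustration only; not the function from the question.
    return text.replace('IMAGE', '')

# Made-up sample data.
titles = ['Here is the IMAGE.', 'No link here']

# Passing the function object directly ...
direct = list(map(remove_image, titles))

# ... gives the same result as a lambda that only forwards its argument.
wrapped = list(map(lambda x: remove_image(x), titles))

print(direct == wrapped)
```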

2 Answers

1
  • `for i in text` breaks the text into individual characters; if you want words, that would be `for i in text.split()`
  • `and image_link_2` only checks whether `image_link_2` is non-empty, which is always true here, so the condition is really just `i not in image_link_1`; what you probably want is `if i not in [image_link_1, image_link_2]`
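
To make both points concrete, here is a small sketch using the two strings from the question. The sample `text` is made up for illustration; it starts with a capital `I` so the effect on a leading letter is visible.

```python
image_link_1 = 'â\x80¦IMAGEâ\x80¦'
image_link_2 = 'IMAGE'

# Made-up sample string for illustration.
text = 'Image data: see IMAGE'

# Iterating over a string yields single characters, not words.
chars = [c for c in 'IMAGE']

# `c not in image_link_1 and image_link_2` is parsed as
# `(c not in image_link_1) and image_link_2`. The right operand is a
# non-empty string, so it is always truthy and filters nothing.
kept = ''.join(c for c in text if c not in image_link_1 and image_link_2)

# Dropping the `and image_link_2` part gives exactly the same result.
same = ''.join(c for c in text if c not in image_link_1)

print(repr(kept))
```

Every character that occurs anywhere in `image_link_1` is dropped, which is why capital letters such as `I`, `M`, `A`, `G` and `E` disappear from the start of some strings.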

Hopefully these will help you get unstuck?

Jiří Baum
  • Thanks for your fast reply. If I use the suggested form, `if i not in [image_link_1, image_link_2]`, the links don't get removed at all. – Epimetheus Dec 15 '20 at 11:11
  • Yeah, you'll probably need to combine them. You may want to print out some of the intermediate results, so you understand better what the code is doing... – Jiří Baum Dec 15 '20 at 11:21
  • Just because I'm curious: if one has 10+ words to remove, it looks infeasible to define every word as a separate string. That's why I tried to store several words in a single list of strings, but that didn't work. Do you have any thoughts on the issue? – Epimetheus Dec 15 '20 at 13:47
  • Yeah, you should be able to define the list of words as a list or tuple (or, if you need to treat them in different ways, as a dictionary). Then the condition would be something like: `if word not in image_links` – Jiří Baum Dec 16 '20 at 01:42
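
A minimal sketch of the list-or-tuple idea from the last comment. The names `image_links` and `remove_words` and the extra entries `'FIGURE'` and `'TABLE'` are assumptions made up for illustration; only the first two strings come from the question.

```python
# Hypothetical collection of words to drop; only the first two
# entries come from the question, the rest are made-up examples.
image_links = ('â\x80¦IMAGEâ\x80¦', 'IMAGE', 'FIGURE', 'TABLE')

def remove_words(text, unwanted=image_links):
    # Keep only the whitespace-separated words that are not in the collection.
    return ' '.join(word for word in text.split() if word not in unwanted)

print(remove_words('See IMAGE and FIGURE here'))
```

Adding another word then only means adding another entry to `image_links`.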
0

I'm also a fresh Python disciple, which is why I want to explain the answer in my own words; that might help people who find this thread in the future.

As the previous answer correctly said, the original function iterated over single characters (I, M, A, G, E) and not over words (IMAGE). This also led to the removal of every single character that appears in image_link_1 and image_link_2.

text.split() takes care of that problem, since the original string is split into words rather than characters.

The working code:

def remove_link(text): 
    remove_im = ' '.join([i for i in text.split() if i not in [image_link_1, image_link_2]])
    return remove_im

df['title_and_abstract'] = df['title_and_abstract'].apply(lambda x: remove_link(x))
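
One caveat with the word-based version: in the example from the question the link is followed by a period, so `split()` produces the word `'IMAGE.'`, which is not in the list and is therefore kept. If the markers should be removed wherever they occur, a substring replacement may fit better. The following is only a sketch of that alternative, not part of the answer above; `image_links`, `pattern` and `remove_link_substrings` are hypothetical names.

```python
import re

# Markers from the question, collected in one list (hypothetical name).
image_links = ['â\x80¦IMAGEâ\x80¦', 'IMAGE']

# Escape each marker so it is matched literally, and try longer markers
# first so the long form wins over its 'IMAGE' substring.
pattern = re.compile(
    '|'.join(re.escape(s) for s in sorted(image_links, key=len, reverse=True))
)

def remove_link_substrings(text):
    # Delete every occurrence of any marker, even next to punctuation.
    return pattern.sub('', text)

print(remove_link_substrings('This is an example string. Here is the IMAGE.'))
```

It can be applied to the column the same way, for example with `df['title_and_abstract'].apply(remove_link_substrings)`.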
Epimetheus