
I'm writing a Python script to scrape another website, and I've run into a problem that I haven't been able to solve.

Say I want to replace a particular string with something else:

word_replace_1 = 'dv'
namelist = soup.title.string.replace(word_replace_1,'11dv')

The script works fine when the titles are dv234, dv123, etc. The output is 11dv234, 11dv123.

However, if titles like dv234 are mixed with titles like dvab123, the script also turns dvab123 into 11dvab123, even though I never asked for dvab to be replaced. What should I do here?

Also, if the title is a combination of letters, numbers, and Korean characters, say DAV123ㄱㄴㄷ, how can I make it output only DAV123 and add a - between the letters and the numbers?
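For context, str.replace substitutes the substring dv wherever it occurs, which is why dvab123 is changed as well. One possible fix is a regex with a lookahead, sketched below under the assumption that the codes to be prefixed are always dv followed directly by digits (the question does not state this). The names DV_PREFIX and add_prefix are made up for illustration.

```python
import re

# Assumption: only "dv" followed directly by a digit should be prefixed.
# \b keeps the pattern from matching inside a longer word,
# and (?=[0-9]) requires a digit right after "dv" without consuming it.
DV_PREFIX = re.compile(r'\bdv(?=[0-9])')

def add_prefix(title):
    """Return title with '11' put in front of every dv<digits> code."""
    return DV_PREFIX.sub('11dv', title)

print(add_prefix('dv234, dvab123'))
```

Because dvab123 has a letter after dv, the lookahead fails there and that title passes through unchanged.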
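A minimal sketch of one way to do this, assuming the part to keep is always a run of ASCII letters followed directly by a run of ASCII digits (that shape is inferred from the DAV123 example). CODE and extract_code are hypothetical names.

```python
import re

# Assumption: the wanted code is ASCII letters followed directly by ASCII digits.
# [0-9] is used instead of \d because \d also matches non-ASCII digits in Python 3.
CODE = re.compile(r'([A-Za-z]+)([0-9]+)')

def extract_code(title):
    """Return 'LETTERS-DIGITS' for the first code in title, or None if there is none."""
    match = CODE.search(title)
    if match is None:
        return None
    return f'{match.group(1)}-{match.group(2)}'

print(extract_code('DAV123ㄱㄴㄷ'))
```

Anything outside the matched code, such as the Korean characters, is simply not part of the result.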

Python - making a function that would add "-" between letters

This gave me the idea of adding - between all characters, but is there a way to add - only between a letter and a number?

The only way I can think of at the moment is to create a table of replacements, for example something like this:

word_replace_3 = 'a1'
word_replace_4 = 'a2'
.......

and then apply them like this:

namelist3 = soup.title.string.replace(word_replace_3,'a-1').replace(word_replace_4,'a-2')

This is slow and inefficient. What would be the best way to solve this?
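Instead of one replace call per letter and digit pair, a single regex substitution can cover every pair at once. This is only a sketch, assuming a hyphen is wanted wherever an ASCII letter is immediately followed by an ASCII digit; LETTER_DIGIT and hyphenate are illustrative names.

```python
import re

# One pattern replaces the whole a1 -> a-1, a2 -> a-2, ... table.
# Group 1 is the letter, group 2 is the digit that follows it.
LETTER_DIGIT = re.compile(r'([A-Za-z])([0-9])')

def hyphenate(text):
    """Insert '-' between every letter that is directly followed by a digit."""
    return LETTER_DIGIT.sub(r'\1-\2', text)

print(hyphenate('a1 a2 DAV123'))
```

Compiling the pattern once and reusing it avoids rebuilding it for every scraped title.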

Thanks.

  • Consider using regex – Emrah Diril Feb 05 '20 at 23:17
  • thanks @EmrahDiril. So say in order to differentiate dv and dvab, should I write something like word_replace_1 = re.compile(r'\Bdvab') word_replace_2 = re.compile(r'\Bdv') but word_replace_2 is still going to process dvab since it's really looking for dv? – I suck at this Feb 06 '20 at 03:28
  • @EmrahDiril I sorted out the dv/dvab problem by using an if statement with regex. I have another question. From time to time when web scraping, I get titles like "abcd 指定されたページが見つかりません". I want to skip that URL whenever I get a title like this, so I wrote if re.compile(r'指定されたページが見つかりません').finditer(movie_title_raw): pass. I expected it to skip only that title, but it skips the whole script. So I'm guessing Japanese characters are not handled the same way as the English alphabet. Can you point me in the right direction? Thanks – I suck at this Feb 06 '20 at 09:28
  • hmm I'm not sure how to handle japanese characters in regex. I would imagine they would just work but I guess not. Please write a separate question for it – Emrah Diril Feb 06 '20 at 16:05
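
On the Japanese title in the comments above: assuming Python 3, the characters themselves are probably not the issue. finditer returns an iterator object, and an iterator is truthy even when it yields no matches, so the if branch runs for every title. A hedged sketch using search instead follows; NOT_FOUND and is_missing_page are hypothetical names.

```python
import re

NOT_FOUND = re.compile('指定されたページが見つかりません')

def is_missing_page(title):
    """Return True if the title contains the 'page not found' message."""
    # search returns None when there is no match, so the result can be tested directly.
    return NOT_FOUND.search(title) is not None

# finditer always gives back a truthy iterator, even with zero matches.
print(bool(NOT_FOUND.finditer('abcd')))
print(is_missing_page('abcd 指定されたページが見つかりません'))
print(is_missing_page('abcd'))
```

Inside a loop over URLs, a check like this would normally be paired with continue rather than pass, since pass does nothing and the rest of the loop body still runs.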
