Is there a way to replace all types of hyphens with the plain ASCII '-'? I am looking for something like this, which works for whitespace:

txt = re.sub(r'[\s]+',' ',txt)

I believe that some non-ASCII hyphens are preventing certain stopwords (project names joined by hyphens) from being removed correctly.

For instance, I want to replace 'AR–L1003' with 'AR-L1003', and I want to do this across the entire text.

bad_coder
DanielTheRocketMan
  • Can you share a sample of the data that you wish to replace and the expected result? – ParvBanks Dec 12 '18 at 20:41
  • Why don't you look up all the hyphen characters that exist (https://en.wikipedia.org/wiki/Hyphen#Unicode) and put them in a regex `[ ]+`? – trincot Dec 12 '18 at 20:43
  • @trincot Yes, that was my question. I wonder whether there is something like \s that matches all hyphens. Maybe there is not! – DanielTheRocketMan Dec 12 '18 at 20:47

1 Answer

You can just list those hyphens in a character class. Here is one possible list; extend it to suit your needs:

txt = re.sub(r'[‐᠆﹣\-⁃−]+', '-', txt)
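
The character-class approach can also be sketched with explicit code-point escapes, which keeps the list readable in any editor and avoids an unescaped '-' inside the class being read as a range operator. The helper name `to_ascii_hyphen` and the particular selection of code points are assumptions made for illustration, not a definitive list:

```python
import re

# Illustrative selection of hyphen-like code points; extend as needed.
HYPHEN_LIKE = re.compile(
    "["
    "\u1806"  # Mongolian todo soft hyphen
    "\u2010"  # hyphen
    "\u2011"  # non-breaking hyphen
    "\u2012"  # figure dash
    "\u2013"  # en dash
    "\u2014"  # em dash
    "\u2043"  # hyphen bullet
    "\u2212"  # minus sign
    "\ufe63"  # small hyphen-minus
    "\uff0d"  # fullwidth hyphen-minus
    "\\-"     # ASCII hyphen-minus, escaped so it is not a range operator
    "]+"
)


def to_ascii_hyphen(text):
    """Replace each run of hyphen-like characters with one ASCII '-' (hypothetical helper)."""
    return HYPHEN_LIKE.sub("-", text)


print(to_ascii_hyphen("AR\u2013L1003"))  # AR-L1003
```

Because the pattern ends in `+`, a run of several dashes collapses to a single '-', matching the behaviour of the one-liner above.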

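The standard re module does not support the \p{...} syntax for matching Unicode categories, but if you can install and import the third-party regex module, it is possible. Here \p{Pd} matches any character in the Unicode category Dash Punctuation: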

import regex

txt = regex.sub(r'\p{Pd}+', '-', txt)
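
If installing regex is not an option, a similar effect can be approximated with the standard library alone by asking `unicodedata` which code points belong to category Pd and building a character class from them. This is a minimal sketch under that assumption; the name `normalize_dashes` is hypothetical, and the exact set of characters depends on the Unicode version bundled with your Python:

```python
import re
import sys
import unicodedata

# Collect every code point whose general category is Pd (Dash Punctuation).
_PD_CHARS = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Pd"
]

# re.escape keeps the ASCII '-' from being read as a range operator.
_PD_RUN = re.compile("[" + "".join(re.escape(c) for c in _PD_CHARS) + "]+")


def normalize_dashes(text):
    """Replace each run of Pd characters with one ASCII '-' (hypothetical helper)."""
    return _PD_RUN.sub("-", text)


print(normalize_dashes("AR\u2013L1003"))  # AR-L1003
```

Note that category Pd does not cover every dash-like character: the minus sign (U+2212) is a math symbol and the hyphen bullet (U+2043) is classed as other punctuation, so add them to the class explicitly if you need them.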
trincot
  • Will `regex` replace `re`? –  Dec 12 '18 at 21:06
  • See [Add support for Matthew Barnett python regex module](https://github.com/firasdib/Regex101/issues/440). Also read what Guido van Rossum said on the subject [back in 2011](http://python.6.x6.nabble.com/Should-we-move-to-replace-re-with-regex-td1857882.html) – trincot Dec 12 '18 at 21:25