
I am unable to build a Regex that matches all possible strings of the format

{\some_text}

I tried to build a Regex myself, but I was unable to make it match ALL kinds of characters.

What I came up with: r"\{\\(.*)\}"

This did not work properly: it only matched {\~some_string}.

This is what I am trying to achieve:

text = "F.N. Freitas, C. Singulani, G. Vila-Verde, Linea Science Server,: The Dark Energy Survey Data Release 2. Ap._J._Supp._Ser. 255, (2021).Alam S., A. de Mattia, A. Tamone, S. {\' A}vila, J.A. Peacock, V. Gonzalez-Perez, A. Smith, A. Raichoor, A.J. Ross, J.E. Bautista, E. Burtin, J. Comparat, K.S. Dawson, H. du Mas des Bourboux, S. Escoffier, H. Gil-Mar{\'\i}n, S. Habib, K. Heitmann, J. Hou, F.G. Mohammad, E.M. Mueller, R. Neveux, R. Paviot, W.J. Percival, G. Rossi, V. Ruhlmann-Kleider, R. Tojeiro, M. Vargas Maga{\~n}a, C. Zhao, G.B. Zhao: The completed SDSS-IV extended Baryon Oscillation Spectroscopic Survey: N-body mock challenge for the eBOSS emission line galaxy sample. Mon._Not._R._Astron._Soc. 504, (2021).Alam S., J.A. Peacock, D.J. Farrow, J. Loveday, A.M. Hopkins: Using GAMA to probe the impact of small-scale galaxy physics on nonlinear redshift-space distortions. Mon._Not._R._Astron._Soc. 503, (2021).Alam S., M. Aubert, S. Avila, C. Balland, J.E. Bautista, M.A. Bershady, D. Bizyaev, M.R. Blanton, A.S. Bolton, J. Bovy, J. Brinkmann, J.R. Brownstein, E. Burtin, S. Chabanier, M.J. Chapman, P.D. Choi, C.H. Chuang, J. Comparat, M.C. Cousinou, A. Cuceu, K.S. Dawson, S. de la Torre, A. de Mattia, V.S. Agathe, H.M. des Bourboux, S. Escoffier, T. Etourneau, J. Farr, A. Font-Ribera, P.M. Frinchaboy, S. Fromenteau, H. Gil-Mar{\'\i}n, J.M. Le Goff, A.X. Gonzalez-Morales, V. Gonzalez-Perez, K. Grabowski, J. Guy, A.J. Hawken, J. Hou, H. Kong, J. Parker, M. Klaene, J.P. Kneib, S. Lin, D. Long, B.W. Lyke, A. de la Macorra, P. Martini, K. Masters, F.G. Mohammad, J. Moon, E.M. Mueller, A. Mu{\~n}oz-Guti{\'e}rrez, A.D. Myers, S. Nadathur, R. Neveux, J.A. Newman, P. Noterdaeme, A. Oravetz, D. Oravetz, N. Palanque-Delabrouille, K. Pan, R. Paviot, W.J. Percival, I. P{\'e}rez-R{\`a}fols, P. Petitjean, M.M. Pieri, A. Prakash, A. Raichoor, C. Ravoux, M. Rezaie, J. Rich, A.J. Ross, G. Rossi, R. Ruggeri, V. Ruhlmann-Kleider, A.G. S{\'a}nchez, F.J. S{\'a}nchez, J.R. 
S{\'a}nchez-Gallego, C. Sayres, D.P. Schneider, H.J. Seo, A. Shafieloo, A. Slosar, A. Smith, J. Stermer, A. Tamone, J.L. Tinker, R. Tojeiro, M. Vargas-Maga{\~n}a, A. Variu, Y. Wang, B.A. Weaver, A.M. Weijmans, C. Y{\`e}che, P. Zarrouk, C. Zhao, G.B. Zhao, Z. Zheng: Completed SDSS-IV extended Baryon Oscillation Spectroscopic Survey: Cosmological implications from two decades of spectroscopic surveys at the Apache Point Observatory. Physical_Review_D 103, (2021).Alam S., N.P. Ross, S. Eftekharzadeh, J.A. Peacock, J. Comparat, A.D. Myers, A.J. Ross: Quasars at intermediate redshift are not special; but they are often satellites. Mon._Not._R._Astron._Soc. 504, (2021).Alonso-Herrero A., S. Garc{\'\i}a-Burillo, S.F. H{\"o}nig, I. Garc{\'\i}a-Bernete, C. Ramos Almeida, O. Gonz{\'a}lez-Mart {'hallo}"


encodings = {
    "'": u'\u0300',
    "'\\": u'\u0301',
    "^": u'\u0302',
    "~": u'\u0303',
    "o":  u'\u00D8',
    "ss": 'ß'

}

# remove the encoding and replace it with its corresponding character
def repl(m):
    string = m.group()
    get_open_bracket_idx = string.find('{')
    get_close_bracket_idx = string.find('}')
    encoding = substring.substringByChar(
        string, startChar=string[get_open_bracket_idx + 1], endChar=string[get_close_bracket_idx - 2])
    string_content = string[get_close_bracket_idx - 1]
    string_and_encoding = encoding + string
    string_content = encodings.get(encoding, string_content) + string_content
    print()
    print(f'encoding: {encoding}')
    print(f'string content: {string_content}')
    print()
    return string_content


# This nearly works, but it also matches {'some_text}, which it shouldn't
changed_text = re.sub(r'\{\\?[^{}]*}', repl, text)
print(changed_text)

David Haase
  • Try `re.findall(r"\{\\[^{}]*}", s)`, see https://regex101.com/r/abn3Om/1 – Wiktor Stribiżew Sep 16 '21 at 13:36
  • Can you provide piece of python code which *only matched `{\~some_string}`*? – Daweo Sep 16 '21 at 13:37
  • @WiktorStribiżew This only matches {\~some_string} and {\`some_string} for me, but not, for example, {\'some_string} – David Haase Sep 16 '21 at 13:41
  • Please provide a code demo showing your problem. [Look here](https://regex101.com/r/abn3Om/2), it matches if there is a backslash. You probably use `s = "{\'some_string}"` and think you have a backslash in the string. It is [NOT so](https://ideone.com/B4COD7). Make sure you test against the right string (`s = r"{\'some_string}"`). Or, do you need to match `'` as an alternative after opening `{`? Then try `re.findall(r'\{[\\\'][^{}]*}', text)` – Wiktor Stribiżew Sep 16 '21 at 13:41
  • The problem is that a single backslash before `'` in a regular string literal is part of a string escape sequence, it does not make a literal backslash in the string. The ``\\`` must be in the pattern. – Wiktor Stribiżew Sep 16 '21 at 13:47
  • David, so what is your input? What is the problem? – Wiktor Stribiżew Sep 16 '21 at 13:56
  • Sorry, it took me a moment to edit my example. The code is far from perfect, but it illustrates what I want to achieve: mapping those character encodings to the characters they should represent. – David Haase Sep 16 '21 at 13:59
  • But do you understand that in the ``"{\'\i}"`` string literal, there is no ``\`` char? ``"{\'\i}"`` string literal represents ``{'\i}`` text. – Wiktor Stribiżew Sep 16 '21 at 14:17
  • Also, I get `NameError: name 'substring' is not defined`, see [demo](https://ideone.com/ieLOzg). – Wiktor Stribiżew Sep 16 '21 at 14:19
  • Sorry, I do not understand what your encodings mean, and your code does not run. – Wiktor Stribiżew Sep 16 '21 at 14:30
  • I am sorry for not explaining myself very well. What I am trying to achieve is to take some text, like the one I gave as an example, and write it to a PDF. But I don't want the special letters to be displayed in their encoded form. The encodings are the Unicode code points for the characters I want to convert. `substring` is not defined because it is a library, which I forgot to mention. I am honestly quite lost with this task because both regex and Python are quite new to me. – David Haase Sep 16 '21 at 14:34
  • So I want, for example, {'′^o} to display as ô, etc. – David Haase Sep 16 '21 at 14:37
  • The comment above your last comment is not helping. But the last one means your encodings are not complete. You need to add all variations of `c`, `a`, etc. – Wiktor Stribiżew Sep 16 '21 at 14:39
  • Yes, I do need to add more encodings, but not one for every letter: `string_content = encodings.get(encoding, string_content) + string_content` combines each letter with its correct encoding. – David Haase Sep 16 '21 at 14:45
  • It should look like https://ideone.com/IQuX9j, not exactly, you need to fix the mappings. – Wiktor Stribiżew Sep 16 '21 at 15:04
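
To make the string-literal point from the comments concrete, here is a minimal sketch. The variable names `plain`, `raw`, and `pattern` are illustrative and not taken from the question; the pattern is the one suggested in the first comment. It compares a regular string literal with a raw one:

```python
import re

# In a regular literal, \' is an escape sequence for a single quote,
# so the resulting string contains no backslash at all.
plain = "{\'some_string}"

# A raw literal keeps the backslash as an actual character.
raw = r"{\'some_string}"

# Pattern from the comments: "{", a literal backslash, anything but braces, "}".
pattern = re.compile(r"\{\\[^{}]*}")

print(repr(plain), pattern.findall(plain))
print(repr(raw), pattern.findall(raw))
```

Only the raw string contains a backslash, so only it can match a pattern that requires one. If the text is read from a file rather than typed as a literal, the backslashes are already real characters and no raw-string conversion is involved.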

1 Answer

You need a regex that matches {, captures any non-word chars into Group 1, and then captures a letter into Group 2 before the closing }. You can then check the group contents and build the replacement strings dynamically.

The regex will look like

\{([^\w\s]+|_)\s*(\w)}

See the regex demo. Details:

  • \{ - a { char
  • ([^\w\s]+|_) - Group 1: one or more chars that are neither word chars nor whitespace (e.g. \' or \~), or a single _
  • \s* - zero or more whitespaces
  • (\w) - Group 2: any word char
  • } - } char.
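
As a quick check of the breakdown above, this small sketch lists what each group captures. The names `pattern`, `sample`, and `groups` are illustrative only:

```python
import re

pattern = re.compile(r"\{([^\w\s]+|_)\s*(\w)}")

# Raw literal, so the backslashes are real characters in the string.
sample = r"{\' A}vila, Y{\`e}che, {'hallo}"

# Each match yields (Group 1: accent marker, Group 2: single word char).
groups = [m.groups() for m in pattern.finditer(sample)]
for marker, letter in groups:
    print(f"marker={marker!r} letter={letter!r}")
```

Note that `{'hallo}` produces no match here: after `'` the pattern allows exactly one word char before `}`, and `hallo` has five.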

Sample implementation in Python:

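import re

text = r"{\' A}vila,  Y{\`e}che, {'hallo}"

# LaTeX accent command -> Unicode combining mark
encodings = {
    "\\'": '\u0301',  # \' is the acute accent
    "\\`": '\u0300',  # \` is the grave accent
}

def repl(m):
    encoding = m.group(1)        # the accent command, e.g. \'
    string_content = m.group(2)  # the letter it applies to
    if encoding in encodings:
        # A combining mark follows its base letter
        return string_content + encodings[encoding]
    return string_content

changed_text = re.sub(r'\{([^\w\s]+|_)\s*(\w)}', repl, text)
print(changed_text)
# => Ávila,  Yèche, {'hallo}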
Wiktor Stribiżew
  • Thank you for your answer! This nearly works for me, but two problems remain. First, the string I get as input is not a raw string. I looked at how to convert a string to a raw string here [link](https://stackoverflow.com/questions/4415259/convert-regular-python-string-to-raw-string/4415585), but this does not seem to work. Second, formats with multiple letters after the encoding are not recognized by the regex. For example, {\ss} -> ß does not work. – David Haase Sep 17 '21 at 07:29
  • @DavidHaase If there is no backslash in the input, you cannot auto-magically conjure it. Adjust the code if your Group 1 is a mere `'` char then. We cannot help you, this is an issue on the data provider part. As for multiple letters, replace `\w` with `\w+`, i.e. `\{([^\w\s]+|_)\s*(\w+)}` – Wiktor Stribiżew Sep 17 '21 at 08:12
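
The `\w+` variant from the last comment could be combined with a lookup for named symbols such as `{\ss}`. The following is only a sketch under assumptions: the `ACCENTS` and `SYMBOLS` tables, the `decode` helper, and the use of `unicodedata.normalize` to compose letters and combining marks are illustrative additions, not part of the answer, and the tables are far from complete.

```python
import re
import unicodedata

# Assumed, incomplete mapping: LaTeX accent command -> combining mark.
ACCENTS = {
    "\\'": "\u0301",  # acute
    "\\`": "\u0300",  # grave
    "\\^": "\u0302",  # circumflex
    "\\~": "\u0303",  # tilde
    '\\"': "\u0308",  # diaeresis
}

# Assumed, incomplete mapping for named symbols such as {\ss}.
SYMBOLS = {"ss": "ß", "o": "ø", "O": "Ø"}

# Same pattern as in the answer, with \w+ so several letters are allowed.
PATTERN = re.compile(r"\{([^\w\s]+|_)\s*(\w+)}")

def repl(m):
    marker, letters = m.group(1), m.group(2)
    if marker == "\\":
        # A bare backslash followed by letters is a named symbol.
        return SYMBOLS.get(letters, m.group(0))
    if marker in ACCENTS:
        # Attach the combining mark to the first letter only.
        return letters[0] + ACCENTS[marker] + letters[1:]
    # Unknown marker: leave the text untouched.
    return m.group(0)

def decode(text):
    # NFC merges base letter + combining mark into one code point where possible.
    return unicodedata.normalize("NFC", PATTERN.sub(repl, text))

print(decode(r"S. {\' A}vila, Maga{\~n}a, Stra{\ss}e, {'hallo}"))
```

Unlike the answer's `repl`, this version returns the original text for unknown markers instead of dropping the braces, which makes unhandled cases (for example `{\'\i}`) easy to spot in the output.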