I have a 10,000-line source file with tons of duplication, so I read the file in as text.

Example:

    assert PyArray_TYPE(real0) == np.NPY_DOUBLE, "real0 is not double"
    assert real0.ndim == 1, "real0 has wrong dimensions"
    if not (PyArray_FLAGS(real0) & np.NPY_C_CONTIGUOUS):
        real0 = PyArray_GETCONTIGUOUS(real0)
    real0_data = <double*>real0.data

I want to replace all occurrences of this pattern with

    real0_data = _get_data(real0, "real0")

where real0 can be any variable name [a-z0-9]+


So don't get confused by the source code. The code itself doesn't matter; this is purely a text-processing and regex problem.

This is what I have so far:

    PATH = "func.pyx"
    source_string = open(PATH,"r").read()

    pattern = r"""
    assert PyArray_TYPE\(([a-z0-9]+)\) == np.NPY_DOUBLE, "([a-z0-9]+) is not double"
    assert ([a-z0-9]+).ndim == 1, "([a-z0-9]+) has wrong dimensions"
    if not (PyArray_FLAGS(([a-z0-9]+)) & np.NPY_C_CONTIGUOUS):
       ([a-z0-9]+) = PyArray_GETCONTIGUOUS(([a-z0-9]+))
    ([a-z0-9]+)_data = ([a-z0-9]+).data"""

    

siamii
  • are you using `vim` as your editor? – Fredrik Pihl Apr 18 '13 at 17:40
  • This question is a duplicate with the question: [How to input a regex in string.replace in python?](http://stackoverflow.com/questions/5658369/how-to-input-a-regex-in-string-replace-in-python) – perror Apr 18 '13 at 17:41
  • _"where real0 can be any variable name [a-z0-9]+"_ You mean where the variable name is _not_ `real0`? If that's the case, and you have to actually decide what variable to place there, you're asking the impossible... – Ben Apr 18 '13 at 17:42
  • @perror No, it isn't. This is a much more complex problem as you have a wildcard in the middle that you have to detect *and remember*. – Ant P Apr 18 '13 at 17:43
  • @Ben - not at all, extract the variable name from the line containing `` and use that in the substitution – Fredrik Pihl Apr 18 '13 at 17:43
  • @Ant You are right, thanks to your comment I realized it was different. But, then is it related to: [What refactoring tools do you use for Python?](http://stackoverflow.com/questions/28796/what-refactoring-tools-do-you-use-for-python) – perror Apr 18 '13 at 17:51
  • @siamii - Can you make it a little less ambiguous what pattern you want to match and what you want to replace it with? It's not at all clear at the minute. You want to replace the whole block of code with the one-line function call? – Ant P Apr 18 '13 at 17:59

1 Answer

You can do this in any text editor that supports multiline regular expression search and replace.

I used Komodo IDE to test this, because it includes an excellent regular expression tester ("Rx Toolkit") for experimenting with regular expressions. I think there are also some online tools like this. The same regular expression works in the free Komodo Edit. It should also work in most other editors that support Perl-compatible regular expressions.

In Komodo, I used the Replace dialog with the Regex option checked, to find:

    assert PyArray_TYPE\((\w+)\) == np\.NPY_DOUBLE, "\1 is not double"\s*\n\s*assert \1\.ndim == 1, "\1 has wrong dimensions"\s*\n\s*if not \(PyArray_FLAGS\(\1\) & np\.NPY_C_CONTIGUOUS\):\s*\n\s*\1 = PyArray_GETCONTIGUOUS\(\1\)\s*\n\s*\1_data = <double\*>\1\.data

and replace it with:

    \1_data = _get_data(\1, "\1")

Given this test code:

    assert PyArray_TYPE(real0) == np.NPY_DOUBLE, "real0 is not double"
    assert real0.ndim == 1, "real0 has wrong dimensions"
    if not (PyArray_FLAGS(real0) & np.NPY_C_CONTIGUOUS):
        real0 = PyArray_GETCONTIGUOUS(real0)
    real0_data = <double*>real0.data

    assert PyArray_TYPE(real1) == np.NPY_DOUBLE, "real1 is not double"
    assert real1.ndim == 1, "real1 has wrong dimensions"
    if not (PyArray_FLAGS(real1) & np.NPY_C_CONTIGUOUS):
        real1 = PyArray_GETCONTIGUOUS(real1)
    real1_data = <double*>real1.data

    assert PyArray_TYPE(real2) == np.NPY_DOUBLE, "real2 is not double"
    assert real2.ndim == 1, "real2 has wrong dimensions"
    if not (PyArray_FLAGS(real2) & np.NPY_C_CONTIGUOUS):
        real2 = PyArray_GETCONTIGUOUS(real2)
    real2_data = <double*>real2.data

The result is:

    real0_data = _get_data(real0, "real0")

    real1_data = _get_data(real1, "real1")

    real2_data = _get_data(real2, "real2")

So how did I get that regular expression from your original code?

  1. Prefix all instances of `(`, `)`, `.`, and `*` with `\` to escape them (an easy manual search and replace).
  2. Replace the first instance of `real0` with `(\w+)`. This matches and captures a run of word characters (letters, digits, and underscores).
  3. Replace the remaining instances of `real0` with `\1`. This matches the text captured by `(\w+)`.
  4. Replace each newline and the leading space on the next line with `\s*\n\s*`. This matches any trailing space on the line, plus the newline, plus all leading space on the next line. That way the regular expression works regardless of the nesting level of the code it's matching.

Finally, the "replace" text uses `\1` where it needs the original captured text.
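The four steps can also be mechanized. The following is only a rough sketch, assuming Python's `re` module; `pattern_from_sample` and `SAMPLE` are hypothetical names made up for this illustration. It leans on `re.escape` for step 1, so the generated pattern escapes a few more characters (spaces, for instance) than the hand-written one, but it should match the same text.

```python
import re

# One concrete block from the question, used as the template.
SAMPLE = '''\
    assert PyArray_TYPE(real0) == np.NPY_DOUBLE, "real0 is not double"
    assert real0.ndim == 1, "real0 has wrong dimensions"
    if not (PyArray_FLAGS(real0) & np.NPY_C_CONTIGUOUS):
        real0 = PyArray_GETCONTIGUOUS(real0)
    real0_data = <double*>real0.data
'''


def pattern_from_sample(sample: str, name: str) -> str:
    """Build a regex from one concrete block (hypothetical helper)."""
    # Drop the indentation; step 4 makes the pattern indentation-agnostic.
    lines = [line.strip() for line in sample.strip().splitlines()]
    # Step 1: escape regex metacharacters such as ( ) . and *.
    escaped = [re.escape(line) for line in lines]
    # Step 4: join lines with a separator that tolerates any indentation.
    joined = r"\s*\n\s*".join(escaped)
    # Step 2: the first occurrence of the name becomes a capturing group.
    head, _, tail = joined.partition(name)
    # Step 3: every later occurrence becomes a backreference to that group.
    return head + r"(\w+)" + tail.replace(name, r"\1")


pattern = re.compile(pattern_from_sample(SAMPLE, "real0"))
```

This only works as written because the variable name `real0` contains no regex metacharacters, so it survives `re.escape` untouched and can be found in the escaped text.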

You could of course use a similar regular expression in Python if you want to do it that way. I would suggest using `\w` instead of `[a-z0-9]` just to make it simpler (keep in mind that `\w` also matches uppercase letters and underscores). Also, don't put literal newlines and leading spaces in the pattern the way your multiline string does; use `\s*\n\s*` between the lines instead. That way the pattern is independent of the nesting level, as I mentioned above.
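For the Python route, a minimal sketch might look like the following. It reuses the find and replace strings from this answer unchanged and applies them with the standard `re` module; the `collapse_checks` name and the inline `sample` are assumptions made for illustration, not part of the original code.

```python
import re

# The "find" expression from above, split across adjacent raw strings
# (Python concatenates them) so each source line is visible.
PATTERN = re.compile(
    r'assert PyArray_TYPE\((\w+)\) == np\.NPY_DOUBLE, "\1 is not double"\s*\n\s*'
    r'assert \1\.ndim == 1, "\1 has wrong dimensions"\s*\n\s*'
    r'if not \(PyArray_FLAGS\(\1\) & np\.NPY_C_CONTIGUOUS\):\s*\n\s*'
    r'\1 = PyArray_GETCONTIGUOUS\(\1\)\s*\n\s*'
    r'\1_data = <double\*>\1\.data'
)

# The "replace" text; \1 expands to the captured variable name.
REPLACEMENT = r'\1_data = _get_data(\1, "\1")'


def collapse_checks(source: str) -> str:
    """Replace each five-line check block with one _get_data call."""
    return PATTERN.sub(REPLACEMENT, source)


sample = '''\
    assert PyArray_TYPE(real0) == np.NPY_DOUBLE, "real0 is not double"
    assert real0.ndim == 1, "real0 has wrong dimensions"
    if not (PyArray_FLAGS(real0) & np.NPY_C_CONTIGUOUS):
        real0 = PyArray_GETCONTIGUOUS(real0)
    real0_data = <double*>real0.data
'''
print(collapse_checks(sample))
```

Because the match starts at `assert`, the indentation in front of the block is left in place, so the replacement line keeps the nesting level of the code it replaces. To run this on the real file, read `func.pyx` into a string as in the question, pass it through `collapse_checks`, and review a diff before writing the result back.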

Michael Geary
  • This looks good, but is it possible to do this in python or online. I would rather not install yet another IDE. I'm using Eclipse with Pydev – siamii Apr 18 '13 at 18:28
  • Yes, I added a couple of suggestions for doing this with Python. You can probably do it directly in Eclipse as well. – Michael Geary Apr 18 '13 at 18:32
  • And as a general note, I recommend not limiting yourself to a single editor or IDE. Komodo Edit is free; it doesn't cost you anything to use it. So are several other good text editors. I find it very helpful to have more than one editor available; if one isn't good at a particular task I can easily switch to another for that task. For example, I use Komodo IDE for most of my work, but it doesn't handle extremely large files very well. So I use other editors such as UltraEdit for very large files. – Michael Geary Apr 18 '13 at 18:34
  • I updated the answer to correct one thing: I forgot that you can use `\1` in the regular expression (and not just in the replacement string) to match the first `(\w+)` group. This takes care of the problem I mentioned where it wasn't checking that all of the `realN` strings had the same value for `realN`. – Michael Geary Apr 18 '13 at 19:12