
I think this is two questions:

Q1) What pattern should I use to capture all `int` declarations in the source code?

  • The regex pattern I have come up with so far is: `[\(<\s]?(int)[\s>\)]?`
  • @WiktorStribiżew suggested in the comments matching `int` as a whole word. His solution is simpler than my first approach and does not cause the problem I ask about in Question 2.

Q2) How do I tell sed to replace only the word inside the capture group? I.e., replace `int` with `int32_t` but leave all other characters in the matching pattern untouched?

  • If I use `sed -i -E 's/[\(<\s]?(int)[\s>\)]?/int32_t/g' file.cpp`, it causes the following undesired effect: `func(int a)` -> `funcint32_ta)`
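One way to keep the surrounding characters is to capture them and write them back with back-references. The following is only a sketch under a few assumptions: `retype_int` is a made-up helper name, the delimiter classes are my own choice (any non-identifier character), and the sed in use supports `-E`.

```shell
#!/bin/sh
# Sketch: capture the characters on either side of `int` and write them back
# with \1 and \2, so only the keyword itself is replaced.
# retype_int is a hypothetical helper name, not part of sed.
retype_int() {
    # Group 1: start of line or a non-identifier character before `int`.
    # Group 2: a non-identifier character after `int`, or end of line.
    subst='s/(^|[^[:alnum:]_])int([^[:alnum:]_]|$)/\1int32_t\2/g'
    # Each match consumes its trailing delimiter, so in `f(int,int)` the
    # second `int` is skipped on the first pass; a second pass catches it.
    sed -E -e "$subst" -e "$subst"
}

printf '%s\n' 'func(int a);' 'f(int,int);' 'print(interval);' | retype_int
# func(int32_t a);
# f(int32_t,int32_t);
# print(interval);
```

Unlike the pattern with the optional `?` brackets, the delimiters here are mandatory, so `int` inside `print` or `interval` is left alone.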

Here are some examples of test cases:

  1. `int a;` -> `int32_t a;`
  2. `func(int a);` -> `func(int32_t a);`
  3. `template<int> a;` -> `template<int32_t> a;`
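The whole-word approach can be tried against these cases with a small script. This is a sketch that assumes GNU sed, since `\<` and `\>` are GNU word-boundary extensions; `whole_word_int` is a made-up helper name, and the last two input lines are extra cases I added for illustration.

```shell
#!/bin/sh
# Sketch, assuming GNU sed: \< and \> are GNU word-boundary extensions.
# whole_word_int is a hypothetical helper name.
whole_word_int() {
    sed -E 's/\<int\>/int32_t/g'
}

# The three cases from the list, plus two extra ones for illustration.
while IFS= read -r line; do
    printf '%s -> %s\n' "$line" "$(printf '%s\n' "$line" | whole_word_int)"
done <<'EOF'
int a;
func(int a);
template<int> a;
unsigned int n;
int print_interval(int internal);
EOF
```

Identifiers such as `print_interval` and `internal` are left alone because `int` is not a whole word there, while `unsigned int` becomes `unsigned int32_t`. The word `int` inside string literals or comments would still be rewritten, so the result needs to be reviewed.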

EDIT 1: To simplify, the solution can ignore `unsigned int`, i.e., if it leads to `unsigned int32_t`, that is okay for me.

EDIT 2: Some of you are asking why: this is a scientific computation application that will be deployed on some embedded processors. We want to guarantee our integer types are fixed at 32 bits.

EDIT 3: The source code is not large. In fact, I can (and probably will) verify each modification by hand. I just took the opportunity to learn more about regex.

rph
  • Are you trying to just match `int` as a whole word? `sed -i -E 's/\<int\>/int32_t/g' file.cpp`? – Wiktor Stribiżew Jul 18 '18 at 12:26
  • That's really hard because of edge cases, e.g. `operator++(int)` and `short int`. – Bathsheba Jul 18 '18 at 12:30
  • Suppose you have a problem. You then think "I know, I can use a regular expression!". You now have two problems, the original one and regular expressions. Or: there is an RE that will solve this problem almost always, but there is no RE that will solve this problem always; C++ tokenization cannot be done perfectly by anything short of the front end of a C++ compiler. How big is your code base? How dangerous are false positives? Negatives? – Yakk - Adam Nevraumont Jul 18 '18 at 12:31
  • EVIL: `#define int int32_t` – NathanOliver Jul 18 '18 at 12:37
  • Actually, less evil: do the `#define` as a cross-check. Compile to preprocessed output, and verify that you get the same output after doing your sed. – Gem Taylor Jul 18 '18 at 12:46
  • What about this: `char msg ="For example splitting internaly would give int ern aly"; cout << msg << endl;`? I wouldn't dare use naive text processing on a C++ source file. – Serge Ballesta Jul 18 '18 at 12:47
  • You can either use "non-capturing groups", `\b`, or capture your prefix and suffix expressions and then reference them in the output: `s/([\(<\s])(int)([\s>\)])/\1int32_t\3/g`. But the `[]?` syntax does you no favours, as it makes the wrapping characters optional. I'm also not sure `\s` works inside `[]`. – Gem Taylor Jul 18 '18 at 12:48
  • As usual there is this gnome in the back of my head, cackling madly and yelling, "but *why*?" -- The (optional!) precise-width types are useful for binary APIs; for virtually everything else, either the native types, the `*_leastX_t`, or `*_fastX_t` types are the better architectural choice. – DevSolar Jul 18 '18 at 12:56
  • Hmm. I would not use regexp for this. There are refactor tools to do the job. Otherwise [bad things](https://stackoverflow.com/a/1732454/8157187) happen. – geza Jul 18 '18 at 13:01
  • Thanks for all the comments. Please, refer to the edits I did in the question to clarify some things. – rph Jul 18 '18 at 13:29
  • This task is not doable by regex, period. You should use a proper tool for that. For example, you can dump the AST from LLVM and identify all places where `int` is used as a type. – Slava Jul 18 '18 at 13:41
  • @slava but even after you have "identified all the places" you still have to do the actual replacement, right? I agree that I would be very careful doing it on my own code. – Gem Taylor Jul 18 '18 at 13:58
  • @GemTaylor Right, a script of some sort is required, but at least its behavior should be predictable, or an existing refactoring tool can be used. Anyway, pure regex cannot parse the syntax of C++. – Slava Jul 18 '18 at 14:00
  • @Slava It is not like I am trying to parse the entire C++ with regex. I am just capturing a tiny subset. Plus, I imagine that all improper replacements will lead to compilation errors which are safe enough for my context. – rph Jul 18 '18 at 14:08
  • @rkioji Then look into your particular tiny subset and ask yourself: can this one be done with regex? By asking the question here you ask for this in general, and in general it cannot be done reliably. Will false positives create bad problems? That's another complicated question and I do not think it is worth the effort to answer that. – Slava Jul 18 '18 at 14:14
  • Btw, if you just need to do it once on a pretty small file, just put `int` as the search string in an editor (not sed), click search/replace and validate each case visually. – Slava Jul 18 '18 at 14:17
  • If you want to guarantee that `int` is not greater than 32 bits: why? Even embedded CPUs may perform best with bigger machine words, if not now then in the near future. If you want to guarantee that `int` doesn't accidentally have less than 32 bits, then `assert(sizeof (int) >= 4);` could be sufficient (e.g. as one of the first lines in `main()`). BUT if you want to guarantee that `int` has 32 bits even if the native machine word is smaller... Hmmm. ...then something has gone wrong in your development from the beginning. Maybe it would be better to check everything by eye (instead of `sed`). – Scheff's Cat Jul 18 '18 at 14:18
  • @Scheff The code was initially implemented assuming `int` was 32-bits. We want to guarantee the variables continue to be 32-bits across other archs. Using assertion will only make the code less portable. – rph Jul 18 '18 at 14:28
  • For me, this sounds like the part of my comment after BUT is relevant... ;-) ...and I know (very well) that this is a tedious job... – Scheff's Cat Jul 18 '18 at 14:31
  • I would recommend a cross-check strategy: generate a preprocessor dump of your modules before making the change, then use your regex to convert all the symbols to something very unique (e.g. `XXXintXXX`; search for it before using it), then `#define` your symbol to `int` and generate another preprocessor output. The two outputs should be close enough to use diff to spot any issues (like changed text in a string). Once you are happy with this, change the `#define` to `typedef int32_t` and build. Once you are happy with that, rename `XXXintXXX` to `int32_t`. – Gem Taylor Jul 18 '18 at 15:16
  • Note there are other special cases, like `unsigned int` will be difficult to detect. I would allow it to convert, then convert `unsigned\s*XXXintXXX` back, or to XXXuintXXX or uint32_t – Gem Taylor Jul 18 '18 at 15:19
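
The placeholder round trip from the last two comments could look roughly like the sketch below. It assumes GNU sed (for `\<` and `\>`); `to_placeholder`, `finalize`, `module.cpp` and `module_tagged.cpp` are made-up names, and the preprocessor step is shown only as comments because it needs a real source tree and a compiler (`-E` preprocesses only, `-P` suppresses line markers, `-D` defines a macro).

```shell
#!/bin/sh
# Sketch of the placeholder round trip, assuming GNU sed (\< and \>).
# to_placeholder and finalize are hypothetical helper names.

# Step 1: tag every whole-word `int`, then fold `unsigned XXXintXXX` into its
# own placeholder so the unsigned case can be mapped separately.
to_placeholder() {
    sed -E -e 's/\<int\>/XXXintXXX/g' \
           -e 's/\<unsigned[[:space:]]+XXXintXXX\>/XXXuintXXX/g'
}

# Step 2 (not run here; module.cpp and module_tagged.cpp are hypothetical):
#   g++ -E -P module.cpp > before.i
#   g++ -E -P -DXXXintXXX=int '-DXXXuintXXX=unsigned int' module_tagged.cpp > after.i
#   diff before.i after.i

# Step 3: once the diff looks right, rename the placeholders for good.
finalize() {
    sed -E -e 's/\<XXXuintXXX\>/uint32_t/g' -e 's/\<XXXintXXX\>/int32_t/g'
}

printf '%s\n' 'unsigned int n = 0;' 'int a;' | to_placeholder
# XXXuintXXX n = 0;
# XXXintXXX a;
printf '%s\n' 'unsigned int n = 0;' 'int a;' | to_placeholder | finalize
# uint32_t n = 0;
# int32_t a;
```

A string literal that contained the word `int` would show up in the diff as `XXXintXXX`, because macros are not expanded inside strings, which is exactly the kind of false positive the cross-check is meant to expose.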
