Suppose I have a sequence of strings that looks something like this:

1 10 46565 5968678 3 567 78

I would like to turn it into

F(1) F(10) F(46565) F(5968678) F(3) F(567) F(78)

Is there a regex one-liner that will accomplish that in Stata with an arbitrary number of elements?

I tried:

. display ustrregexra("1 10 46565 5968678 3 567 78","([:digit:]){1,}","XXX")
XXX XXX XXX XXX XXX XXX XXX

and

. display ustrregexra("1 10 46565 5968678 3 567 78","([:digit:]){1,}","F(&)")
F(&) F(&) F(&) F(&) F(&) F(&) F(&)

and

. display ustrregexra("1 10 46565 5968678 3 567 78","[0-9]{1,}","F(&)")
F(&) F(&) F(&) F(&) F(&) F(&) F(&)

In VI, this seems to do the trick:

.s/[0-9]\{1,}/F(&)/g

Is there any equivalent of that in Stata for the Unicode or vanilla regex functions? Stata's ustrregex* functions are based on the ICU regex engine, according to this comment by a StataCorp programmer.

dimitriy
  • As a *generic* regex, you can do `s/(\d+)/F(\1)/g` [demo](https://regex101.com/r/6BdZYI/1/) – dawg Aug 30 '18 at 03:46
  • The following works with the example at hand: `dis subinstr("F("+"1 10 46565 5968678 3 567 78"+")"," ", ") F(",.)` – Robert Picard Aug 30 '18 at 16:24
  • @RobertPicard this is not a regex. The OP asks for a regex specifically. –  Aug 30 '18 at 16:32
  • OK, here's a regex version: `dis ustrregexra("F("+"1 10 46565 5968678 3 567 78"+")"," ", ") F(")` – Robert Picard Aug 30 '18 at 16:36
  • @RobertPicard this is the same thing. It works but it is not a regex in the traditional sense. It is string substitution. Clever trick though. –  Aug 30 '18 at 16:48
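
For comparison, the generic back-reference substitution from the comments can be sketched outside Stata. The following is a minimal Python sketch, not Stata code, and it makes no claim about what Stata's regex functions accept; the function name `wrap_numbers` is made up for illustration.

```python
import re

def wrap_numbers(text: str) -> str:
    """Wrap every run of ASCII digits in F(...), leaving the rest untouched."""
    # ([0-9]+) captures one or more digits; \1 in the replacement
    # re-inserts the captured run, like & or \1 in vi/sed.
    return re.sub(r"([0-9]+)", r"F(\1)", text)

print(wrap_numbers("1 10 46565 5968678 3 567 78"))
# F(1) F(10) F(46565) F(5968678) F(3) F(567) F(78)
```

Because the substitution is applied to every match, it does not need to know how many elements the string contains.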

1 Answer

There are two problems here:

  1. Stata does not support regular expressions of the kind you mention.
  2. Its regular expression functions cannot handle substitutions such as F(\1).

There is only one way to do it in one (rather long) line:

clear
set obs 1

generate str = "1 10 46565 5968678 3 567 78"

local regex ([0-9]*)[ ]([0-9]*)[ ]([0-9]*)[ ]([0-9]*)[ ]([0-9]*)[ ]([0-9]*)[ ]([0-9]*)

generate new_str  = "F(" + regexs(1) + ") " + ///
                    "F(" + regexs(2) + ") " + ///
                    "F(" + regexs(3) + ") " + ///
                    "F(" + regexs(4) + ") " + ///
                    "F(" + regexs(5) + ") " + ///
                    "F(" + regexs(6) + ") " + ///
                    "F(" + regexs(7) + ")" if regexm(str, "`regex'")

. list, abbreviate(10)

     +--------------------------------------------------------------------------------+
     |                         str                                            new_str |
     |--------------------------------------------------------------------------------|
  1. | 1 10 46565 5968678 3 567 78   F(1) F(10) F(46565) F(5968678) F(3) F(567) F(78) |
     +--------------------------------------------------------------------------------+

You can obviously generalise this and make it a "true" one liner by writing a small program.
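
One way to picture that generalisation is to count the elements first and build the pattern with one capture group per element. The sketch below does this in Python rather than Stata, purely to show the logic; `wrap_by_groups` is a hypothetical name, and the input is assumed to contain only whitespace-separated non-negative integers.

```python
import re

def wrap_by_groups(text: str) -> str:
    """Build one capture group per element, then reassemble as F(...) terms."""
    tokens = text.split()              # split on any run of whitespace
    n = len(tokens)                    # number of elements in the input
    # One ([0-9]+) group per element, separated by single spaces.
    pattern = " ".join(["([0-9]+)"] * n)
    match = re.fullmatch(pattern, " ".join(tokens))
    if match is None:
        raise ValueError("expected only space-separated integers")
    # Groups are numbered from 1, like regexs(1) ... regexs(n).
    return " ".join(f"F({match.group(i)})" for i in range(1, n + 1))

print(wrap_by_groups("1 10 46565 5968678 3 567 78"))
# F(1) F(10) F(46565) F(5968678) F(3) F(567) F(78)
```

The pattern grows with the input, so nothing has to be hard-coded for seven elements.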


EDIT:

The following is a generalization that also exploits Robert's trick:

program define foo, rclass
local string `1'
local string = ustrregexra("`string'","\D"," ")
local string = ustrtrim(itrim("`string'"))
local string = ustrregexra("F("+"`string'"+")"," ", ") F(")
return local old_string `1'
return local new_string `string'
end

foo "1 10 46565 5968678 3 567 78"

return list

macros:
         r(new_string) : "F(1) F(10) F(46565) F(5968678) F(3) F(567) F(78)"
         r(old_string) : "1 10 46565 5968678 3 567 78"

foo "1xcvb10gh46565sdda5968678luiy3f567kl78"

return list

macros:
         r(new_string) : "F(1) F(10) F(46565) F(5968678) F(3) F(567) F(78)"
         r(old_string) : "1xcvb10gh46565sdda5968678luiy3f567kl78"
  • +1 I was aware of this, but it requires knowing the number of elements, which is something that I was not very clear about in the original post. I think your suggestion of a small program is the right one, where I can code that cardinality calculation in. I really thought the newer unicode regex engine would be able to handle this. – dimitriy Aug 30 '18 at 18:34
  • Thanks, and I agree: I have always found regex in Stata to be sub-par. I suspect the lack of detailed documentation plays an important role. In the end, I decided that instead of guessing the quirks of Stata's regex engine it is best to switch to Python. –  Aug 30 '18 at 19:50
  • I think creating a small program is trivial and the most flexible way forward. Robert's clever solution is fine for your use case here but in more complex cases such as a string of the kind `1xcvb10gh46565sdda5968678luiy3f567kl78` it cannot do what you want ***and*** keep the letters. This is because it relies on simple string substitution (in this case a space). –  Aug 30 '18 at 19:50
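
The difference between the two approaches on the mixed string can be sketched in Python (again, not Stata). `wrap_dropping_letters` follows the same three steps as `foo` above, while `wrap_keeping_letters` uses a back-reference; both names are made up for this illustration.

```python
import re

MIXED = "1xcvb10gh46565sdda5968678luiy3f567kl78"

def wrap_dropping_letters(text: str) -> str:
    """Same steps as foo: non-digits -> spaces, collapse, substitute spaces."""
    digits_only = re.sub(r"[^0-9]", " ", text)      # every non-digit becomes a space
    collapsed = " ".join(digits_only.split())       # trim and collapse runs of spaces
    return "F(" + collapsed.replace(" ", ") F(") + ")"

def wrap_keeping_letters(text: str) -> str:
    """Wrap each digit run in place, so the surrounding letters survive."""
    return re.sub(r"([0-9]+)", r"F(\1)", text)

print(wrap_dropping_letters(MIXED))
# F(1) F(10) F(46565) F(5968678) F(3) F(567) F(78)
print(wrap_keeping_letters(MIXED))
# F(1)xcvbF(10)ghF(46565)sddaF(5968678)luiyF(3)fF(567)klF(78)
```

The first function loses the letters because they are turned into separators before the wrapping happens; the second never removes anything from the input.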