
Using R:

  1. How do we calculate n for each use case?
    n = number of strings in the text

  2. How do we extract the strings from the text in each of the following use cases?

Text 1

Use case 1 (n=1)

Input: [AB]
Expected Output: (1x1 DF)
AB

Use case 2 (n=3)

Input: [AB],[BC],[A]
Expected Output: (3x1 DF)
AB
BC
A

Text 2

Use case 1 (n=1)

Input: "AB"
Expected Output: (1x1 DF)
AB

Use case 2 (n=2)

Input: "AB","B"
Expected Output: (2x1 DF)
AB
B

InVinci
  • `stringr::str_extract_all(c('[AB]', '[AB],[BC],[A]', '"AB"', '"AB","B"'), "[A-Z]+")` returns all of the strings you expect, so `lengths(.)` (around that) returns the number of substrings within each. – r2evans Feb 24 '21 at 14:36
  • Some good discussions on regular expressions (regex): https://stackoverflow.com/a/22944075/3358272 – r2evans Feb 24 '21 at 14:38
  • And https://stackoverflow.com/a/36695534/3358272 – r2evans Feb 24 '21 at 14:41
  • What if the use case looks like: [AB],[BC_zz],[C]? – InVinci Feb 24 '21 at 14:46
  • Or just: [CD_x] – InVinci Feb 24 '21 at 14:46
  • Use `"[A-Z_]+"` for the pattern. *Or* you can go the inverted route, excluding separators, perhaps `"[^][\"',]+"` (untested). – r2evans Feb 24 '21 at 14:53
  • InVinci, I don't know how better to "generalize" a solution: with regexes, "general" can be good but often results in more complicated patterns ... and more complicated means more fragile. The code I suggested addressed the test cases you have in your question. If you have other cases that are significantly different, then you need to demonstrate the differences in the OP, and not rely on me to infer and guess what else could be present in your real data. The best rule for regexes is "keep it simple": don't over-engineer unless/until you have a clear, well-bounded need. – r2evans Feb 24 '21 at 16:08
  • The general query is on the next comment by @ThomasIsCoding, that I asked for. Simple and Clear! Thanks for the downvote :) – InVinci Feb 24 '21 at 16:30
  • (1) I didn't give you that downvote, though I understand why you suspect that. (2) *"I asked for"* is not clear: nowhere did you demonstrate anything other than upper-case letters. I'm glad you received the help you need, but my first code perfectly addressed your use cases. The fact that it did not fit your underlying need reflects on how you communicated the problem constraints, not my suggested code. (3) *I don't care*, in the sense that this does not upset me. I get frustrated when comments suggest weaknesses in the solution that are based on secret constraints. – r2evans Feb 24 '21 at 16:43
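
For reference, the counting idea from the first comment can be sketched in base R. The comment itself uses `stringr::str_extract_all`; swapping in `regmatches()`/`gregexpr()` is an assumption made here to avoid the extra package, and the names `inputs`, `tokens`, and `x` are only illustrative.

```r
# The four inputs from the question
inputs <- c('[AB]', '[AB],[BC],[A]', '"AB"', '"AB","B"')

# Extract every run of upper-case letters from each input
tokens <- regmatches(inputs, gregexpr("[A-Z]+", inputs))

# n for each use case is the number of extracted strings
n <- lengths(tokens)
n
# [1] 1 3 1 2

# The follow-up example also contains lower-case letters and underscores,
# so a wider character class is assumed here (not taken from the comments)
x <- "[AB],[BC_zz],[C]"
regmatches(x, gregexpr("[A-Za-z_]+", x))[[1]]
# [1] "AB"    "BC_zz" "C"
```

Note that `[BC_zz]` contains lower-case letters, so a class limited to upper-case letters and underscores would likely stop before `zz`.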

1 Answer


Hope the code below is what you are after.


Assuming we have strings s1 and s2

s1 <- "[AB],[B_C],[A]"
s2 <- '"A_B","B","C"'

and we apply

data.frame(s1 = regmatches(s1, gregexpr("\\w+", s1))[[1]])
data.frame(s2 = regmatches(s2, gregexpr("\\w+", s2))[[1]])

to get

> data.frame(s1 = regmatches(s1, gregexpr("\\w+", s1))[[1]])
   s1
1  AB
2 B_C
3   A

> data.frame(s2 = regmatches(s2, gregexpr("\\w+", s2))[[1]])
   s2
1 A_B
2   B
3   C
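
If it helps, the same call can be wrapped in a small helper so that n falls out as the number of rows. This is only a sketch: the function name `extract_strings` and the column name `value` are made up for illustration, and it assumes every string consists only of letters, digits, or underscores (what `\w` matches).

```r
# Hypothetical helper: turn a delimited text into an n x 1 data frame
extract_strings <- function(text) {
  # \\w+ matches runs of letters, digits, and underscores
  matches <- regmatches(text, gregexpr("\\w+", text))[[1]]
  data.frame(value = matches, stringsAsFactors = FALSE)
}

df1 <- extract_strings("[AB],[BC],[A]")
nrow(df1)  # n for this use case
# [1] 3

df2 <- extract_strings('"AB","B"')
nrow(df2)
# [1] 2
```

With this shape, `nrow()` answers the first question (n) and the data frame itself answers the second.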
ThomasIsCoding