
I have a string that looks similar to JSON:

string <- "{'text': u'@RobertTekieli @Czerniakowianka @1234Mania mysle, ze nie weszlabym do zadnego wiezienia bez straznikow', 'created_at': u'Tue May 20 08:16:55 +0000 2014'}"

I want to extract the two strings that follow `text` and `created_at`:

@RobertTekieli @Czerniakowianka @1234Mania mysle, ze nie weszlabym do zadnego wiezienia bez straznikow

and

Tue May 20 08:16:55 +0000 2014

I want to do this with a regex, not with `fromJSON` or a similar function, but I don't know how. Any suggestions?

gagolews
jjankowiak
  • Take a look here http://stackoverflow.com/questions/2061897/parse-json-with-r – Vincent Beltman Dec 17 '14 at 08:12
  • Parsing a JSON string with regex can be very painful. Why don't you want to use the built-in functions? – hsz Dec 17 '14 at 08:12
  • That's still not valid JSON data. You really should try to get valid input. It will make life easier for you in the long run. – MrFlick Dec 17 '14 at 08:13
  • I know, that's why I want to treat it like a normal string. `stri_extract_all_regex(wiadomosc, "\\{'text': u'.*")` gives me the whole text. How can I "say" in regex that I want the match to end before 'created_at'? That would be satisfactory too. – jjankowiak Dec 17 '14 at 08:19

2 Answers


Use \K to drop the characters matched so far from the final result. \K resets the start of the reported match, so everything matched before it is left out of the overall regex match.

> string <- "{'text': u'@RobertTekieli @Czerniakowianka @1234Mania mysle, ze nie weszlabym do zadnego wiezienia bez straznikow', 'created_at': u'Tue May 20 08:16:55 +0000 2014'}"
> m <- gregexpr("'(?:text|created_at)':\\s+u'\\K[^']*", string, perl=TRUE)
> regmatches(string, m)
[[1]]
[1] "@RobertTekieli @Czerniakowianka @1234Mania mysle, ze nie weszlabym do zadnego wiezienia bez straznikow"
[2] "Tue May 20 08:16:55 +0000 2014" 

OR

> library(stringr)
> str_extract_all(string, perl("'(?:text|created_at)':\\s+u'\\K[^']*"))[[1]]
[1] "@RobertTekieli @Czerniakowianka @1234Mania mysle, ze nie weszlabym do zadnego wiezienia bez straznikow"
[2] "Tue May 20 08:16:55 +0000 2014"
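
If the value of one specific key is needed at a time, the same \K pattern can be wrapped in a small helper. The sketch below is only an illustration: `extract_field` is a hypothetical name, not part of base R or stringr, and it assumes that keys are plain words (no regex metacharacters) and that values contain no apostrophes.

```r
string <- "{'text': u'@RobertTekieli @Czerniakowianka @1234Mania mysle, ze nie weszlabym do zadnego wiezienia bez straznikow', 'created_at': u'Tue May 20 08:16:55 +0000 2014'}"

# Hypothetical helper: return the value stored under `key`,
# or character(0) when the key does not occur in `x`.
extract_field <- function(x, key) {
  # \K drops the "'key': u'" prefix from the reported match
  pattern <- paste0("'", key, "':\\s+u'\\K[^']*")
  m <- regexpr(pattern, x, perl = TRUE)
  regmatches(x, m)
}

# Named character vector with one element per key
fields <- sapply(c("text", "created_at"), extract_field, x = string)
print(fields[["created_at"]])
```

Because the result is indexed by key name, the order in which the keys appear in the input string does not matter.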


Avinash Raj
  • You are a true Regex ninja, @AvinasRaj. I would go with something like https://www.regex101.com/r/vM2vZ5/1 since `' u'` part is always fixed as well and wouldn't need a regex lookup. – Mehrad Dec 17 '14 at 08:33
(?<=text':\su')[^']+|(?<=created_at':\su')[^']+

You can try the pattern above, which uses a lookbehind for each key. See the demo:

https://regex101.com/r/eZ0yP4/27
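
A minimal sketch of how this lookbehind pattern might be used from R, assuming base R's `gregexpr` with `perl = TRUE`. The backslash in `\s` has to be doubled inside an R string literal; the variable names are only illustrative.

```r
string <- "{'text': u'@RobertTekieli @Czerniakowianka @1234Mania mysle, ze nie weszlabym do zadnego wiezienia bez straznikow', 'created_at': u'Tue May 20 08:16:55 +0000 2014'}"

# Each lookbehind has a fixed length, which PCRE requires;
# perl = TRUE is needed because the default engine has no lookbehind.
pattern <- "(?<=text':\\su')[^']+|(?<=created_at':\\su')[^']+"

m <- gregexpr(pattern, string, perl = TRUE)
result <- regmatches(string, m)[[1]]
print(result)
```

Since `\s` matches exactly one whitespace character here, this variant expects a single space between the colon and `u'`.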

vks