Extracting columns from "JSON" file

Question

I have a string that resembles a JSON file:

string <- "{'text': u'@RobertTekieli @Czerniakowianka @1234Mania mysle, ze nie weszlabym do zadnego wiezienia bez straznikow', 'created_at': u'Tue May 20 08:16:55 +0000 2014'}"

I want to extract the two strings that follow text and created_at:

@RobertTekieli @Czerniakowianka @1234Mania mysle, ze nie weszlabym do zadnego wiezienia bez straznikow

and

Tue May 20 08:16:55 +0000 2014

I want to do it with a regex, not with the fromJSON function or anything like that, but I don't know how. Any suggestions?

Take a look here http://stackoverflow.com/questions/2061897/parse-json-with-r — Vincent Beltman, Dec 17 '14 at 08:12
Parsing a JSON string with a regex can be very painful. Why don't you want to use built-in functions? — hsz, Dec 17 '14 at 08:12
That's still not valid JSON data. You really should try to get valid input. It will make life easier for you in the long run. — MrFlick, Dec 17 '14 at 08:13
I know, that's why I want to treat it like a normal string. `stri_extract_all_regex(wiadomosc, "\\{'text': u'.*")` gives me the whole text. How can I "say" in regex language that I want the match to end before 'created_at'? That would be satisfactory too. — jjankowiak, Dec 17 '14 at 08:19

Avinash Raj · Answer 1 · 2014-12-17T08:40:21.693

Use \K to drop the characters matched so far from the final result. \K keeps the text matched before it out of the overall regex match, so only the part after it is returned. It is a PCRE feature, hence perl=TRUE below.

> string <- "{'text': u'@RobertTekieli @Czerniakowianka @1234Mania mysle, ze nie weszlabym do zadnego wiezienia bez straznikow', 'created_at': u'Tue May 20 08:16:55 +0000 2014'}"
> m <- gregexpr("'(?:text|created_at)':\\s+u'\\K[^']*", string, perl=TRUE)
> regmatches(string, m)
[[1]]
[1] "@RobertTekieli @Czerniakowianka @1234Mania mysle, ze nie weszlabym do zadnego wiezienia bez straznikow"
[2] "Tue May 20 08:16:55 +0000 2014"

OR

> library(stringr)
> str_extract_all(string, perl("'(?:text|created_at)':\\s+u'\\K[^']*"))[[1]]
[1] "@RobertTekieli @Czerniakowianka @1234Mania mysle, ze nie weszlabym do zadnego wiezienia bez straznikow"
[2] "Tue May 20 08:16:55 +0000 2014"
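Since the question title mentions columns, here is a minimal sketch of how the same \K pattern could be applied to several such strings at once to build a data frame. The helper name `extract_field` and the two sample records are made up for illustration; like the pattern above, it assumes the values contain no apostrophes.

```r
# Hypothetical helper (not from the answer): return the value that follows
# a given key in each element of x, or NA where the key is not found.
extract_field <- function(x, key) {
  # Same idea as above: \K drops the "'key': u'" prefix from the match.
  pattern <- paste0("'", key, "':\\s+u'\\K[^']*")
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  # regmatches() returns only the elements that matched, so fill those slots.
  out[as.integer(m) != -1L] <- regmatches(x, m)
  out
}

# Made-up sample records in the same pseudo-JSON shape as the question.
records <- c(
  "{'text': u'first message', 'created_at': u'Tue May 20 08:16:55 +0000 2014'}",
  "{'text': u'second message', 'created_at': u'Wed May 21 09:00:00 +0000 2014'}"
)

tweets <- data.frame(
  text = extract_field(records, "text"),
  created_at = extract_field(records, "created_at"),
  stringsAsFactors = FALSE
)
print(tweets)
```

Using regexpr rather than gregexpr keeps one value per record, which is what a column needs; records without the key come back as NA instead of shifting the rows.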


You are a true Regex ninja, @AvinasRaj. I would go with something like https://www.regex101.com/r/vM2vZ5/1 since `' u'` part is always fixed as well and wouldn't need a regex lookup. — Mehrad, Dec 17 '14 at 08:33

score 2 · Accepted Answer · answered Dec 17 '14 at 08:30

vks · Answer 2 · score 2 · answered Dec 17 '14 at 08:30

You can try this pattern, which uses lookbehinds so that only the values are matched:

(?<=text':\su')[^']+|(?<=created_at':\su')[^']+

See the demo: https://regex101.com/r/eZ0yP4/27
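The pattern in this answer is written in plain regex syntax. A minimal sketch of how it might be used from base R, assuming the same `string` as in the question: the backslash has to be doubled inside an R string literal, and perl=TRUE is needed because lookbehinds are a PCRE feature.

```r
string <- "{'text': u'@RobertTekieli @Czerniakowianka @1234Mania mysle, ze nie weszlabym do zadnego wiezienia bez straznikow', 'created_at': u'Tue May 20 08:16:55 +0000 2014'}"

# Lookbehinds assert the "key': u'" prefix without including it in the match.
# Each lookbehind has a fixed length, which PCRE requires.
pattern <- "(?<=text':\\su')[^']+|(?<=created_at':\\su')[^']+"

m <- gregexpr(pattern, string, perl = TRUE)
values <- regmatches(string, m)[[1]]
print(values)
```

The matches come back in the order they appear in the string, so `values[1]` is the text and `values[2]` is the timestamp.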
