Regex that removes everything except specified string

Question

I am working with data that looks something like this:

{"score":0,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}
{"score":-1,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}
{"score":1,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}

The only information I am interested in is the "score":# part (the number may be positive or negative). Since I am working with thousands of lines like the ones above, I am trying to extract just the score using a regular expression.

I have consulted various posts, such as here, here and here, for example, but none of them seem to address my problem.

I have used them to try to write my own regular expression. Thus far, I have tried things such as:

(?!"score":(-)?[0-9])

^(?!"score":(-)?[0-9].*)

(.(?!"score":(-)?[0-9]))*

but each of these examples selects ALL of the information, including what I am interested in.

How can I modify these regular expressions to arrive at my desired result, which is:

"score":0
"score":-1
"score":1

These are JSON strings, and if they appear line by line, you can read the file line by line, parse the string and get the `score` value. Why use regex? — Wiktor Stribiżew, Sep 09 '15 at 08:50
I was trying to find a solution that would automatically remove all of the other information that I am not interested in. This information is one column in a rather large TSV file, which was why I wanted to isolate that information. — owwoow14, Sep 09 '15 at 08:51
Are you using a programming language for this? It would be a very simple thing to do with python, for example. Or are you just using a text editor? — melwil, Sep 09 '15 at 08:52
Use [`.*("score":[-+]?\d*\.?\d+).*`](https://regex101.com/r/qU8eU4/1), replace with `$1`. If you need no float number support, just `.*("score":[-+]?\d+).*` is enough. — Wiktor Stribiżew, Sep 09 '15 at 08:54
I was just using a text editor at the moment because I thought it would be something much simpler than it ended up being. — owwoow14, Sep 09 '15 at 08:55
@stribizhev it worked great. Why don't you put your proposal as an answer? — owwoow14, Sep 09 '15 at 08:56
@owwoow14: Posted, let me know if you need any more clarifications. I will go on drinking coffee :) — Wiktor Stribiżew, Sep 09 '15 at 09:03

Wiktor Stribiżew · Accepted Answer · 2015-09-09T09:16:05.307

Your regexps do not work as expected:

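(?!"score":(-)?[0-9]) matches the empty string at every position that is not immediately followed by "score": and an optionally negative digit
^(?!"score":(-)?[0-9].*) matches the empty string at the beginning of a line (whenever the line does not start with "score":)
(.(?!"score":(-)?[0-9]))* matches every character except the opening {, the only character that is followed by "score":<number>.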
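The behaviour of the three attempts can be checked with a short sketch. It assumes Python's re module (the question itself only used a text editor), and the variable names are illustrative only.

```python
import re

# One sample line from the question.
line = '{"score":0,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}'

# Attempt 1: a bare negative lookahead consumes nothing, so every match is empty.
spans1 = [m.span() for m in re.finditer(r'(?!"score":(-)?[0-9])', line)]
all_empty = all(start == end for start, end in spans1)
# Positions where the lookahead rejected a match (i.e. where "score":<digit> begins).
skipped = set(range(len(line) + 1)) - {start for start, _ in spans1}

# Attempt 2: anchored at line start, still zero-width.
span2 = re.search(r'^(?!"score":(-)?[0-9].*)', line).span()

# Attempt 3: consumes every character that is not followed by "score":<digit>.
longest3 = max(
    (m.group(0) for m in re.finditer(r'(.(?!"score":(-)?[0-9]))*', line)),
    key=len,
)

print(all_empty, skipped, span2, longest3 == line[1:])
```

In other words, a lookahead on its own only tests a position; it never selects the text to keep or to remove.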

You can use

.*("score":[-+]?\d*\.?\d+).*

See demo

Replace with $1.

If you need no float number support, just use

.*("score":[-+]?\d+).*

See another demo

The main idea is to match the whole line while capturing the substring we need ("score":<number>). Then we put the captured text back using the replacement string.

Here,

.* - matches any number of characters other than a newline (as many as possible)
("score":[-+]?\d*\.?\d+) - a capturing group that matches
- "score": - the text "score": literally
- [-+]? - an optional literal + or - (keep only the sign you need)
- \d*\.?\d+ - an integer or a float (with no thousands separators); in the integer-only variant, this is replaced with
- \d+ - a sequence of 1 or more digits.
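For anyone applying the same replacement from a script rather than a text editor, here is a minimal sketch in Python (an assumption, since the question did not name a language). The helper name keep_score is hypothetical. Note that Python's re.sub writes the backreference as \1, where the editor replacement uses $1.

```python
import re

# Same pattern as in the answer: match the whole line, capture only the score.
SCORE_LINE = re.compile(r'.*("score":[-+]?\d*\.?\d+).*')

def keep_score(line: str) -> str:
    """Replace the whole line with the captured "score":<number> part.

    Lines without a score are returned unchanged, because the pattern
    does not match them at all.
    """
    return SCORE_LINE.sub(r'\1', line)

lines = [
    '{"score":0,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}',
    '{"score":-1,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}',
    '{"score":1,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}',
]

for line in lines:
    print(keep_score(line))
```

Since . does not match a newline, a trailing line break is left untouched when the input is processed line by line.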

score 0 · Answer 2 · answered Sep 09 '15 at 08:56

I have created a development sample here: https://regex101.com/r/yL7hA9/1

It is:

"score":(-)?[0-9]+

Feel free to modify it to suit your requirements.
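Unlike the replace-based approach, this pattern matches only the score itself, so it suits extraction better than substitution. Here is a minimal sketch in Python (an assumption; the helper name extract_scores is hypothetical) that collects the matched text from each line:

```python
import re

# Pattern from this answer: "score": followed by an optionally negative integer.
SCORE = re.compile(r'"score":(-)?[0-9]+')

def extract_scores(lines):
    """Return the matched "score":<number> text for every line that has one."""
    found = []
    for line in lines:
        match = SCORE.search(line)
        if match:
            # group(0) is the whole match; group(1) would only be the sign.
            found.append(match.group(0))
    return found

sample = [
    '{"score":0,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}',
    '{"score":-1,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}',
    '{"score":1,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}',
]

for score in extract_scores(sample):
    print(score)
```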

Pieter21
