I am working with data that looks something like this:

{"score":0,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}
{"score":-1,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}
{"score":1,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}

The only information I am interested in is the "score":# part (the number can be positive or negative). Since I am working with thousands of lines like the ones above, I am trying to extract only the score using a regular expression.

I have consulted various posts, such as here, here and here, for example, but none of them seem to address my problem.

I have used them to try to write my own regular expression. Thus far, I have tried things such as:

(?!"score":(-)?[0-9])

^(?!"score":(-)?[0-9].*)

(.(?!"score":(-)?[0-9]))*

but each of these examples selects ALL of the information, including what I am interested in.

How can I modify these regular expressions to arrive at my desired result, which is:

"score":0
"score":-1
"score":1
owwoow14
  • Why not just match your desired text? – anubhava Sep 09 '15 at 08:48
  • These are JSON strings, and if they appear line by line, you can read the file line by line, parse the string and get the `score` value. Why use regex? – Wiktor Stribiżew Sep 09 '15 at 08:50
  • I was trying to find a solution that would automatically remove all of the other information that I am not interested in. This information is one column in a rather large TSV file, which was why I wanted to isolate that information. – owwoow14 Sep 09 '15 at 08:51
  • Are you using a programming language for this? It would be a very simple thing to do with python, for example. Or are you just using a text editor? – melwil Sep 09 '15 at 08:52
  • Use [`.*("score":[-+]?\d*\.?\d+).*`](https://regex101.com/r/qU8eU4/1), replace with `$1`. If you need no float number support, just `.*("score":[-+]?\d+).*` is enough. – Wiktor Stribiżew Sep 09 '15 at 08:54
  • I was just using a text editor at the moment because I thought it would be something much simpler than it ended up being. – owwoow14 Sep 09 '15 at 08:55
  • @stribizhev it worked great. Why don't you put your proposal as an answer? – owwoow14 Sep 09 '15 at 08:56
  • @owwoow14: Posted, let me know if you need any more clarifications. I will go on drinking coffee :) – Wiktor Stribiżew Sep 09 '15 at 09:03
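
For anyone who does have a programming language at hand, the JSON-parsing route suggested in the comments could look roughly like the Python sketch below. It assumes each line is a complete JSON object, as in the sample; the `lines` list is a stand-in for the TSV column, which is not shown in the question.

```python
import json

# Sample lines copied from the question; in practice they would come from the TSV column.
lines = [
    '{"score":0,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}',
    '{"score":-1,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}',
    '{"score":1,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}',
]

# json.loads tolerates the duplicate "words" key (the last value wins),
# so the score can be read directly from the parsed object.
scores = [json.loads(line)["score"] for line in lines]

print(scores)  # [0, -1, 1]
```

This yields the numbers themselves rather than the "score":# text, which may or may not be what is wanted.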

2 Answers

Your regexps do not work as expected:

  1. (?!"score":(-)?[0-9]) matches the empty string at every position that is not immediately followed by "score": and a number
  2. ^(?!"score":(-)?[0-9].*) matches the empty string at the beginning of a line, provided the line does not start with "score": and a number
  3. (.(?!"score":(-)?[0-9]))* matches every symbol but the opening {.
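
A quick way to check this behaviour is the Python sketch below (Python is my assumption here, the question itself used a text editor). It runs the first and third attempts against one of the sample lines and inspects what they actually match.

```python
import re

line = '{"score":-1,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}'

# Attempt 1: a bare negative lookahead consumes nothing, so every match is empty.
lookahead_only = re.compile(r'(?!"score":(-)?[0-9])')
matches = [m.group(0) for m in lookahead_only.finditer(line)]
all_empty = all(m == "" for m in matches)

# Attempt 3: consumes each character that is not directly followed by the score,
# which is the whole line except the opening brace.
tempered = re.compile(r'(.(?!"score":(-)?[0-9]))*')
longest = max((m.group(0) for m in tempered.finditer(line)), key=len)

print(all_empty)            # True
print(longest == line[1:])  # True
```

In other words, a negative lookahead only says where a match may not start or continue; it never selects the score text itself.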

You can use

.*("score":[-+]?\d*\.?\d+).*

See demo

Replace with $1.

If you need no float number support, just use

.*("score":[-+]?\d+).*

See another demo

The main idea is to match the whole line and capture the substring we need ("score":<number>). Then we put the captured text back with the replacement string, so everything else on the line is dropped.

Here,

  • .* - matches any number of any characters other than a newline
  • ("score":[-+]?\d*\.?\d+) - matches
    • "score": - "score": literally
    • [-+]? - an optional literal - or + (keep only the sign you need)
    • \d*\.?\d+ - matches floats or integers (with no thousand separators), or, in the second pattern,
    • \d+ - matches a sequence of 1 or more digits.
  • .* - matches the rest of the line, so that it gets removed as well
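
If you later move from the text editor to a script, the same match-and-replace idea can be sketched in Python as follows. This is only an illustration under that assumption; note that Python's `re` module writes the `$1` backreference as `\1`.

```python
import re

# Sample lines copied from the question.
lines = [
    '{"score":0,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}',
    '{"score":-1,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}',
    '{"score":1,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}',
]

# Match the whole line, capturing only the "score":<number> part in group 1.
pattern = re.compile(r'.*("score":[-+]?\d*\.?\d+).*')

# Replace each whole line with the captured group.
reduced = [pattern.sub(r'\1', line) for line in lines]

for item in reduced:
    print(item)
# "score":0
# "score":-1
# "score":1
```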
Wiktor Stribiżew
I have created a sample here: https://regex101.com/r/yL7hA9/1

The pattern matches only the text you want, rather than the whole line:

"score":(-)?[0-9]+

Feel free to modify it to your requirements.
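
As a rough illustration of using this pattern outside regex101, here is a Python sketch (my assumption, the question did not name a language). It collects whole matches with `finditer`, because `findall` would return only the contents of the `(-)` group.

```python
import re

# Sample data copied from the question, joined into one block of text.
text = "\n".join([
    '{"score":0,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}',
    '{"score":-1,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}',
    '{"score":1,"compare":0,"words":["book","planet","sun","science"],"words":[],"good":[],"bad":[]}',
])

pattern = re.compile(r'"score":(-)?[0-9]+')

# group(0) is the entire match, regardless of the capturing group around the minus sign.
found = [m.group(0) for m in pattern.finditer(text)]

print(found)  # ['"score":0', '"score":-1', '"score":1']
```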

Pieter21