Regex to match text between two delimiters?

Question

Here's an example of what I need to match, from a request that I have stored as text:

[{"id":"896","name":"TinyAuras","author_id":"654","author":"Kurisu</span></strong></span></a>","githubFolder":"https://github.com/xKurisu/TinyAuras/blob/master/TinyAuras.csproj","count":9,"countByChampion":{"":9,"total":9},"description":"(Beta) Aura/Buff/Debuff Tracker","udate":"1451971516","createdDays":375,"image":"https://cdn.joduska.me/forum/uploads/assemblydb/image-default.jpg","strudate":"2016-07-22 19:40","champions":null,"forum_link":"165574","assembly_compiles":true,"voted":false,"voted_champions":[]},

I want to select that link only up to the last folder (basically the GitHub folder, not the actual .csproj file).

I have a file full of thousands of those and I'm trying to extract all of those links and put them in a text file.

Here is what I have so far as a Perl-compatible regex: (?<=githubFolder":").*(?=\/.+\.csproj") but it ends up selecting more than I need after the first match. Any suggestions?

The issue is, I want everything right before this.csproj.

So in my example I want to extract: https://github.com/xKurisu/TinyAuras/blob/master/

Please share a few more example links so that the pattern can be identified. — Ambrish Pathak, Jan 14 '17 at 09:01
I added the working regex pattern to grab the URL; I just need to figure out how to select only up to this.csproj — Ben, Jan 14 '17 at 09:02
What about `sed 's/\(^.*\)[.]csproj["]$/\1/' file > newfile`? (You can remove the `["]` if there is no **double-quote** at the end.) You can add `"githubFolder":"` before the `\(` and remove the `^` if you need to get rid of `"githubFolder":"` — David C. Rankin, Jan 14 '17 at 09:08
Catches too much, I'll update the post with a bigger example. — Ben, Jan 14 '17 at 09:10
Don't you just want to extract the "githubFolder" from the JSON that's HTML highlighted? — choroba, Jan 14 '17 at 09:19

hashtable · Accepted Answer · 2017-04-08T00:06:37.507

2

This regex:

"githubFolder":"([^"]*/)[^"/]*"

selects:

https://github.com/xKurisu/TinyAuras/blob/master/

in your example.

However, it would likely be better to use an actual json parser as Jim D.'s answer suggests so you won't have to worry about spacing and special characters.
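
As a rough illustration of how the capturing group is meant to be used, here is a minimal Python sketch. The shortened `line` value and the helper name `extract_folders` are assumptions made for this example, not part of the original data. The key point is that the URL folder is in group 1; the whole match still includes the `"githubFolder":"` prefix and the file name, which is what a tool that prints entire matches will show.

```python
import re

# Pattern from the answer above: group 1 is everything up to and including
# the last slash; the file name after it is matched but not captured.
PATTERN = re.compile(r'"githubFolder":"([^"]*/)[^"/]*"')

# Shortened, single-line version of the sample record (illustrative only).
line = ('[{"id":"896","name":"TinyAuras",'
        '"githubFolder":"https://github.com/xKurisu/TinyAuras/blob/master/TinyAuras.csproj",'
        '"count":9}]')


def extract_folders(text):
    """Return group 1 (the folder part) for every githubFolder value in text."""
    return PATTERN.findall(text)


whole_match = PATTERN.search(line).group(0)  # includes the key and the file name
folders = extract_folders(line)              # only the captured folder URLs

print(whole_match)
print(folders)
```

Note that this sketch does nothing about JSON escapes such as `\/`; it only shows the difference between the full match and the captured group.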

edited Apr 08 '17 at 00:06

answered Jan 14 '17 at 09:06


You would move the trailing slash inside the capturing group to outside the capturing group like so: "githubFolder":"([^"]*)/[^"/]*" – hashtable Jan 14 '17 at 09:24
Your answer works when I test it on something like regexe; however, when I tested it with grep -o -P, here was my output: cat championswithGit.txt | grep -o -P 'githubFolder":"([^"]*/)[^"/]*' Output: githubFolder":"https:\/\/github.com\/ikkeflikkeri\/LeagueSharp\/blob\/master\/EasyCorki\/EasyCorki\/EasyCorki.csproj Any ideas? – Ben Jan 14 '17 at 09:34
1

Since we are parsing a JSON array of objects, one would expect that the string is JSON encoded, and as such could contain escaped quotes and other escape sequences that will need to be translated. JSON also allows for the insertion of white space between tokens. – JimD. Jan 14 '17 at 10:05
@Ben I think you've already solved this, but see http://stackoverflow.com/questions/1891797/capturing-groups-from-a-grep-regex – hashtable Apr 07 '17 at 23:58
@Jim D. Yes, it would probably be better to use an actual json parser as your answer suggests :) – hashtable Apr 08 '17 at 00:04

score 1 · Answer 2 · answered Jan 14 '17 at 10:30

While the accepted answer will likely get the job done here, I just want to point out that the old-school Linux tools are not easy to use if you want 100% accurate results when working with JSON. For that reason, it is best practice to use an actual JSON parser to extract your content.

One simple reason is that strings are JSON encoded, so you will need to decode them somehow to ensure you get the correct result. Another is that JSON is not a regular language; it is context free. In general, you will need something more powerful than regular expressions.

One I am familiar with is jq, and the array of JSON objects can be parsed as the OP desires like this:

$ jq -r ' .[] | .githubFolder ' foo
https://github.com/xKurisu/TinyAuras/blob/master/TinyAuras.csproj
https://github.com/xKurisu/"GiantAuras"/blob/master/GiantAuras.csproj
$

where file foo is

[
  {
    "id": "896",
    "name": "TinyAuras",
    "author_id": "654",
    "author": "Kurisu</span></strong></span></a>",
    "githubFolder": "https://github.com/xKurisu/TinyAuras/blob/master/TinyAuras.csproj",
    "count": 9,
    "countByChampion": {
      "": 9,
      "total": 9
    },
    "description": "(Beta) Aura/Buff/Debuff Tracker",
    "udate": "1451971516",
    "createdDays": 375,
    "image": "https://cdn.joduska.me/forum/uploads/assemblydb/image-default.jpg",
    "strudate": "2016-07-22 19:40",
    "champions": null,
    "forum_link": "165574",
    "assembly_compiles": true,
    "voted": false,
    "voted_champions": []
  },
  {
    "id": "888",
    "name": "\"GiantAuras\"",
    "author_id": "666",
    "author": "Astaire</span></strong></span></a>",
    "githubFolder": "https://github.com/xKurisu/\"GiantAuras\"/blob/master/GiantAuras.csproj",
    "count": 90,
    "countByChampion": {
      "": 777,
      "total": 42
    },
    "description": "(Stable) Aura/Buff/Debuff Tracker",
    "udate": "1451971517",
    "createdDays": 399,
    "image": "https://cdn.joduska.me/forum/uploads/assemblydb/image-default.jpg",
    "strudate": "2016-07-22 19:40",
    "champions": null,
    "forum_link": "165574",
    "assembly_compiles": true,
    "voted": false,
    "voted_champions": []
  }
]
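
For comparison, the same parser-based idea can be sketched in Python with the standard `json` module. This is only an illustration under assumptions: the function name `github_folders` and the trimmed `sample` text are made up for the example, and every `githubFolder` value is assumed to be a string. The parser decodes escapes such as `\/` and `\"` before the file name is cut off at the last slash.

```python
import json


def github_folders(json_text):
    """Parse a JSON array of objects; return each githubFolder up to its last slash."""
    records = json.loads(json_text)
    folders = []
    for record in records:
        url = record.get("githubFolder")
        if not url:
            continue  # skip records that lack the field
        # Keep everything up to and including the final "/".
        head, sep, _ = url.rpartition("/")
        folders.append(head + sep)
    return folders


# Trimmed sample: the first URL uses the optional "\/" escape,
# the second contains escaped double quotes.
sample = r'''[
  {"name": "TinyAuras",
   "githubFolder": "https:\/\/github.com\/xKurisu\/TinyAuras\/blob\/master\/TinyAuras.csproj"},
  {"name": "\"GiantAuras\"",
   "githubFolder": "https://github.com/xKurisu/\"GiantAuras\"/blob/master/GiantAuras.csproj"}
]'''

for folder in github_folders(sample):
    print(folder)
```

Because decoding happens in the parser, no separate step is needed to strip backslashes from the result.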

Well, that is useful. I ended up doing this: cat championswithGit.txt | grep -oP '"githubFolder":"([^"]*/)[^"/]*' | grep -oP '.*(?=\/.+\.csproj)' | grep -oP '(?<="githubFolder":").*' | sed 's/\\//g' — Ben, Jan 14 '17 at 10:36
@Ben I think the `\/` comes from the fact that the solidus (`/`) may optionally be escaped in a JSON string as `\/`, which is more or less what I was trying to point out. This will probably solve your problem, but then someday there will be an escaped backslash and you have to fix that, and then another day a unicode escape sequence turns up... — JimD., Jan 14 '17 at 10:53

score 0 · Answer 3 · answered Jan 14 '17 at 09:11

0

Here is the regexp:

("githubFolder":".*)\/(.*\.csproj)

1. "githubFolder":"https://github.com/removed/removed/blob/master/stophere/this.csproj      
    1.1. Group: "githubFolder":"https://github.com/removed/removed/blob/master/stophere
    1.2. Group: this.csproj

You can test it here: http://www.regexe.com

answered Jan 14 '17 at 09:11

Dmitry Shilyaev


Don't use dot matching, it will end up wrong if there is another csproj later in the string... – Lucero Jan 14 '17 at 10:12
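
To make the concern about dot matching concrete, here is a small Python sketch. The two-record `line` is invented for illustration (the repository names are hypothetical), and the `bounded` pattern is just one possible variation that uses negated character classes. When several records share one line, the greedy `.*` runs to the last `.csproj` on that line and swallows everything in between.

```python
import re

# Two records on a single line (hypothetical repository names).
line = ('{"githubFolder":"https://github.com/a/one/blob/master/One.csproj","count":1},'
        '{"githubFolder":"https://github.com/b/two/blob/master/Two.csproj","count":2}')

# Pattern from this answer: ".*" is greedy and can cross quote boundaries.
greedy = re.compile(r'("githubFolder":".*)\/(.*\.csproj)')

# Variation that cannot cross a double quote (an assumption, not the answer's regex).
bounded = re.compile(r'"githubFolder":"([^"]*)/([^"/]*\.csproj)"')

greedy_hits = [m.group(1) for m in greedy.finditer(line)]
bounded_hits = [m.group(1) for m in bounded.finditer(line)]

print(len(greedy_hits), "greedy match(es)")
print(bounded_hits)
```

The greedy pattern yields a single match spanning both records, while the bounded one yields one folder per record.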

unbl0ck3r · Answer 4 · 2017-01-14T09:15:01.920

0

This pattern: (http|https):\/\/github\.com\/[\w\/]+\/ selects all directories that start with github.com in your example.

edited Jan 14 '17 at 09:15

answered Jan 14 '17 at 09:14


This assumes a bit too much in the naming convention for the githubs. I updated the OP – Ben Jan 14 '17 at 09:15

score 0 · Answer 5 · answered Jan 14 '17 at 09:15

0

Try this RegEx:

githubFolder":"([a-zA-Z:\/.]+\/)

It will capture the link up to the last slash in a group.

answered Jan 14 '17 at 09:15

Tittu Thomas

