grepping out invalid URIs

Question

I have DBpedia's N-Triples files. Some of them contain non-absolute URIs, i.e. URIs that don't start with http://. This causes problems when parsing.

For example, some triples have URIs like <www.example.com> instead of <http://www.example.com>.

I'd like to grep them out by negating them.

I tried grep -v "^(<http)", without success.

Any suggestions?

Edit

I probably explained myself badly. These URIs aren't necessarily at the beginning of the line. My mistake was using the '^' operator as NOT. Also, I want to grep them out, with grep -v.

These are some sample lines:

<http://dbpedia.org/resource/Petrodvorets_Watch_Factory> <http://xmlns.com/foaf/0.1/homepage> <www.raketa.su> .

<http://dbpedia.org/resource/ABS_network> <http://xmlns.com/foaf/0.1/homepage> <www.absn.tv> .

score 2 · Answer 1 · answered Mar 03 '11 at 20:05


grep -P '^(?!<http).*'

(?!...) is a negative lookahead. I did not test it, so if it does not work, search the web for 'regex negative lookahead'; that should do the job.


Markus1189


That indeed does not work. I checked out [this answer](http://stackoverflow.com/questions/1749437/regular-expression-negative-lookahead), which makes me think your guess is correct. I can't understand what's going on. – marcorossi Mar 04 '11 at 16:57
That is strange, I tested it on a file I wrote and it worked... Maybe you can post a dummy file to test with? – Markus1189 Mar 04 '11 at 18:17
One question: is every URL on its own line, or can there be more than one per line? If I copy and paste your URLs as lines (1 URL per line) and save it to test > grep -P '^(?! – Markus1189 Mar 05 '11 at 08:34
Well, that changes a lot. Can you just put one URL per line? Doing this would allow you to use the grep command. At the moment I do not know a way to do it when three URLs are on one line, sorry. To move the URLs to single lines, I would use Vim: in normal mode press qa (record a macro into register 'a'), then '0', f_ (_ is a space), 'x', 'i', 'ENTER', 'ESC', f_ (_ is a space), 'x', 'i', 'ENTER', f_ (_ is a space), 'd$', 'j', then 'q'. – Markus1189 Mar 05 '11 at 16:34
I found the solution: grep -P '<(?!http).*>'. The '<' marks the beginning of the pattern and is literal. Then you require that it is NOT followed by http and that the bracket is closed later. Thanks for getting me started on this one! – marcorossi Mar 08 '11 at 10:45

score 1 · Accepted Answer · answered Mar 08 '11 at 12:20


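To handle multiple URIs per line, a regex that works as a starting point is:

grep -P '<(?!http(s)?:\/\/).*>'

It matches every line that contains a '<' not immediately followed by http:// or https://, wherever on the line it occurs. Adding -v inverts the match and greps those lines out.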
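For anyone who wants to try the idea end to end, here is a minimal sketch. It assumes GNU grep built with PCRE support (-P). The function name drop_relative_uris and the example.org triple are made up for illustration; the first sample line is taken from the question. Note that a literal containing '<' would also be dropped, so treat this as a rough filter rather than an N-Triples validator.

```shell
# Hypothetical helper: keep only lines in which every '<' is followed by
# http:// or https://. Reads from stdin, writes surviving lines to stdout.
drop_relative_uris() {
  # -P: Perl-compatible regex, needed for the (?!...) negative lookahead.
  # -v: invert the match, i.e. drop lines that contain an offending '<'.
  grep -vP '<(?!https?://)'
}

# First line is from the question; the second is an invented, fully absolute triple.
sample='<http://dbpedia.org/resource/Petrodvorets_Watch_Factory> <http://xmlns.com/foaf/0.1/homepage> <www.raketa.su> .
<http://example.org/s> <http://example.org/p> <http://example.org/o> .'

# Only the example.org line should be printed.
printf '%s\n' "$sample" | drop_relative_uris
```

In practice this would be run as something like drop_relative_uris < input.nt > cleaned.nt, where both file names are placeholders.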


marcorossi


score 0 · Answer 3 · answered Mar 03 '11 at 19:43


"^(<http)" would only match if "<http" is at the beginning of the line. Is that true in your case?
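A small sketch of why the original attempt cannot work, using one of the sample lines from the question. It relies only on standard grep behaviour: in the default basic regex syntax the parentheses are literal characters, and in extended syntax (-E) the '^' anchors the group to the start of the line. The variable names are just for illustration.

```shell
line='<http://dbpedia.org/resource/ABS_network> <http://xmlns.com/foaf/0.1/homepage> <www.absn.tv> .'

# Basic regex (grep default): "(" and ")" are literal, so the pattern looks for
# a line starting with "(<http)". Nothing matches, and -v keeps every line.
kept_bre="$(printf '%s\n' "$line" | grep -v "^(<http)")"

# Extended regex: the group matches "<http" at the start of the line, so -v
# drops the line because of its first URI, regardless of the later <www...> one.
# "|| true" only absorbs grep's non-zero exit status when nothing is printed.
kept_ere="$(printf '%s\n' "$line" | grep -vE "^(<http)" || true)"

printf 'BRE kept: %s\nERE kept: %s\n' "$kept_bre" "$kept_ere"
```

Either way the pattern only ever looks at the start of the line, so it says nothing about a relative URI in the object position.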


neontapir
