0

I have DBpedia's N-Triples files. Some of them contain non-absolute URIs, that is, URIs that don't start with http://. This causes problems when parsing.

For example, I have some triples with URIs like <www.example.com> instead of <http://www.example.com>.

I'd like to filter out the lines that contain them, using a negated grep.

I tried grep -v "^(<http)", but it did not work.

Any suggestions?

Edit

I probably explained this badly. These URIs aren't necessarily at the beginning of the line. My mistake was using the '^' operator as a NOT. Also, I want to filter the offending lines out, with grep -v.

These are some sample lines:

<http://dbpedia.org/resource/Petrodvorets_Watch_Factory> <http://xmlns.com/foaf/0.1/homepage> <www.raketa.su> .

<http://dbpedia.org/resource/ABS_network> <http://xmlns.com/foaf/0.1/homepage> <www.absn.tv> .

marcorossi

3 Answers

2
grep -P '^(?!<http).*'

(?!...) is a negative lookahead. I did not test it, so if that does not work, search the web for 'regex negative lookahead'; that should do the job.
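As a rough check of what the anchored lookahead does, the sketch below runs it against the two sample lines from the question. It assumes GNU grep built with PCRE support (the -P option); the temporary file and the variable names are made up for illustration.

```shell
# Assumption: GNU grep with -P (PCRE) support is available.
sample=$(mktemp)   # hypothetical throwaway file

# The two sample lines from the question; both START with an absolute URI.
printf '%s\n' \
  '<http://dbpedia.org/resource/Petrodvorets_Watch_Factory> <http://xmlns.com/foaf/0.1/homepage> <www.raketa.su> .' \
  '<http://dbpedia.org/resource/ABS_network> <http://xmlns.com/foaf/0.1/homepage> <www.absn.tv> .' \
  > "$sample"

# Count the lines that do NOT begin with "<http".
# grep exits with status 1 when nothing matches, hence the "|| true".
anchored_hits=$(grep -cP '^(?!<http).*' "$sample" || true)
echo "$anchored_hits"   # prints 0

rm -f "$sample"
```

Because of the '^', the lookahead is only evaluated at the start of each line, so a relative URI in the object position goes unnoticed. The pattern is only useful when every URI sits on its own line.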

Markus1189
  • That does indeed not work. I checked out [this answer](http://stackoverflow.com/questions/1749437/regular-expression-negative-lookahead), which makes me think your guess is correct. I can't understand what's going on. – marcorossi Mar 04 '11 at 16:57
  • That is strange, I tested it on a small file I wrote and it worked... Maybe you can post a dummy file to test it with? – Markus1189 Mar 04 '11 at 18:17
  • One question: is every URL on its own line, or can there be more than one per line? If you copy-paste your URLs as lines (1 URL per line) and save it to test > grep -P '^(?! – Markus1189 Mar 05 '11 at 08:34
  • Well, that changes a lot. Can you just put 1 URL on each line? Doing this would allow you to use the grep command. At the moment I do not know a way to do it when 3 URLs are on 1 line... sorry. To move the URLs to single lines, I would use Vim: in normal mode press qa (record a macro into register 'a') --- then '0', f_ (_ is space), 'x', 'i', 'ENTER', 'ESC', f_ (_ is space), 'x', 'i', 'ENTER', f_ (_ is space), 'd$', 'j', then 'q' ----- – Markus1189 Mar 05 '11 at 16:34
  • I found the solution: grep -P '<(?!http).*>'. The '<' marks the beginning of the pattern and is matched literally. The lookahead then requires that it is NOT followed by http, and the '>' closes the match later on the line. Thanks for getting me started on this one! – marcorossi Mar 08 '11 at 10:45
1

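To handle multiple URIs per line, a regex that works as a starting point is:

grep -P '<(?!http(s)?:\/\/).*>'

This prints the lines that contain at least one URI not starting with http:// or https://. Add -v to drop those lines instead.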
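A minimal sketch of how this could be applied to the sample lines from the question, once to list the offending triples and once with -v to keep only the clean ones. It assumes GNU grep with PCRE support (-P). The third input line, the temporary file, and the variable names are hypothetical, added only so that something survives the filter. The trailing `.*>` is left out here because the lookahead alone already decides whether a line matches.

```shell
# Assumption: GNU grep with -P (PCRE) support is available.
sample=$(mktemp)   # hypothetical throwaway file

# First two lines are from the question; the third is an invented
# well-formed triple so that the -v run has something to keep.
printf '%s\n' \
  '<http://dbpedia.org/resource/Petrodvorets_Watch_Factory> <http://xmlns.com/foaf/0.1/homepage> <www.raketa.su> .' \
  '<http://dbpedia.org/resource/ABS_network> <http://xmlns.com/foaf/0.1/homepage> <www.absn.tv> .' \
  '<http://dbpedia.org/resource/Example> <http://xmlns.com/foaf/0.1/homepage> <https://www.example.com> .' \
  > "$sample"

# Lines containing a "<" that is not followed by http:// or https://
bad_lines=$(grep -P '<(?!https?://)' "$sample")

# The same test negated: only lines where every "<" opens an absolute URI
clean_lines=$(grep -vP '<(?!https?://)' "$sample")

printf 'bad:\n%s\n\nclean:\n%s\n' "$bad_lines" "$clean_lines"
rm -f "$sample"
```

One caveat: the pattern looks at every '<' on the line, so a '<' inside a string literal would also cause the line to be dropped. Whether that matters depends on the dump being filtered.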

marcorossi
0

"^(<http)" would only match if "<http" is at the beginning of the line. Is that true in your case?
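To see what the original attempt does in practice, here is a small sketch run on the sample lines from the question. The temporary file and variable names are invented for illustration. Note that without -E, grep uses basic regular expressions, where unescaped parentheses are literal characters, so the pattern looks for lines starting with the text "(<http)".

```shell
sample=$(mktemp)   # hypothetical throwaway file

# The two sample lines from the question.
printf '%s\n' \
  '<http://dbpedia.org/resource/Petrodvorets_Watch_Factory> <http://xmlns.com/foaf/0.1/homepage> <www.raketa.su> .' \
  '<http://dbpedia.org/resource/ABS_network> <http://xmlns.com/foaf/0.1/homepage> <www.absn.tv> .' \
  > "$sample"

# Basic regex: "(" and ")" are literal, no line starts with "(<http)",
# so -v keeps every line. -c counts the lines that are kept.
bre_kept=$(grep -vc '^(<http)' "$sample" || true)

# Extended regex: the parentheses group, every line starts with "<http",
# so -v drops every line (grep then exits 1, hence the "|| true").
ere_kept=$(grep -vEc '^(<http)' "$sample" || true)

echo "kept with basic regex:    $bre_kept"   # prints 2
echo "kept with extended regex: $ere_kept"   # prints 0
rm -f "$sample"
```

Either way the anchored pattern only inspects the subject at the start of the line, which is why it cannot single out a relative URI in the object position.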

neontapir