
Background:

  • I have a directory called "stuff" with 26 files (2 .txt and 24 .rtf) on Mac OS 10.7.5.
  • I'm using grep (GNU v2.5.1) to find all strings within these 26 files that match the structure of a URL, then print them to a new file (output.txt).
  • The regex below does work on a small scale. I ran it on a directory with 3 files (1 .rtf and 2 .txt) with a bunch of dummy text and 30 URLs, and it executed successfully in less than 1 second.

I am using the following regex:

1

grep -iIrPoh 'https?://.+?\s' . --include=*.txt --include=*.rtf > output.txt   
 
 

Problem

The current size of my directory "stuff" is 180 KB with 26 files. In terminal, I cd to this directory (stuff) then run my regex. I waited about 15 minutes and decided to kill the process as it did NOT finish. When I looked at the output.txt file, it was a whopping 19.75GB (screenshot).

Question

  1. What could be causing the output.txt file to be so many orders of magnitude larger than the entire directory?
  2. What more could I add to my regex to streamline the processing time?

Thank you in advance for any guidance you can provide here. I've been working on many different variations of my regex for almost 16 hours, and have read tons of posts online but nothing seems to help. I'm new to writing regex, but with a small bit of hand holding, I think I'll get it.

Additional Comments

I ran the following command to see what was being recorded in the output.txt (19.75 GB) file. It looks like the regex is finding the right strings, with the exception of what I think are odd characters, like curly braces } { and a string like {\fldrslt

    **TERMINAL**
    $ head -n 100 output.txt
    http://michacardenas.org/\
    http://culturelab.asc.upenn.edu/2013/03/06/calling-all-wearable-electronics-hackers-e-textile-makers-and-fashion-activists/\
    http://www.mumia-themovie.com/"}}{\fldrslt 
    http://www.mumia-themovie.com/}}\
    http://www.youtube.com/watch?v=Rvk2dAYkHW8\
    http://seniorfitnesssite.com/category/senior-fitness-exercises\
    http://www.giac.org/ 
    http://www.youtube.com/watch?v=deOCqGMFFBE"}}{\fldrslt 
    http://www.youtube.com/watch?v=deOCqGMFFBE}}
    https://angel.co/jason-a-hoffman\
    https://angel.co/joyent?save_req=mention_slugs"}}{\fldrslt 
    http://www.cooking-hacks.com/index.php/ehealth-sensors-complete-kit-biometric-medical-arduino-raspberry-pi.html"}}{\fldrslt 
    http://www.cooking-hacks.com/index.php/ehealth-sensors-complete-kit-biometric-medical-arduino-raspberry-pi.html}} 
    http://www.cooking-hacks.com/index.php/documentation/tutorials/ehealth-biometric-sensor-platform-arduino-raspberry-pi-medical"}}{\fldrslt 
    http://www.cooking-hacks.com/index.php/documentation

Catalog of regex commands tested so far

2

grep -iIrPoh 'https?://\S+' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / produced blank file (output_2.txt)

3

grep -iIroh 'https?://\S+' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / produced blank file (output_3.txt)

4

grep -iIrPoh 'https?://\S+\s' . --include=*.txt --include=*.rtf > sixth.txt
FAIL: took 1 second to run / produced blank file (output_4.txt)

5

grep -iIroh 'https?://' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / produced blank file (output_5.txt)

6

grep -iIroh 'https?://\S' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / produced blank file (output_6.txt)

7

grep -iIroh 'https?://[\w~#%&_+=,.?/-]+' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / produced blank file (output_7.txt)

8

grep -iIrPoh 'https?://[\w~#%&_+=,.?/-]+' . --include=*.txt --include=*.rtf > output.txt
FAIL: let run for 10 mins and manually killed process / produced 20.63 GB file (output_8.txt) / On the plus side, this regex captured strings that were accurate in the sense that they did NOT include any odd additional characters like curly braces or RTF file format syntax {\fldrslt

9

find . -print | grep -iIPoh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > output_9.txt
FAIL: took 1 second to run / produced blank file (output_9.txt)

10

find . -print | grep -iIrPoh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > output_10.txt
FAIL: took 1 second to run / produced blank file (output_10.txt)

11

grep -iIroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf

Editor's note: this regex only worked properly when I output strings to the terminal window. It did not work when I output to a file output_11.txt.

NEAR SUCCESS: All URL strings were cleanly cut, removing whitespace before and after each string and all special markup associated with the .RTF format. Downside: of the sample URLs tested for accuracy, some were cut short, losing their structure at the end. I'm estimating that about 10% of strings were improperly truncated.

Example of truncated string:
URL structure before the regex: http://www.youtube.com/watch?v=deOCqGMFFBE
URL structure after the regex: http://www.youtube.com/watch?v=de

The question now becomes:
1.) Is there a way to ensure we do not eliminate a part of the URL string as in the example above?
2.) Would it help to define an escape command for the regex? (if that is even possible).

12

grep -iIroh 'https?:\/\/[\w~#%&_+=,.?\/-]+' . --include=*.txt --include=*.rtf > output_12.txt
FAIL: took 1 second to run / produced blank file (output_12.txt)

13

grep -iIroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > tmp/output.txt

FAIL: let run for 2 mins and manually killed process / produced 1 GB file. The intention of this regex was to isolate grep's output file (output.txt) into a subdirectory to ensure we weren't creating an infinite loop that had grep reading back its own output. Solid idea, but no cigar (screenshot).

14

grep -iIroh 'https\?://[a-z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
FAIL: Same result as #11. The command resulted in an infinite loop with truncated strings.

15

grep -Iroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
ALMOST WINNER: This captured the entirety of the URL string. It did result in an infinite loop creating millions of strings in terminal, but I can manually identify where the first loop starts and ends so this should be fine. GREAT JOB @acheong87! THANK YOU!

16

find . -print | grep -v output.txt | xargs grep -Iroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' --include=*.txt --include=*.rtf > output.txt
NEAR SUCCESS: I was able to grab the ENTIRE URL string, which is good. However, the command turned into an infinite loop. After about 5 seconds of running output to terminal, it produced about 1 million URL strings, which were all duplicates. This would have been a good expression if we could figure out how to escape it after a single loop.

17

ls *.rtf *.txt | grep -v 'output.txt' | xargs -J {} grep -iIF 'http' {} grep -iIFo > output.txt

NEAR SUCCESS: this command resulted in a single loop through all files in the directory, which is good b/c it solved the infinite loop problem. However, the URL strings were truncated and included the name of the file they came from.

18

ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[^[:space:]]+'
NEAR SUCCESS: This expression prevented an infinite loop, which is good, and it created a small new file (about 30 KB) in the directory it was querying. It captured all the proper characters in each string, plus a couple that weren't needed. As Floris mentioned, in the instances where the URL was NOT terminated with a space - for example http://www.mumia-themovie.com/"}}{\fldrslt - it captured the markup syntax.

19

ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[a-z./?#=%_-,~&]+'
FAIL: This expression prevented an infinite loop which is good, however it did NOT capture the entire URL string.

George Jester
  • What does the output file look like? That's kind of the key here. Do you see a lot of repeated lines? If so, do successive repeated lines look like they're growing or shortening in any way? Also, please do not make needless edits. – Andrew Cheong Nov 17 '13 at 22:34
  • Hi @acheong87! I tried to open the output.txt but my computer just hangs on the process to open the file and doesn't produce it. I waited about 2 mins and then killed the process to open it. Do you know of a better way to inspect it's contents? Maybe I can use terminal to get a count of how many lines are in the file, would that be helpful? – George Jester Nov 17 '13 at 22:37
  • Perl-style regular expressions have almost always been "experimental," though it's kinda like Google's decade-long "betas." Just for fun, remove the `-P` (perl-style) flag, and try `https?://\S+` (yes, greedy). – Andrew Cheong Nov 17 '13 at 22:37
  • Gotcha, going to try now. Biab. – George Jester Nov 17 '13 at 22:39
  • I edited my previous comment as I forgot to mention removing the `-P` flag. Anyway, to view your output, you can use `head` or `tail`, _e.g._ `head -n 100 ` to see the first 100 lines. – Andrew Cheong Nov 17 '13 at 22:41
  • @GeorgeJester Don't redirect the output to a file. Find what is being matched. –  Nov 17 '13 at 22:41
  • @acheong87 I tried the following command per your updated comment: `#3 Greedy: $ grep -iIroh 'https?://\S+' . --include=*.txt --include=*.rtf > output.txt` and it produced a blank file. Going to try your suggestion of viewing the output using head. Biab. – George Jester Nov 17 '13 at 22:49
  • @NishantShreshth Good idea. I'll just print the output in the terminal and see if it works. Thanks. – George Jester Nov 17 '13 at 22:52
  • @acheong87 I added the output of the first 100 lines to my question above. I see what I "think" could be odd characters, but would be interested to hear if you think the same. – George Jester Nov 17 '13 at 23:08
  • Something is really wrong if the expression you commented above (`#3 Greedy:`) isn't working. I'd start there and simplify until you understand what the problem is. For example, try `'https?://'` to make sure it's not the `\S` that's not working. If even that fails, keep simplifying. If that works fine though, try `'\S'` alone (match a non-whitespace character). As to the odd characters, it's probably just because you specify whitespace to be delimiter. In other words, `grep` thinks those curly braces are just part of the URL. You need to be more stringent, like `'https?://[\w~#%&_+=,.?/-]+'`. – Andrew Cheong Nov 17 '13 at 23:17
  • I found [here (unrelated question)](http://stackoverflow.com/questions/2850575/what-is-the-rtf-syntax-for-a-hyperlink) an example of what you must be seeing. You must be searching through RTF files, among which the `}{\fldrslt` pattern is going to be common. You need to use the selective character class I mentioned in the previous comment. – Andrew Cheong Nov 17 '13 at 23:19
  • @acheong87 thanks a bunch for your feedback on this. I tried a combination of things we shared here, and posted the results to my question above. So far no luck, but let me know if you see any alternatives. – George Jester Nov 18 '13 at 00:01
  • Your tenacity is admirable. Let's try one more thing. Let's try removing the `-P` once more, but try a strict GNU `grep`-style syntax. Try `'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+'`. Note that the quantifiers are escaped, and we're avoiding shortcuts like `\w`. Any of the examples you tried without the `-P` may have failed because the quantifiers weren't converted. Now, on the Perl end, you might wanna try escaping the forward slashes, _i.e._ 'https?:\/\/[\w~#%&_+=,.?\/-]+' since they are technically special characters in Perl, though, it doesn't make sense here. Can't hurt to try. – Andrew Cheong Nov 18 '13 at 00:07
  • @acheong87 OK; thanks for the quick ping back. I'm going to give that a shot. Quick thought though: after looking at the output.txt file a bit more, I think the reason it is growing so large in size is because it's looping through the 26 files in my directory over and over. Is there a way to exit the loop upon completing the last file in my directory? – George Jester Nov 18 '13 at 00:32
  • Huh. Maybe the problem is your `grep` following symbolic links recursively? Take out the `-r` and execute your command in a few directories, manually—do you get desired results? If so, drop the recursive option in `grep`, and instead use `find`. I'm a little rusty on `find` but it's something like `find . -print | grep -iIrPoh ... > output.txt`. By the way, `tee` is useful for seeing your output _as_ you write it, _e.g._ `find . -print | grep -iIrPoh ... | tee output.txt`. I'll type the comments up into an answer if we figure out what's wrong. – Andrew Cheong Nov 18 '13 at 00:42
  • @acheong87 we are so close! I tried a couple regex we spoke about here and posted results to the question above. There was a great success in your recommended "strict GNU grep-style syntax" (number 11 above). Please see my feedback though, just a small tweak and I think we may have it. You're a super hero, thanks a bunch for your efforts thus far. – George Jester Nov 18 '13 at 03:56
  • Hm, odd, the URL breaks at the first uppercase! There's something wrong with either the `a-zA-Z` or the `-i` flag. Modify #11 to have just `a-z` (not `A-Z`). If that doesn't work, restore the `a-zA-Z` and now try removing the `-i` flag. Also, I realized I made a mistake with the `find` version—I think you should give it a shot though because it can get around the problem mentioned by @Alex—like this (and this time remembering `xargs`): `find . -print | grep -v output.txt | xargs grep -Iroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' --include=*.txt --include=*.rtf > output.txt`. – Andrew Cheong Nov 18 '13 at 04:18
  • (I edited the previous comment a bunch of times so you might want to reload.) Note in the `find` version, you do _not_ supply `.` to the last `grep`, because the search targets are supplied by the `xargs` from the `find`, after excluding (`grep -v`) "output.txt". – Andrew Cheong Nov 18 '13 at 04:21
  • This is an intriguing problem! It seems there are two different problems - one that suggests there is "recursion" (some kind of loop to generate huge files), and another which makes the regex inefficient/wrong. What happens if you simplify the regex: `ls *.rtf *.txt | grep -v 'output.txt' | xargs -J {} grep -iIF 'http' {} grep -iIFo > output.txt` ? This should pass a list of all `.txt` and `.rtf` files (but not `output.txt`) to `grep` for "exact" (`-F`, non-regex) matching. This definitely should complete in reasonable time; then you can remove the `-F` and start adding embellishments. – Floris Nov 18 '13 at 05:18
  • @acheong87 YOU NAILED THE STRING CAPTURE! I removed the `-i` which gave me: `grep -Iroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+'` and it captured the strings in their entirety. See my comment on #15. I also tried the other suggestions you had and placed my feedback above. The infinite loop was still present with regex #15, if you feel like going the extra mile, I'm open to hearing your suggestions on how to prevent the infinite loop. Conversely, feel free to write an answer to this thread so I can shower it with up votes. I think we are good. Thank you! – George Jester Nov 18 '13 at 05:32
  • @Floris that is a great recap of the issues at hand, indeed it has been pretty interesting to troubleshoot. I'm going to try your suggestion and let you know how it turns out. – George Jester Nov 18 '13 at 05:35
  • Another question - does it keel over if you just look for the `.txt` files? What if you just look for the `.rtf` files? What if you do this one file at a time - is there a specific file that causes this to fail? With just 26 files you could run this one at a time and see. Just looking for possible trouble shooting approaches. – Floris Nov 18 '13 at 05:46
  • @Floris I tried your command and posted results to #17. It solved the infinite loop problem. :) WRT trying a regex on each file to identify if a specific file was causing the infinite loop - I think that is a good idea and I will give it a shot in the coming days and let you know. – George Jester Nov 18 '13 at 05:53
  • Happy the infinite loop problem went away. I suspect that had something to do with the way I fed the inputs to grep - with the method I used you DO NOT NEED the `--include` statements any more. I am thinking `ls *.rtf *.txt | grep -v 'output.txt' | xargs -J {} grep -iIoh 'https\?://[a-z./?#=%_-,~&]\+' {} > output.txt` might be quite close... – Floris Nov 18 '13 at 06:11
  • Please move extended discussion to [chat], thanks :) – Tim Post Nov 18 '13 at 07:28

3 Answers


The expression I had given in the comments (your test 17) was intended to test for two things:

1) can we make the infinite loop go away
2) can we loop over all files in the directory cleanly

I believe we achieved both. So now I am audacious enough to propose a "solution":

ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[^[:space:]]+'

Breaking it down:

ls *.rtf *.txt         - list all .rtf and .txt files
grep -v 'output.txt'   - skip 'output.txt' (in case it was left from a previous attempt)
xargs                  - take each line of the input in turn and substitute it
                         at the end of the following command
                         (or use -J xxx to substitute at the place of xxx anywhere in the command)
grep -i                - case insensitive
     -I                - skip binary (shouldn't have any since we only process .txt and .rtf...)
     -o                - print only the matched bit (not the entire line), i.e. just the URL
     -h                - don't include the name of the source file
     -E                - use extended regular expressions 

     'http             - match starts with http (there are many other URLs possible... but out of scope for this question)
      s?               - next character may be an s, or is not there
      ://              - literal characters that must be there
      [^[:space:]]+    - one or more "non space" characters (greedy... "as many as possible")

This seemed to work OK on a very simple set of files / URLs. I think that now that the iterating problem is solved, the rest is easy. There are tons of "URL validation" regexes online. Pick any one of them... the above expression really just searches for "everything that follows http until a space". If you end up with odd or missing matches let us know.
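To send the results to a file instead of the terminal, a sketch that puts the pieces from this thread together might look like the line below. It assumes output.txt lives in the same directory (which is why it is filtered out of the file list before xargs hands the names to grep); the extra " in the character class is one possible tweak for URLs that end at a quote rather than a space (see the comment below):

ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[^[:space:]"]+' > output.txt

Other "terminator" characters left over from the RTF markup (} for example) can be excluded the same way by adding them inside the brackets.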

Floris
  • Looking again at the earlier outputs, I see that you have situations where the URL is not terminated with a space - for example `http://www.mumia-themovie.com/"}}{\fldrslt` which you want to end at the `"`, presumably. In which case you don't want `[^[:space:]]`, but something like `grep -iIohE 'https?://[a-z./?#=%_-,~&]+'` – Floris Nov 18 '13 at 17:05
  • thanks a bunch for the breakdown. I ran the 2 regexes you recommended here and added the results to my question, see #18 and #19. – George Jester Nov 19 '13 at 00:12
  • Very happy to see I could help! If you have some "improper" URLs, do you know what character indicated the "end of URL" if it's not a space? Is it `"`? In that case, I _believe_ using `[^[:space:]"]` instead of just `[^[:space:]]` should do the trick. Add other "terminator characters" as needed inside the outer brackets... – Floris Nov 19 '13 at 00:22

I'm guessing a bit but for a line like

http://a.b.com something foo bar

the pattern can match as

http://a.b.com

http://a.b.com something

http://a.b.com something foo

(always with space at the end).

But I don't know if grep tries to match the same line multiple times.

Better try

'https?://\S+\s'

as the pattern
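A quick check of what that pattern matches on a sample line (a sketch; it assumes the -P flag behaves on your grep build, which the comments suggest may not be the case here):

# the \s-terminated pattern only finds a URL that is followed by whitespace
echo 'see http://a.b.com and http://c.d.com' | grep -oP 'https?://\S+\s'
# prints: http://a.b.com

# dropping the trailing \s also catches a URL at the very end of the line
echo 'see http://a.b.com and http://c.d.com' | grep -oP 'https?://\S+'
# prints: http://a.b.com
#         http://c.d.com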

Michael Butscher
  • Thank you! Going to give this is a shot and let you know how it turns out. – George Jester Nov 17 '13 at 22:54
  • I tried the following command per your suggestion, `grep -iIrPoh 'https?://\S+\s' . --include=*.txt --include=*.rtf > output_6.txt` and it resulted in a blank file with no strings, but it processed the command in a couple seconds. :) So it seems your recommendation made it run quicker, which is good. – George Jester Nov 17 '13 at 23:20

"What could be causing the output.txt file to be so many orders of maginitude larger than the entire directory?" me thinks you are running a cycle with grep reading back its own output? Try directing the output to > ~/tmp/output.txt.

Dr. Alex RE
  • That is huge! Really good thought. Going to try it now! BIAB. – George Jester Nov 18 '13 at 03:59
  • Wow. I'm putting my money on this theory. It reminds me of how people sometimes do things like `ps -ef | grep [c]ron`, so that `grep` wouldn't catch itself in the process list. – Andrew Cheong Nov 18 '13 at 04:05
  • This was a great idea, but sadly didn't pan out. I ended up with a huge output.txt file of 1GB. More details can be found in attempt #13 above. TY. – George Jester Nov 18 '13 at 04:49
  • Silly question - but I assume you had deleted your old `output.txt` file when you tested this? Because if you had not, it would be parsed, with disastrous consequences... – Floris Nov 18 '13 at 05:07
  • @Floris good point. I did delete the old output.txt each time I tried running a new command. TY – George Jester Nov 18 '13 at 05:36