Background:
- I have a directory called "stuff" with 26 files (2 .txt and 24 .rtf) on Mac OS X 10.7.5.
- I'm using grep (GNU v2.5.1) to find all strings within these 26 files that match the structure of a URL, then print them to a new file (output.txt).
- The regex below does work on a small scale. I ran it on a directory with 3 files (1 .rtf and 2 .txt) containing a bunch of dummy text and 30 URLs, and it executed successfully in less than 1 second.
I am using the following regex:
1
grep -iIrPoh 'https?://.+?\s' . --include=*.txt --include=*.rtf > output.txt
Problem
The current size of my directory "stuff" is 180 KB with 26 files. In terminal, I cd to this directory (stuff) then run my regex. I waited about 15 minutes and decided to kill the process as it did NOT finish. When I looked at the output.txt file, it was a whopping 19.75GB (screenshot).
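(To check whether the file is still growing without killing grep, a second terminal window can poll its size every few seconds. A small sanity-check loop, nothing more:
while sleep 5; do ls -lh output.txt; done)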
Question
- What could be causing the output.txt file to be so many orders of magnitude larger than the entire directory?
- What more could I add to my regex to streamline the processing time?
Thank you in advance for any guidance you can provide here. I've been working on many different variations of my regex for almost 16 hours and have read tons of posts online, but nothing seems to help. I'm new to writing regex, but with a small bit of hand-holding, I think I'll get it.
Additional Comments
I ran the following command to see what was being recorded in the output.txt (19.75GB) file. It looks like the regex is finding the right strings, except for what I think are odd characters like curly braces } {
and a string like {\fldrslt (see the sketch after the terminal output below).
**TERMINAL**
$ head -n 100 output.txt
http://michacardenas.org/\
http://culturelab.asc.upenn.edu/2013/03/06/calling-all-wearable-electronics-hackers-e-textile-makers-and-fashion-activists/\
http://www.mumia-themovie.com/"}}{\fldrslt
http://www.mumia-themovie.com/}}\
http://www.youtube.com/watch?v=Rvk2dAYkHW8\
http://seniorfitnesssite.com/category/senior-fitness-exercises\
http://www.giac.org/
http://www.youtube.com/watch?v=deOCqGMFFBE"}}{\fldrslt
http://www.youtube.com/watch?v=deOCqGMFFBE}}
https://angel.co/jason-a-hoffman\
https://angel.co/joyent?save_req=mention_slugs"}}{\fldrslt
http://www.cooking-hacks.com/index.php/ehealth-sensors-complete-kit-biometric-medical-arduino-raspberry-pi.html"}}{\fldrslt
http://www.cooking-hacks.com/index.php/ehealth-sensors-complete-kit-biometric-medical-arduino-raspberry-pi.html}}
http://www.cooking-hacks.com/index.php/documentation/tutorials/ehealth-biometric-sensor-platform-arduino-raspberry-pi-medical"}}{\fldrslt
http://www.cooking-hacks.com/index.php/documentation
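Since those stray tails are RTF field syntax, one variation (a sketch I have not verified against these files; -E is plain POSIX ERE, so no -P is needed) is to stop the match at the quote, brace, and backslash characters that RTF markup uses, and to write the result outside the directory being searched:
grep -iIrohE 'https?://[^[:space:]"\\{}]+' . --include=*.txt --include=*.rtf > ../output.txt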
Catalog of regex commands tested so far
2
grep -iIrPoh 'https?://\S+' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / produced blank file (output_2.txt)
3
grep -iIroh 'https?://\S+' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / produced blank file (output_3.txt)
4
grep -iIrPoh 'https?://\S+\s' . --include=*.txt --include=*.rtf > sixth.txt
FAIL: took 1 second to run / produced blank file (output_4.txt)
5
grep -iIroh 'https?://' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / produced blank file (output_5.txt)
6
grep -iIroh 'https?://\S' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / produced blank file (output_6.txt)
7
grep -iIroh 'https?://[\w~#%&_+=,.?/-]+' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / produced blank file (output_7.txt)
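A possible reason #3, #5, #6, and #7 come back blank (a guess, untested): without -P or -E, grep treats the pattern as a basic regular expression, where ? and + are literal characters, so 'https?' only matches text containing a literal question mark. Forcing ERE with -E would test that theory:
grep -iIrohE 'https?://\S+' . --include=*.txt --include=*.rtf --exclude=output.txt > output.txt
(\S is a GNU extension; [^[:space:]] is the portable spelling.)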
8
grep -iIrPoh 'https?://[\w~#%&_+=,.?/-]+' . --include=*.txt --include=*.rtf > output.txt
FAIL: let it run for 10 mins and manually killed the process / produced a 20.63 GB file (output_8.txt) / On the plus side, this regex captured strings that were accurate in the sense that they did NOT include any odd additional characters like curly braces or RTF syntax such as {\fldrslt
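If the runaway file really is grep re-reading its own output (the theory tested in #13 below), the same pattern with the output file excluded should settle it one way or the other (a sketch; --exclude should be available in GNU grep 2.5.x):
grep -iIrPoh 'https?://[\w~#%&_+=,.?/-]+' . --include=*.txt --include=*.rtf --exclude=output.txt > output.txt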
9
find . -print | grep -iIPoh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > output_9.txt
FAIL: took 1 second to run / produced blank file (output_9.txt)
10
find . -print | grep -iIrPoh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > output_10.txt
FAIL: took 1 second to run / produced blank file (output_10.txt)
11
grep -iIroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
Editor's note: this regex only worked properly when I printed the strings to the terminal window. It did not work when I redirected the output to a file (output_11.txt).
NEAR SUCCESS: All URL strings were cleanly cut, removing whitespace before and after each string as well as all special markup associated with the .RTF format. Downside: of the sample URLs tested for accuracy, some were cut short, losing the end of their structure. I estimate that about 10% of strings were improperly truncated.
Example of truncated string:
URL structure before the regex: http://www.youtube.com/watch?v=deOCqGMFFBE
URL structure after the regex: http://www.youtube.com/watch?v=de
The question now becomes:
1.) Is there a way to ensure we do not eliminate part of the URL string, as in the example above?
2.) Would it help to define an escape command for the regex (if that is even possible)?
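One idea for question 1.), along the lines of #18 below (a sketch, not verified here): instead of whitelisting URL characters, match any run of non-whitespace with a POSIX class, which cannot stop partway through a token the way a too-small whitelist can:
grep -IrohE 'https?://[^[:space:]]+' . --include=*.txt --include=*.rtf --exclude=output.txt > output.txt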
12
grep -iIroh 'https?:\/\/[\w~#%&_+=,.?\/-]+' . --include=*.txt --include=*.rtf > output_12.txt
FAIL: took 1 second to run / produced blank file (output_12.txt)
13
grep -iIroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > tmp/output.txt
FAIL: let it run for 2 mins and manually killed the process / produced a 1 GB file. The intention of this regex was to isolate grep's output file (output.txt) into a subdirectory to ensure we weren't creating an infinite loop that had grep reading back its own output. Solid idea, but no cigar (screenshot). Perhaps the problem is that tmp/ still sits inside the directory being searched, so grep still reads tmp/output.txt back in.
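If that is right, writing one level up instead of into a subdirectory would take the file out of grep's reach entirely (same command, different destination; untested here):
grep -iIroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > ../output.txt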
14
grep -iIroh 'https\?://[a-z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
FAIL: Same result as #11. The command resulted in an infinite loop with truncated strings.
15
grep -Iroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
ALMOST WINNER: This captured the entirety of the URL string. It did result in an infinite loop, creating millions of strings in the terminal, but I can manually identify where the first pass starts and ends, so this should be fine. GREAT JOB @acheong87! THANK YOU!
16
find . -print | grep -v output.txt | xargs grep -Iroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' --include=*.txt --include=*.rtf > output.txt
NEAR SUCCESS: I was able to grab the ENTIRE URL string, which is good. However, the command turned into an infinite loop. After about 5 seconds of printing to the terminal, it had produced about 1 million URL strings, all duplicates. This would have been a good expression if we could figure out how to stop it after a single pass.
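A guess at why #16 loops: find . -print emits . itself, and the -r flag then makes grep re-walk the entire tree for every batch xargs hands it, so every file is matched many times over. Letting find do all the filtering and dropping -r avoids both problems (untested sketch):
find . -type f \( -name '*.txt' -o -name '*.rtf' \) ! -name output.txt -exec grep -IohE 'https?://[^[:space:]]+' {} + > output.txt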
17
ls *.rtf *.txt | grep -v 'output.txt' | xargs -J {} grep -iIF 'http' {} grep -iIFo > output.txt
NEAR SUCCESS: this command resulted in a single pass through all files in the directory, which is good because it solved the infinite loop problem. However, the URL strings were truncated and prefixed with the filename they came from.
18
ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[^[:space:]]+'
NEAR SUCCESS: This expression prevented an infinite loop, which is good, and it created a new file in the directory it was querying, which was small, about 30KB. It captured all the proper characters in each string, plus a few that were not needed. As Floris mentioned, in the instances where the URL was NOT terminated with a space - for example http://www.mumia-themovie.com/"}}{\fldrslt - it captured the markup syntax.
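Since the leftover tails always begin with a quote, brace, or backslash, another option (a sketch, not verified on this data) is to let #18 over-match and then trim the tail in a post-filter:
ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[^[:space:]]+' | sed 's/["\\{}].*$//' > output.txt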
19
ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[a-z./?#=%_-,~&]+'
FAIL: This expression prevented an infinite loop, which is good; however, it did NOT capture the entire URL string.
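If the short matches come from the character class itself (it has no digits, and the unescaped - sandwiched between _ and , may be read as a character range rather than a literal hyphen), a repaired class might behave better (a sketch; the - is moved to the end so it is literal):
ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[a-z0-9./?#=%_,~&-]+' > output.txt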