I am supposed to determine whether a given file is a media file, not from its extension but from its header information. So I opened a .MOV file in Emacs to see what is inside and what could be done. On analyzing the contents, I found that the interesting strings were not only in the first line (the header information) but also in the last few lines. So the strings I was looking for were in a handful of lines at both the start and the end of the file.
Finding specific strings manually was impractical, so I thought of automating the process.
For example, this was the first line:
\00\00\00 ftypqt \00qt \00\00\00\00\00\00\00\00\00\00\00\00\00\00\00wide\00\CF\E1mdat\00\00\00wide\00\00\00\00mdat\00\00\00\00\00\00\00\00\E0\00\00\00\00\FF\A6\00\00\00\00\00\00 \00\00\00\008\00\00\82X\00\00\00@\80\00\87\F4N\CD
The last line was:
\F7\00\80\004\8D\00Z\A2\00\84p\00\9D\8F\00\B6\A5\00\CDt\00\DF\00\ED\8F\007\004\8C\00A\9D\00\00\00udta\00\00\00\00\00\00\00Wudta\00\00\00hinv7.6\00\00\00@hnti\00\00\008rtp sdp b=AS:265 b=TIAS:259 a=maxprate:31.000000 \00\00\00\00
So I had to scan the whole file line by line for specific types of strings. But first I had to know which strings to look for in each line of the file. So I thought of scanning some random media files and extracting the contents that looked like a word to me (inside these files a word did not necessarily have a space character on either side; what I was looking for was any run made up of the characters a-z, A-Z and 0-9).
Given this scenario, the first thing that came to mind was to use regex. But I later learned from SO that awk can do paragraph-oriented operations.
Then I came across a page saying that:
Emacs Lisp is a good choice if you need sophisticated string or pattern matching capabilities.
So, finally, I want to go through each file (various files with extensions like flv, mp3, mov, avi, mp4, mkv and so on) and look for words, meaning anything that looks like a word to me: any run of at least 3 consecutive English letters. For example, in the first line/header information quoted above you can see ftypqt, which consists of English letters and has more than 3 consecutive characters. Then I want to write those words into a different file, so that I can open that file and see only the words picked from each line of each file.
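To make the goal concrete, here is a minimal sketch of what I mean, written in Python only for illustration (it is not one of the tools I listed). The names extract_words and main and the tab-separated output layout are my own assumptions, and so is the choice to match letters only rather than letters and digits. It treats each file as raw bytes instead of lines, since a binary file has no real line structure.

```python
import re
import sys

# A "word" here is assumed to be a run of at least 3 consecutive ASCII letters.
# To also accept digits, change the class to [A-Za-z0-9].
WORD_RE = re.compile(rb"[A-Za-z]{3,}")


def extract_words(data):
    """Return every word-like run found in a bytes object, in order."""
    return [match.decode("ascii") for match in WORD_RE.findall(data)]


def main(paths, out_path):
    """Write one 'file<TAB>word' line to out_path for every word found."""
    with open(out_path, "w", encoding="ascii") as out:
        for path in paths:
            # Reads the whole file into memory; fine for a sketch,
            # but large media files would need chunked reading.
            with open(path, "rb") as f:
                data = f.read()
            for word in extract_words(data):
                out.write(f"{path}\t{word}\n")


if __name__ == "__main__":
    # Hypothetical usage: python extract_words.py OUTPUT_FILE MEDIA_FILE...
    if len(sys.argv) >= 3:
        main(sys.argv[2:], sys.argv[1])
    else:
        print("usage: extract_words.py OUTPUT_FILE MEDIA_FILE...", file=sys.stderr)
```

On the start of the first line shown above, this would pick out ftypqt, wide and mdat, and skip the two-letter qt.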
Can anyone please give me some idea of which would be the better choice: regex, awk, Emacs Lisp, or anything else? Please forgive me if my English is bad.