I am very new to Hadoop and do not understand the concepts well yet. I followed the process below:
Installed Hadoop by following the guide here
Tried the basic examples in the tutorial here, plus the wordcount example in Python, and both work fine.
What I am actually trying to do (the requirement I was given) is to process Apache log files on Fedora (Linux), located at /var/log/httpd,
with Hadoop using Python, producing output in the format below:
IP address    Count of IP    Pages accessed by IP address
I know that Apache log files are of two kinds:
access_logs
error_logs
but I really cannot understand the format of the Apache log files.
My Apache log file content looks something like this:
::1 - - [29/Oct/2012:15:20:15 +0530] "GET /phpMyAdmin/ HTTP/1.1" 200 6961 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.77 Safari/537.1"
::1 - - [29/Oct/2012:15:20:16 +0530] "GET /phpMyAdmin/js/cross_framing_protection.js?ts=1336063073 HTTP/1.1" 200 331 "http://localhost/phpMyAdmin/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.77 Safari/537.1"
::1 - - [29/Oct/2012:15:20:16 +0530] "GET /phpMyAdmin/js/jquery/jquery-1.6.2.js?ts=1336063073 HTTP/1.1" 200 92285 "http://localhost/phpMyAdmin/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.77 Safari/537.1"
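From what I can tell, these lines look like Apache's "combined" log format (client address, identity, user, timestamp, request line, status, size, referer, user agent). Here is a minimal parsing sketch I put together; the field names are my own guesses, so please correct me if the format is something else:

```python
import re

# Tentative regex for Apache's combined log format (field names are my assumption)
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) '         # client address (::1 here, i.e. IPv6 localhost)
    r'(?P<identity>\S+) '     # RFC 1413 identity, usually "-"
    r'(?P<user>\S+) '         # authenticated user, "-" if none
    r'\[(?P<time>[^\]]+)\] '  # request timestamp
    r'"(?P<request>[^"]*)" '  # request line: method, path, protocol
    r'(?P<status>\d{3}) '     # HTTP status code
    r'(?P<size>\S+) '         # response size in bytes, "-" if none
    r'"(?P<referer>[^"]*)" '  # Referer header
    r'"(?P<agent>[^"]*)"'     # User-Agent header
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None
```

Running parse_line on the first sample line above should give host '::1', request 'GET /phpMyAdmin/ HTTP/1.1', and status '200', if my reading of the format is right.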
Can anyone please explain the structure of the above Apache log files?
I am confused about how to process the log files into that data: IP address, count per IP address, and pages accessed by each IP address.
Can anyone let me know how to process the Apache log files with Hadoop using Python, given the above information, and store the result in the above-mentioned format?
Also, can anyone please provide some basic Python code for processing the Apache log files into the above format, so that I get a practical idea of how to process the files with Python and can extend it according to my needs?
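For reference, here is a rough sketch of what I imagine the Hadoop Streaming map and reduce steps might look like, based on the wordcount example I tried. The regex and the output layout are my own assumptions; in a real Streaming job the two functions would live in separate mapper.py and reducer.py scripts that read sys.stdin and print tab-separated lines:

```python
import re
from itertools import groupby

# Assumed: pull the client address and the requested page out of each log line
IP_PAGE_RE = re.compile(r'(\S+) \S+ \S+ \[[^\]]+\] "(?:\S+) (\S+)')

def mapper(lines):
    """Map phase: emit one (ip, page) pair per parsable log line."""
    for line in lines:
        m = IP_PAGE_RE.match(line)
        if m:
            yield m.group(1), m.group(2)

def reducer(pairs):
    """Reduce phase: given (ip, page) pairs sorted by ip (Hadoop sorts
    mapper output by key), emit (ip, request count, distinct pages)."""
    for ip, group in groupby(pairs, key=lambda kv: kv[0]):
        pages = [page for _, page in group]
        yield ip, len(pages), sorted(set(pages))
```

I believe the job would then be launched with something along the lines of `hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input /var/log/httpd/... -output ...`, but I am not sure of the exact invocation.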