I need to write a Python script that reads a large logfile (1 GB+), extracts the IP addresses on each line, removes duplicates, looks up the hostnames for these IPs in another file, and writes the original log data plus the hostnames to a new logfile.
Now the question: what is the best way to deal with memory, files, etc.? I see two approaches:
- Read the original logfile, extract the IPs, and write them to a new file (`tmp_IPS.txt`); remove the duplicates; search for these IPs line by line in the other file (`hostnames.txt`); write the results back to `tmp_IPS.txt`; then read it and rewrite the original logfile. In this case I process fewer IPs (no duplicates).
- Read the original logfile, extract the IPs, and look up each IP in `hostnames.txt` as I go, writing each original row plus its hostname to the output. In this case I process a lot of duplicated IPs.

I could also write the found IPs and hostnames to a new file, or keep them in memory, but I really don't know which is better (a rough sketch of the first approach is below).