
I need to write a Python script to read a large logfile (1GB+), extract the IP addresses on each line, store these IPs, remove duplicates, look up the hostnames for these IPs in another file, and write the hostnames to a new logfile along with the original data.

Now the question: what is the best way to deal with memory, files, etc.? I mean, I see two approaches:

  1. Read the original logfile, extract the IPs and write them to a new file (tmp_IPS.txt), remove duplicates, search for these IPs line by line in another file (hostnames.txt), write the results back to tmp_IPS.txt, then read and rewrite the original logfile. In this case, I will process fewer IPs (no duplicates).
  2. Read the original logfile, extract the IPs and look up each IP in hostnames.txt, then write each row of the original logfile plus its hostname. In this case, I will process a lot of duplicate IPs. I could also cache the found IPs and hostnames in a new file or in memory, but I really don't know which is better.
doublezero

4 Answers


I foresee two possible scenarios for this fairly common task, so I'll comment briefly on each.

Scenario 1) You will reuse the logfile data, either to run multiple queries against it or to create one or more output files from it.

  1. Start by measuring how long it takes to build an efficient in-memory data structure from the whole file using Python's built-in types (see the sketch after this list). If reading the whole logfile and building a simple dictionary from it only takes a few seconds, it's probably not worth spending more time on a much more complex solution.

  2. Was the previous step a very expensive operation? If so, and you're going to reuse the input data often, I'd probably create a database out of it (NoSQL or relational, depending on the type of processing). If you're going to use the logfile data often, this can be worthwhile.
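For instance, a minimal timing sketch for step 1 (the logfile name and the naive IPv4 regex are assumptions here) could look like:

import re
import time

ip_re = re.compile(r'\b\d{1,3}(?:\.\d{1,3}){3}\b')   # naive IPv4 matcher

start = time.perf_counter()
unique_ips = set()
with open('big.log') as f:                            # the 1GB+ logfile
    for line in f:
        unique_ips.update(ip_re.findall(line))
print(f'{len(unique_ips)} unique IPs in {time.perf_counter() - start:.1f}s')

If that finishes in a few seconds, a plain in-memory structure is almost certainly enough.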

Scenario 2) You just want to process the input data once and then throw the script away.

If this is the case, the easiest approach is to start by extracting a small subset of the huge logfile so you can iterate as fast as possible. Once you have this data, write the script that performs the whole process against it; once the script is tested and ready to go, just let it run on the full file (I'd bet that a simple script like that takes much less than a minute).
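For example, a quick way to carve out such a test subset (the filenames and the 10,000-line cutoff are arbitrary assumptions):

from itertools import islice

# Copy the first 10,000 lines of the huge logfile into a small test fixture.
with open('huge.log') as src, open('sample.log', 'w') as dst:
    dst.writelines(islice(src, 10000))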

That said, the fact that you've reached a point where you need to parse and process a huge logfile like this is an indicator that maybe you should start thinking about storing log data in a more efficient manner, for instance with solutions such as Kibana or similar.

BPL
  • BPL, thank you! My intent is to use this script as a basis for some incident response cases, and the logfiles will vary from case to case. My first idea was to read the logfiles in chunks, work with these chunks and go on. I will consider using a database for the hostnames at least. I'll report back after some tests – doublezero Jul 10 '18 at 02:13
  1. You should read in hostnames.txt and map the IPs to hostnames using a dict.

  2. You should then read the logfile in batches and look up the hostname for each IP with dict.get(), e.g. host = host_dict.get(ip, None)

  3. I'm not sure what you mean when you say

    rewrite the hostnames to a new logfile containing the original data

But you can open() a file for appending very simply with

with open('new_logfile', 'a') as logfile:
    logfile.write(data_to_append)
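Putting steps 1 and 2 together, a rough sketch (the "IP hostname" per-line format of hostnames.txt, the input filename, and the naive IPv4 regex are assumptions) might look like:

import re

host_dict = {}
with open('hostnames.txt') as f:                      # assumed "IP hostname" per line
    for line in f:
        parts = line.split()
        if len(parts) >= 2:
            host_dict[parts[0]] = parts[1]

ip_re = re.compile(r'\b\d{1,3}(?:\.\d{1,3}){3}\b')    # naive IPv4 matcher

with open('original.log') as src, open('new_logfile', 'a') as logfile:
    for line in src:
        match = ip_re.search(line)
        host = host_dict.get(match.group(0), None) if match else None
        # Keep the original line and append the resolved hostname (if any).
        logfile.write(f'{line.rstrip()} {host or "unknown"}\n')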
aydow
  • For 1 and 2, I'm gonna check that. For 3, I don't want to modify the original file, but I'll need the original data in the same file as the hostnames. Thank you! – doublezero Jul 10 '18 at 02:01

The most efficient way to deal with large log files in this case is to read and write at the same time, line by line, to avoid loading the large file into memory. You should load the IP-to-hostname mapping file hostnames.txt into a dict first if hostnames.txt is relatively small; otherwise, consider storing the mapping in an indexed database.
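If the mapping file were too big for a dict, a minimal sqlite3 sketch (the table name, database filename, and "IP hostname" line format are assumptions) could be:

import sqlite3

con = sqlite3.connect('hostnames.db')
con.execute('CREATE TABLE IF NOT EXISTS hosts (ip TEXT PRIMARY KEY, hostname TEXT)')
with open('hostnames.txt') as f:            # one-time load of the mapping
    for line in f:
        parts = line.split()
        if len(parts) >= 2:
            con.execute('INSERT OR REPLACE INTO hosts VALUES (?, ?)', parts[:2])
con.commit()

# Later, while streaming the logfile, look up one IP at a time:
row = con.execute('SELECT hostname FROM hosts WHERE ip = ?', ('10.0.0.1',)).fetchone()
print(row[0] if row else 'unknown')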

blhsing
  • Thank you blhsing. The hostnames file is no more than 30MB; I really don't know if it's good practice to keep a dict that size in memory. About using a database, it will be a new thing to me in Python, but I'll consider that. – doublezero Jul 10 '18 at 02:05
  • You're welcome. 30MB is small enough for a dict. Use a dict then. – blhsing Jul 10 '18 at 02:06

How do you plan on removing duplicates from tmp_IPS.txt? Perhaps it makes more sense to simply avoid writing duplicate IP addresses to the file in the first place (and to avoid storing duplicate IP addresses in memory).

In terms of speed with Python file I/O, it can depend on the version of Python you're using. Assuming you've chosen Python 3, a loop such as:

for line in file:
    # Code to deal with the string on each line

will probably suit your use case very well, since it iterates lazily one line at a time and never loads the whole file into memory, assuming your file is well formatted.

I would suggest the following strategy:

  1. Open your raw file for reading
  2. Open a new file to store your desired output for writing
  3. Iterate over the raw file and only write to the new file when you find non-duplicate IPs (This will require storing all previously found IPs in memory)

There are many ways you could store the IP addresses in memory; some are more space-efficient, others more time-efficient. It depends on the hardware you are using and how much data you need to hold at once.
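For instance, a plain set is usually a sensible default; a minimal sketch of the three steps above (the filenames and the naive IPv4 regex are assumptions) could be:

import re

ip_re = re.compile(r'\b\d{1,3}(?:\.\d{1,3}){3}\b')    # naive IPv4 matcher
seen = set()                                          # all IPs found so far

with open('original.log') as raw, open('tmp_IPS.txt', 'w') as out:
    for line in raw:
        for ip in ip_re.findall(line):
            if ip not in seen:        # write each IP only the first time we see it
                seen.add(ip)
                out.write(ip + '\n')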

Woohoojin