I have a log file, "internet.log", which is about 10 GB. When I parse it in Python I get a MemoryError exception. The log file looks something like this:

Jun 15 16:26:21 dnsmasq[1979]: reply fd-geoycpi-uno.gycpi.b.yahoodns.net is 74.6.160.107
Jun 15 16:26:21 dnsmasq[1979]: reply fd-geoycpi-uno.gycpi.b.yahoodns.net is 216.115.100.123
Jun 15 16:26:21 dnsmasq[1979]: reply fd-geoycpi-uno.gycpi.b.yahoodns.net is 216.115.100.124
Jun 15 16:26:21 dnsmasq[1979]: reply fd-geoycpi-uno.gycpi.b.yahoodns.net is 74.6.160.106
Jun 15 16:26:21 dnsmasq[1979]: query[A] fd-geoycpi-uno.gycpi.b.yahoodns.net from 192.168.1.33
Jun 15 16:26:21 dnsmasq[1979]: cached fd-geoycpi-uno.gycpi.b.yahoodns.net is 74.6.160.106
Jun 15 16:26:21 dnsmasq[1979]: cached fd-geoycpi-uno.gycpi.b.yahoodns.net is 216.115.100.124
Jun 15 16:26:21 dnsmasq[1979]: cached fd-geoycpi-uno.gycpi.b.yahoodns.net is 216.115.100.123
Jun 15 16:26:21 dnsmasq[1979]: cached fd-geoycpi-uno.gycpi.b.yahoodns.net is 74.6.160.107
Jun 15 16:26:23 dnsmasq[1979]: query[A] armdl.adobe.com from 192.168.1.24

I'm currently using this method to parse the log file:

def parse():
    Date = []
    IPAddress = []
    DomainsVisited = []
    with open("internet.log", "r") as file:
        content = file.readlines()
        for items in content:
            if 'query[A]' in items:
                getDate(Date, items)
                getIPAddress(IPAddress, items)
                getDomainsVisited(DomainsVisited, items)
    finalResult = [[i, j, k] for i, j, k in zip(Date, IPAddress, DomainsVisited)]
    return display(finalResult)

If I parse a log file of about 10 MB, the output is displayed, but when I parse the 10 GB log file I get the error. How can I fix this? Thank you.

Manoj Jahgirdar

2 Answers

You should not use file.readlines(). It reads the whole file into memory at once, and a 10 GB file will very likely exhaust it. Instead, iterate over the file object, which yields one line at a time:

with open("internet.log", "r") as file:
    for items in file:  # reads one line at a time
        ...  # process the line as before

(Of course, depending on what you do with the data, this could still break: if you keep accumulating results in lists as you go through the file, memory can run out before you reach the end.)
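
One way to avoid accumulating everything is to stream the records through a generator and keep only a summary. The following is a minimal sketch, not the asker's actual code: the regular expression, iter_queries and count_domains are hypothetical names based only on the sample log lines in the question, and counting visits per domain is just one example of a bounded-memory result.

```python
import re
from collections import Counter

# Hypothetical pattern for dnsmasq "query[A]" lines, inferred from the sample log.
# Groups: timestamp, queried domain, client IP address.
QUERY_RE = re.compile(r"^(\w{3}\s+\d+ [\d:]+) \S+ query\[A\] (\S+) from (\S+)$")


def iter_queries(lines):
    """Yield (date, ip, domain) for each query[A] line, one record at a time."""
    for line in lines:
        match = QUERY_RE.match(line.rstrip())
        if match:
            date, domain, ip = match.groups()
            yield date, ip, domain


def count_domains(lines):
    """Count queries per domain without storing every record."""
    counts = Counter()
    for _date, _ip, domain in iter_queries(lines):
        counts[domain] += 1
    return counts
```

Because an open file is itself an iterable of lines, it can be passed in directly, for example count_domains(log) inside a with open("internet.log") as log: block. Memory use then grows with the number of distinct domains rather than with the size of the file.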

Daniel Roseman

You're reading the whole file into memory with readlines.

You can read one line at a time by iterating over the file object directly: for items in file.

Cleaning up your code a little, with better variable names and a list comprehension to build the result (this assumes the helper functions are changed to return a value instead of appending to a list):

with open("internet.log") as log:
    finalResults = [[getDate(line), getIPAddress(line), getDomainsVisited(line)]
                    for line in log
                    if 'query[A]' in line]

I would extract the result to a function:

def parse_log_line(line):
    return [getDate(line),
            getIPAddress(line),
            getDomainsVisited(line)]

Then your code would be:

with open("internet.log") as log:
    finalResults = [parse_log_line(line)
                    for line in log
                    if 'query[A]' in line]
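
The helper functions are not shown in the question, so the versions below are only a guess at what value-returning variants might look like. They assume every query[A] line has the same whitespace-separated layout as the sample log (timestamp, process, "query[A]", domain, "from", client address).

```python
# Hypothetical helpers: the originals are not shown in the question, so these
# are assumptions based on the sample log format, written to return a value.

def getDate(line):
    # The first three whitespace-separated fields, e.g. "Jun 15 16:26:21".
    return " ".join(line.split()[:3])


def getIPAddress(line):
    # On a query[A] line the last field is the client address.
    return line.split()[-1]


def getDomainsVisited(line):
    # The queried domain is the field right after "query[A]".
    fields = line.split()
    return fields[fields.index("query[A]") + 1]


def parse_log_line(line):
    return [getDate(line),
            getIPAddress(line),
            getDomainsVisited(line)]
```

Splitting on whitespace keeps the helpers short, but they will raise an error on a line that does not contain "query[A]", so they should only be called after the 'query[A]' in line check.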
Peter Wood