Taking Information From Line Iteration Using Regex and Storing Data in Nested Dictionaries

Question

Overview

I am trying to read a file line by line, extract the data with regex, and store it in a dictionary of dictionaries (nested dictionaries) with this format:

{
IP1: {
    Message1: {count},
    Message2: {count},
    Message3: {count},
    etc.
    },
IP2: {
    Message1: {count},
    Message2: {count},
    Message3: {count},
    etc.
    },
etc.
}

I am stuck on taking this information and placing it into a dictionary of dictionaries with a count for each message.

I have a basic working prototype that stores {IP1: {message1}, IP2: {message2}, etc.}, but it does not store additional messages if the number (IP) is already in the dictionary. I am using regex to filter out the noise and keep only the signal (what I need), which works without any problems. What I am stuck on is the data-structure portion and getting a count for each message.

What I have tried

Create nested dictionary on the fly in Python

https://www.youtube.com/watch?v=ygRINYibL74

How do you create nested dict in Python?

https://www.youtube.com/watch?v=K8L6KVGG-7o

https://www.youtube.com/watch?v=c9HbsUSWilw

https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285

https://docs.python.org/2/library/re.html

I also tried creating a pandas array and a NumPy array, but that did not work out for me either. I used many other sources that I did not bookmark and cannot remember them all.

Sample Code

import re

def cleanData(rawData):
     cleanIP = ipSubstitution(rawData)
     cleanPacket = packetSubstitution(cleanIP)
     cleanMessage = messageSubstitution(cleanPacket)
     cleanRepeats = repeatedSubstition(cleanMessage)
     cleanHexadecimal = hexadecimalSubstition(cleanRepeats)
     cleanTime = removeTime(cleanHexadecimal)
     return cleanTime

def ipSubstitution(ip):
     ip_exchange = re.compile(r"[a-z]{2,4}=[0-9]{0,3}\.[0-9]{0,3}\.[0-9]{0,3}\.[0-9]{0,3}")
     replaceIP = "XXX.XXX.XXX.XXX"
     ipReturn = re.sub(ip_exchange, replaceIP, ip)
     return ipReturn

def packetSubstitution(packet):
     packetSubstitute = re.compile(r"[a-z]{2,4}_[a-z]{2,4}_[a-z]{2,4}_[a-z]{2,4}_[a-z]{2,4}")
     replacePacket = "XX_XX_XXX_XXXX_XXX"
     packetReturn = re.sub(packetSubstitute, replacePacket, packet)
     return packetReturn

def messageSubstitution(message):
     messageExchange = re.compile(r"[a-z]{0,5}_[a-z]{0,5}_[a-z]{0,5}_[a-z]{0,5}_[a-z]{0,5}.+")
     replaceMessage = "XX_XXX_XXXXX_XXXXX_XXXXX"
     messageReturn = re.sub(messageExchange, replaceMessage, message)
     return messageReturn

def repeatedSubstition(repeatedMessage):
     repeatedExchange = re.compile(r"[a-z]{7}\s[a-z]{8}.+")
     replaceRepeated = "XXXXXXX"
     repeatedReturn = re.sub(repeatedExchange, replaceRepeated, repeatedMessage)
     return repeatedReturn

def hexadecimalSubstition(hexadecimalMessage):
     hexadecimalExchange = re.compile(r"[a-z0-9]{12}\s[ok]{2}")
     replaceHexadecimal = "XXXXXXXXXXXX"
     hexadecimalReturn = re.sub(hexadecimalExchange, replaceHexadecimal, hexadecimalMessage)
     return hexadecimalReturn

def removeTime(timeRemoval):
     timeExchange = re.compile(r"([A-Z][a-z][a-z]\s\d{2})+\s+(\d{2}\:\d{2}\:\d{2})+\s")
     replaceTime = ""
     timeReturn = re.sub(timeExchange, replaceTime, timeRemoval)
     return timeReturn

def matchExpression(analysisDict,rawText):
     d = analysisDict
     regexMatch = re.compile(
     r"(?P<IP_Address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:)+\s" +
     "(?P<Message>.+)")
     matchedText = re.match(regexMatch, rawText)
     for _ in rawText:
          count = 0
          ip = matchedText.group("IP_Address")
          message = matchedText.group("Message")
          d.setdefault(ip, {})["Message"] = message
          d[ip][message] = count + 1
          # wanted to add the count here, but got an error saying
          # a string cannot hold a variable
          for doe in d:
               if ip == d[doe]:
                    for ray in d[doe]:
                         if message == d[doe][ray]:
                              d[doe][ray] = count + 1

analysisDict = {}

with open('text_file', 'r') as textFile:
    for line in textFile:
        # cleanData is a function calling other functions to
        # clear away unneeded text
        matchExpression(analysisDict, cleanData(line))

I am stuck on figuring out how to add more than one message to each number, and how to add a count to each message.

If anyone could please help me out, I would be extremely grateful.

Update 8/18/2019

Sample input

Jan 29 05:23:11 11.111.11.111: devmgmt: ah_tv_alg_proc_pkt: Packet(src=10.45.27.220, dst=69.25.139.140) dropped
Jan 29 05:23:11 11.111.11.111: devmgmt: ah_tv_alg_proc_pkt: Packet(src=10.45.27.89, dst=69.25.139.140) dropped
Jan 29 05:23:11 22.222.22.222: system: Fetch shared memory 1000 of size 182224, address at 0xb6254000
Jan 29 05:23:11 11.111.11.111: message repeated 8 times
Jan 29 05:23:11 11.111.11.111: devmgmt: ah_tv_alg_proc_pkt: Packet(src=10.45.27.231, dst=69.25.139.140) dropped
Jan 29 05:23:11 22.222.22.222: security: Admin "<admin>" successfully logged in
Jan 29 05:23:11 11.111.11.111: message repeated 15 times
Jan 29 05:23:11 11.111.11.111: devmgmt: ah_tv_alg_proc_pkt: Packet(src=10.45.27.220, dst=69.25.139.140) dropped
Jan 29 05:23:11 11.111.11.111: devmgmt: ah_tv_alg_proc_pkt: Packet(src=10.45.27.231, dst=69.25.139.140) dropped
Jan 29 05:23:11 11.111.11.111: devmgmt: ah_tv_alg_proc_pkt: Packet(src=10.45.27.217, dst=69.25.139.140) dropped
Jan 29 05:23:11 11.111.11.111: devmgmt: ah_tv_alg_proc_pkt: Packet(src=10.45.27.231, dst=69.25.139.140) dropped
Jan 29 05:23:11 22.222.22.222: security: admin:<save config current http://11.11.130.145:0000/run-config>

After Regex

"IP1: Message"
"IP2: Message"
"IP1: different message"
"IP2: different message"
etc.
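
For reference, here is a minimal sketch of how one raw line from the sample input could be turned into an (IP, message) pair. The pattern `LINE_RE` and the helper name `parse_line` are assumptions made for illustration, not the code from my program, and the sketch skips the masking step entirely.

```python
import re

# Hypothetical pattern: syslog-style timestamp, then "IP: message".
LINE_RE = re.compile(
    r"^[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+"
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):\s+"
    r"(?P<message>.+)$"
)

def parse_line(line):
    """Return (ip, message) for a matching line, or None otherwise."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.group("ip"), match.group("message")

sample = (
    "Jan 29 05:23:11 22.222.22.222: system: "
    "Fetch shared memory 1000 of size 182224, address at 0xb6254000"
)
print(parse_line(sample))
```

Returning None for non-matching lines lets the caller skip noise instead of failing on `match.group`.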

What I need help with

Storing more than one message per IP address (currently holds just one)

 for _ in rawText:
      count = 0
      ip = matchedText.group("IP_Address")
      message = matchedText.group("Message")
      d.setdefault(ip, {})["Message"] = message
      d[ip][message] = count + 1
      # wanted to add the count here, but got an error saying
      # a string cannot hold a variable

Making a sub dictionary per message and storing their individual counts

      for doe in d:
           if ip == d[doe]:
                for ray in d[doe]:
                     if message == d[doe][ray]:
                          d[doe][ray] = count + 1
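
A sketch of how the two steps above might be combined using plain dictionaries only: `setdefault` creates the inner dict the first time an IP is seen, and `dict.get` with a default of 0 supplies the running count. The function name `add_message` and the sample pairs are made up for illustration.

```python
def add_message(d, ip, message):
    """Increment the count for `message` under `ip`, creating entries as needed."""
    inner = d.setdefault(ip, {})  # inner dict maps message -> count
    inner[message] = inner.get(message, 0) + 1

d = {}
pairs = [
    ("11.111.11.111", "message repeated 8 times"),
    ("22.222.22.222", "system: Fetch shared memory"),
    ("11.111.11.111", "message repeated 8 times"),
]
for ip, message in pairs:
    add_message(d, ip, message)

print(d)
```

Because the count is read back from the dictionary each time, it is not reset to zero on every iteration the way a local `count = 0` would be.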

Expected Output

{
Number1: {
    message1: {count1}, 
    message2: {count2}
    },
Number2: {
    message1: {count1}, 
    message2: {count2}
    },
etc.
}

Actual Output

{'xxx.xxx.xxx.xxx:': {'Message': '...: XX_XX_XXX_XXXX_XXX: Packet(XXX.XXX.XXX.XXX, XXX.XXX.XXX.XXX) dropped'}, 'xxx.xxx.xxx.xxx:': {'Message': '...: admin:<save config current http://xxx.xxx.xxx.xxx:xxxx/zzz-zzzzzz>'}, 'xxx.xxx.xxx.xxx:': {'Message': '...: arbitrator ip is updated from xxx.xxx.xxx.xxx to xxx.xxx.xxx.xxx'}, 'xxx.xxx.xxx.xxx:': {'Message': '[...]: XX_XXX_XXXXX_XXXXX_XXXXX'}, 'xxx.xxx.xxx.xxx:': {'Message': 'application: get ... cancel.'}}

Final Update

After watching some more videos, I just used .split(":", 1) on each line when appending to an array, then created a pandas DataFrame and added one final column with the count. From there, I did the necessary analysis for the problem I was trying to solve. Hope this helps anyone else in the future!
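
A rough sketch of the `.split(":", 1)` step, assuming each cleaned line has the form `IP: message`. The pandas part is not reproduced here; `collections.Counter` over `(ip, message)` tuples stands in for the count column, and the sample lines are invented.

```python
from collections import Counter

cleaned_lines = [
    "11.111.11.111: devmgmt: packet dropped",
    "22.222.22.222: security: admin logged in",
    "11.111.11.111: devmgmt: packet dropped",
]

rows = []
for line in cleaned_lines:
    # maxsplit=1 splits only at the first colon, so colons inside the message survive.
    ip, message = line.split(":", 1)
    rows.append((ip.strip(), message.strip()))

row_counts = Counter(rows)
for (ip, message), count in row_counts.items():
    print(ip, message, count)
```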

Right now it looks like your _atom_ is `"Number(\d+):Message([^"]*)"` Can you verify that ? Look man, nobody is going to help you unless you show an actual input sample .. — , Aug 15 '19 at 18:26
I edited it with a clearer output. The problem is not so much in my Regex, as I know it is getting the proper data, my issue is the storing of the data into a dictionary of dictionaries. — grice_d, Aug 15 '19 at 19:03
I just showed you the _atom_ concept. It's the smallest piece of data you need to extract at a time to make a complete entry in, _I don't care how many quadruple nested array's you got_. Forget about the regex, focus on the data you need to make a complete single data transaction. But, the minimum, needed. Trust me, arrays are simple. You will probably get the data in a loop, where you process each transaction as it matches. Or, you can get all the matches _findall()_ and process them in a loop later. — , Aug 15 '19 at 19:09
Ah okay, I apologize I did not know what an atom was (thank you for clarifying). I do get all of the data when appending to arrays, but I am trying to make dictionaries as they seem to be easier for data accessing, and analyzing. Versus an array, where you would have to match NumberX to MessageY (if using two different arrays). I am thinking in the long run, which is why I was hoping to get the count of every message stored into a dictionary, which matches the number. — grice_d, Aug 15 '19 at 19:22
I found this hard to understand without the complete (executable) code and example input it can be run on. You have some prose talking about numbers and messages, then some code with IP addresses, "doe"s and "ray"s, then sample output (formatted on one line!) with strings not appearing anywhere else in the question... Also can you maybe simplify the code to focus on the bits you need help with? — Nickolay, Aug 16 '19 at 22:56
@Nickolay I updated the post with sample input and hopefully easier format. — grice_d, Aug 18 '19 at 05:06
Thanks, it's clearer now, though the code still can't be run, so I had to guess the bit you were missing; you could get a better answer if you provided the actual code and the actual error you were getting. — Nickolay, Aug 18 '19 at 15:06

score 0 · Accepted Answer · answered Aug 18 '19 at 15:05


I think you're looking for this:

import collections

TEST_DATA = [
    ("ip1", "message1"),
    ("ip1", "message2"),
    ("ip2", "message3"),
    ("ip1", "message1"),
    ("ip2", "message3"),
]

# This dict maps IPs to Counter objects, with each "counter"
# mapping messages for the given IP to their counts.
counts = collections.defaultdict(collections.Counter)

for ip, msg in TEST_DATA:  # the loop to get the ips and messages is different in your case
    counts[ip][msg] += 1

# If you want, you can convert to plain dicts:
print({ ip: dict(msg_counts) for ip, msg_counts in counts.items() })
# Prints:
#   {'ip1': {'message1': 2, 'message2': 1}, 'ip2': {'message3': 2}}

answered Aug 18 '19 at 15:05

Nickolay


Thank you for the comment, I think this helps with the counter problem. Do I need to store the data in an array, then create a new function which, using the regex matches (in my matchedText.group), does the counting? That is where I am having the biggest issue. I know the regex captures my expressions, but I am having difficulty storing them to get the counts. I updated the sample code with what I have in my program, and gave an excerpt of the raw data. – grice_d Aug 19 '19 at 12:47
If you managed to get the correct `(ip,msg)` pairs (so that you can print something like the "After Regex" section of your Q), you can count them by running the `counts[ip][msg] += 1` line from my answer whenever you get a new pair, assuming you've initialized `counts` earlier. – Nickolay Aug 19 '19 at 16:48
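
Putting that comment into a loop might look like the following sketch. It assumes the lines already have the "After Regex" form (`IP: message`); the pattern `PAIR_RE`, the `count_messages` name and the sample lines are illustrative, not taken from the asker's program.

```python
import collections
import re

# Assumed shape of a cleaned line: "<ip>: <message>"
PAIR_RE = re.compile(r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):\s+(?P<msg>.+)")

def count_messages(lines):
    """Map each IP to a Counter of its messages."""
    # Initialized once, before the loop, so counts accumulate across lines.
    counts = collections.defaultdict(collections.Counter)
    for line in lines:
        match = PAIR_RE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed shape
        counts[match.group("ip")][match.group("msg")] += 1
    return counts

lines = [
    "11.111.11.111: message repeated 8 times",
    "22.222.22.222: security: login ok",
    "11.111.11.111: message repeated 8 times",
    "11.111.11.111: devmgmt: dropped",
]
counts = count_messages(lines)
print({ip: dict(c) for ip, c in counts.items()})
```

No intermediate array is needed in this version: each pair is counted as soon as it is matched.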