Overview
I am trying to read a file line by line and use regex to split the data into a dictionary of dictionaries (nested dictionaries), where the format is
{
    IP1: {
        Message1: {count},
        Message2: {count},
        Message3: {count},
        etc.
    },
    IP2: {
        Message1: {count},
        Message2: {count},
        Message3: {count},
        etc.
    },
    etc.
}
I am stuck on taking this information and placing it into a dictionary of dictionaries with a count for each message.
I have a basic working prototype that stores {IP1: {message1}, IP2: {message2}, etc.}, but it does not store additional messages if the IP address (number) is already in the dictionary. I am using regex to filter out the noise and keep only the signal (what I need), and that part works without any problems. What I am tied up on is the data-structure portion and getting a count for the messages.
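For reference, the target structure described above can be sketched with the standard library alone. This is only a minimal sketch under assumptions: the `pairs` list is made-up data standing in for the (IP, message) pairs that the regex step produces, and `count_messages` is a hypothetical helper name, not something from my code.

```python
from collections import Counter, defaultdict

def count_messages(pairs):
    """Build {ip: {message: count}} from an iterable of (ip, message) pairs."""
    counts = defaultdict(Counter)  # a missing IP gets an empty Counter automatically
    for ip, message in pairs:
        counts[ip][message] += 1   # a missing message starts at 0
    # Convert to plain dicts so the result prints like an ordinary nested dict
    return {ip: dict(messages) for ip, messages in counts.items()}

# Made-up pairs (assumption), standing in for the output of the regex step
pairs = [
    ("IP1", "Message1"),
    ("IP2", "Message1"),
    ("IP1", "Message2"),
    ("IP1", "Message1"),
]
result = count_messages(pairs)
print(result)
```

Using `defaultdict(Counter)` avoids any explicit "is this IP already in the dictionary?" check, which is the step the prototype gets stuck on.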
What I have tried
Create nested dictionary on the fly in Python
https://www.youtube.com/watch?v=ygRINYibL74
How do you create nested dict in Python?
https://www.youtube.com/watch?v=K8L6KVGG-7o
https://www.youtube.com/watch?v=c9HbsUSWilw
https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
https://docs.python.org/2/library/re.html
I also tried creating a pandas array and a NumPy array, but neither worked out for me. I used many other sources that I did not bookmark, so I cannot remember them all.
Sample Code
import re

def cleanData(rawData):
    # Run each substitution in turn, feeding the result into the next one
    cleanIP = ipSubstitution(rawData)
    cleanPacket = packetSubstitution(cleanIP)
    cleanMessage = messageSubstitution(cleanPacket)
    cleanRepeats = repeatedSubstition(cleanMessage)
    cleanHexadecimal = hexadecimalSubstition(cleanRepeats)
    cleanTime = removeTime(cleanHexadecimal)
    return cleanTime
def ipSubstitution(ip):
    # Mask "src=a.b.c.d" / "dst=a.b.c.d"; dots are escaped so they match a literal "."
    ip_exchange = re.compile(r"[a-z]{2,4}=[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}")
    replaceIP = "XXX.XXX.XXX.XXX"
    ipReturn = re.sub(ip_exchange, replaceIP, ip)
    return ipReturn

def packetSubstitution(packet):
    # Mask identifiers made of five underscore-separated words, e.g. ah_tv_alg_proc_pkt
    packetSubstitute = re.compile(r"[a-z]{2,4}_[a-z]{2,4}_[a-z]{2,4}_[a-z]{2,4}_[a-z]{2,4}")
    replacePacket = "XX_XX_XXX_XXXX_XXX"
    packetReturn = re.sub(packetSubstitute, replacePacket, packet)
    return packetReturn

def messageSubstitution(message):
    # Mask five underscore-separated words and everything after them on the line
    messageExchange = re.compile(r"[a-z]{0,5}_[a-z]{0,5}_[a-z]{0,5}_[a-z]{0,5}_[a-z]{0,5}.+")
    replaceMessage = "XX_XXX_XXXXX_XXXXX_XXXXX"
    messageReturn = re.sub(messageExchange, replaceMessage, message)
    return messageReturn

def repeatedSubstition(repeatedMessage):
    # Mask "message repeated N times" lines
    repeatedExchange = re.compile(r"[a-z]{7}\s[a-z]{8}.+")
    replaceRepeated = "XXXXXXX"
    repeatedReturn = re.sub(repeatedExchange, replaceRepeated, repeatedMessage)
    return repeatedReturn

def hexadecimalSubstition(hexadecimalMessage):
    # Mask a 12-character lowercase alphanumeric token followed by "ok"
    hexadecimalExchange = re.compile(r"[a-z0-9]{12}\s[ok]{2}")
    replaceHexadecimal = "XXXXXXXXXXXX"
    hexadecimalReturn = re.sub(hexadecimalExchange, replaceHexadecimal, hexadecimalMessage)
    return hexadecimalReturn

def removeTime(timeRemoval):
    # Strip the leading timestamp, e.g. "Jan 29 05:23:11 "
    timeExchange = re.compile(r"([A-Z][a-z][a-z]\s\d{2})+\s+(\d{2}\:\d{2}\:\d{2})+\s")
    replaceTime = ""
    timeReturn = re.sub(timeExchange, replaceTime, timeRemoval)
    return timeReturn

def matchExpression(analysisDict, rawText):
    d = analysisDict
    regexMatch = re.compile(
        r"(?P<IP_Address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:)+\s" +
        r"(?P<Message>.+)")
    matchedText = re.match(regexMatch, rawText)
    for _ in rawText:
        count = 0
        ip = matchedText.group("IP_Address")
        message = matchedText.group("Message")
        d.setdefault(ip, {})["Message"] = message
        d[ip][message] = count + 1
        # wanting to add count here, but I get an error regarding
        # string can not hold variable
        for doe in d:
            if ip == d[doe]:
                for ray in d[doe]:
                    if message == d[doe][ray]:
                        d[doe][ray] = count + 1

analysisDict = {}
with open('text_file', 'r') as textFile:
    for line in textFile:
        # cleanData is a function calling other functions to
        # clear away unneeded text
        matchExpression(analysisDict, cleanData(line))
I am stuck on how to store more than one message per IP address (number) and how to attach a count to each message.
If anyone could help me out, I would be extremely grateful.
Update 8/18/2019
Sample input
Jan 29 05:23:11 11.111.11.111: devmgmt: ah_tv_alg_proc_pkt: Packet(src=10.45.27.220, dst=69.25.139.140) dropped
Jan 29 05:23:11 11.111.11.111: devmgmt: ah_tv_alg_proc_pkt: Packet(src=10.45.27.89, dst=69.25.139.140) dropped
Jan 29 05:23:11 22.222.22.222: system: Fetch shared memory 1000 of size 182224, address at 0xb6254000
Jan 29 05:23:11 11.111.11.111: message repeated 8 times
Jan 29 05:23:11 11.111.11.111: devmgmt: ah_tv_alg_proc_pkt: Packet(src=10.45.27.231, dst=69.25.139.140) dropped
Jan 29 05:23:11 22.222.22.222: security: Admin "<admin>" successfully logged in
Jan 29 05:23:11 11.111.11.111: message repeated 15 times
Jan 29 05:23:11 11.111.11.111: devmgmt: ah_tv_alg_proc_pkt: Packet(src=10.45.27.220, dst=69.25.139.140) dropped
Jan 29 05:23:11 11.111.11.111: devmgmt: ah_tv_alg_proc_pkt: Packet(src=10.45.27.231, dst=69.25.139.140) dropped
Jan 29 05:23:11 11.111.11.111: devmgmt: ah_tv_alg_proc_pkt: Packet(src=10.45.27.217, dst=69.25.139.140) dropped
Jan 29 05:23:11 11.111.11.111: devmgmt: ah_tv_alg_proc_pkt: Packet(src=10.45.27.231, dst=69.25.139.140) dropped
Jan 29 05:23:11 22.222.22.222: security: admin:<save config current http://11.11.130.145:0000/run-config>
After Regex
"IP1: Message"
"IP2: Message"
"IP1: different message"
"IP2: different message"
etc.
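To make the "IP: Message" shape concrete, here is a small sketch that pulls the IP and the message out of one of the sample lines above. It is not my actual pipeline (which also masks addresses and packet names); `LINE_RE` and `parse_line` are hypothetical names, and the pattern assumes every line looks like the sample input.

```python
import re

# Optional timestamp such as "Jan 29 05:23:11 ", then the IP, a colon,
# and the rest of the line as the message
LINE_RE = re.compile(
    r"^(?:[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+)?"
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):\s*"
    r"(?P<message>.+)$"
)

def parse_line(line):
    """Return (ip, message) for a log line, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.group("ip"), match.group("message")

# One of the sample input lines from this question
sample = 'Jan 29 05:23:11 22.222.22.222: security: Admin "<admin>" successfully logged in'
parsed = parse_line(sample)
print(parsed)
```

Returning `None` for lines that do not match lets the caller skip them instead of failing on `.group()`.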
What I need help with
Storing more than one message per IP address (currently holds just one)
    for _ in rawText:
        count = 0
        ip = matchedText.group("IP_Address")
        message = matchedText.group("Message")
        d.setdefault(ip, {})["Message"] = message
        d[ip][message] = count + 1
        # wanting to add count here, but I get an error regarding
        # string can not hold variable
Making a sub dictionary per message and storing their individual counts
    for doe in d:
        if ip == d[doe]:
            for ray in d[doe]:
                if message == d[doe][ray]:
                    d[doe][ray] = count + 1
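For comparison, here is one possible way to handle both points with plain dictionaries. Treat it as a sketch, not a confirmed fix: it assumes each cleaned line has the form "IP: message", it moves the colon outside the IP group (my original pattern keeps it inside), and the input lines are made up. It drops the `for _ in rawText` loop, which runs once per character of the string, and it uses the message itself as the inner key instead of the fixed key "Message", which gets overwritten every time.

```python
import re

# Based on the pattern in the question, compiled once; the colon is moved
# outside the IP group (an assumption about what is wanted)
REGEX_MATCH = re.compile(
    r"(?P<IP_Address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}):\s+(?P<Message>.+)"
)

def matchExpression(analysisDict, rawText):
    """Add one cleaned "IP: message" line to analysisDict as {ip: {message: count}}."""
    matchedText = REGEX_MATCH.match(rawText)
    if matchedText is None:  # skip lines that do not look like "IP: message"
        return
    ip = matchedText.group("IP_Address")
    message = matchedText.group("Message")
    messages = analysisDict.setdefault(ip, {})         # inner dict for this IP
    messages[message] = messages.get(message, 0) + 1   # increment this message's count

# Made-up cleaned lines (assumption), one call per line
analysisDict = {}
for cleaned in [
    "11.111.11.111: message A",
    "22.222.22.222: message B",
    "11.111.11.111: message A",
    "11.111.11.111: message C",
]:
    matchExpression(analysisDict, cleaned)
print(analysisDict)
```

`dict.get(message, 0)` supplies the starting count the first time a message is seen, so no separate `count` variable is needed.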
Expected Output
{
    Number1: {
        message1: {count1},
        message2: {count2}
    },
    Number2: {
        message1: {count1},
        message2: {count2}
    },
    etc.
}
Actual Output
{
    'xxx.xxx.xxx.xxx:': {'Message': '...: XX_XX_XXX_XXXX_XXX: Packet(XXX.XXX.XXX.XXX, XXX.XXX.XXX.XXX) dropped'},
    'xxx.xxx.xxx.xxx:': {'Message': '...: admin:<save config current http://xxx.xxx.xxx.xxx:xxxx/zzz-zzzzzz>'},
    'xxx.xxx.xxx.xxx:': {'Message': '...: arbitrator ip is updated from xxx.xxx.xxx.xxx to xxx.xxx.xxx.xxx'},
    'xxx.xxx.xxx.xxx:': {'Message': '[...]: XX_XXX_XXXXX_XXXXX_XXXXX'},
    'xxx.xxx.xxx.xxx:': {'Message': 'application: get ... cancel.'}
}
Final Update
After watching some more videos, I ended up using the .split(":", 1) method on each line while appending to an array, and then created a pandas DataFrame with one final column holding the count. From there, I did the necessary analysis for the problem I was trying to solve. Hope this helps anyone else in the future!
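A rough sketch of the splitting step, for anyone landing here later. It is not my exact code: the cleaned lines are made up, `split_line` is a hypothetical helper name, and `collections.Counter` is swapped in for the DataFrame's count column so the example runs with the standard library only. It also assumes every line contains at least one colon.

```python
from collections import Counter

def split_line(cleaned_line):
    """Split "IP: message" at the first colon only, so colons in the message survive."""
    ip, message = cleaned_line.split(":", 1)
    return ip.strip(), message.strip()

# Made-up cleaned lines (assumption), with the timestamp already removed
cleaned_lines = [
    "11.111.11.111: devmgmt: packet dropped",
    "22.222.22.222: security: admin logged in",
    "11.111.11.111: devmgmt: packet dropped",
]

rows = [split_line(line) for line in cleaned_lines]
pair_counts = Counter(rows)  # stands in for the count column of the DataFrame

for (ip, message), count in pair_counts.items():
    print(ip, message, count, sep=" | ")
```

The second argument to `split` matters here: without the `1`, a message such as "devmgmt: packet dropped" would be split into further pieces.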