
We have been given a PCAP file, and the assignment is:

The user of the host PC tried to access some suspected website whose domain name ends with .top. Use Python (with the help of Regular Expression) to find the susceptible website.

By opening the PCAP file in Notepad and searching it with Ctrl + F, I have already found the correct answer: http://p27dokhpz2n7nvgr.1jw2lx.top

However, that is not the purpose of the assignment: I have to use Python and a regular expression to return that website.

The code I have tried so far is:

import re

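# PCAP files are binary, so open in 'rb' mode and close the file once it is read
with open('CyberSecurity2019.pcap', 'rb') as pcapfile:
    data = pcapfile.read()

# Bytes pattern: a run of non-whitespace bytes ending in ".top" at a word boundary
mypattern = re.compile(rb"\S+\.top\b")

x = mypattern.findall(data)

print("x = ", x)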

However this is what it returns:

x =  [b"c('_SS','R','20',0,'/');f=_w.top", b'g_triggerElems!==e&&(g_triggerElems[i].isHotSpotDisabled=!1);v=i+1,r=s[i],a=_ge("sc_hst"+v),a.style.left=r.locx+"%",a.style.top', b't=u.getBoundingClientRect(),o=t.width?Math.abs(t.right-t.left):t.width,a=s(u,"paddingLeft");o=o-(a?parseInt(a):0);v=t.height?Math.abs(t.bottom-t.top', b'n=document.getElementById(keyMap.Notification),t;n&&(n.parentNode.removeChild(n),t=document.getElementById("id_h"),t&&(t.style.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'http://p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'http://p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'http://p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'http://p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'http://p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'http://p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'http://p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'http://p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'http://p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'http://p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'http://p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'http://p27dokhpz2n7nvgr.1jw2lx.top', b'http://p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'http://p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'http://p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'http://p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'http://p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'http://p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'http://p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'http://p27dokhpz2n7nvgr.1jw2lx.top', 
b'p27dokhpz2n7nvgr.1jw2lx.top']

and that goes on and on for a while.

Any help in setting me on the right track would be appreciated.

Thank you

  • Escape the special chars that you want to be matched as literal chars, `re.compile(rb"\.top")`. Also, since you already compiled the regex object, use `x = mypattern.findall(pcapfile.read())`. Note that if you want to match a part before `.top`, use something like `rb'\S+\.top\b'` – Wiktor Stribiżew Mar 18 '19 at 10:39
  • Ok so I just implemented your suggestion and the return is x = [b"c('_SS','R','20',0,'/');f=_w.top", b'g_triggerElems!==e&&(g_triggerElems[i].isHotSpotDisabled=!1);v=i+1,r=s[i],a=_ge("sc_hst"+v),a.style.left=r.locx+"%",a.style.top', b't=u.getBoundingClientRect(),o=t.width?Math.abs(t.right-t.left):t.width,a=s(u,"paddingLeft");o=o-(a?parseInt(a):0);v=t.height?Math.abs(t.bottom-t.top', b'n=document.getElementById(keyMap.Notification),t;n&&(n.parentNode.removeChild(n),t=document.getElementById("id_h"),t&&(t.style.top', b'p27dokhpz2n7nvgr.1jw2lx.top', b'p27dokhpz2n7nvgr.1jw2lx.top',.... – SNIPERATI0N Mar 18 '19 at 10:43
  • No idea, you asked the question because the dot was not escaped, now, I do not know what the issue is. Please consider updating the question. – Wiktor Stribiżew Mar 18 '19 at 10:46
  • So the good news is that the code returns the website in question. Is there any way for the regular expression to filter out all the other junk and return only the website? I was thinking maybe there is a way to force it to return a match only if there is http:// at the beginning and .top at the end, or something like that. – SNIPERATI0N Mar 18 '19 at 10:49
  • If all links start with `http`, use `rb'https?://\S+?\.top\b'` – Wiktor Stribiżew Mar 18 '19 at 10:51
  • perfect thank you, now how do I mark this question as solved? It's not giving me the option to at the top right – SNIPERATI0N Mar 18 '19 at 10:54
  • I see you updated the question, I will reopen it. – Wiktor Stribiżew Mar 18 '19 at 10:56
  • Would you also be able to quickly explain how the different elements of your suggested pattern, `rb'https?://\S+?\.top\b'`, work? – SNIPERATI0N Mar 18 '19 at 10:58

1 Answer


Since all links you want to extract start with http or https, you may use

rb'https?://\S+?\.top\b'

Note that the r string literal prefix defines a raw string literal (so that all backslashes are treated as literal backslashes and not as part of string escape sequences), and the b prefix is necessary here because PCAP files are binary, so the pattern must also be a bytes object.
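The effect of the two prefixes can be checked directly. This is a minimal sketch, not part of the original code; the `data` bytes are a made-up sample, not taken from the PCAP file.

```python
import re

# Without the r prefix, \b inside a bytes literal is the backspace byte (0x08).
assert b"\b" == b"\x08"
# With the r prefix the backslash stays literal, so re receives the escape \b.
assert rb"\b" == b"\\b"

# Hypothetical sample bytes, standing in for data read from a binary file.
data = b"GET http://example.top HTTP/1.1"

# A bytes pattern is required when the searched data is bytes.
bytes_matches = re.findall(rb"\.top\b", data)

# A str pattern against bytes data raises TypeError.
try:
    re.findall(r"\.top\b", data)
    str_pattern_failed = False
except TypeError:
    str_pattern_failed = True
```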

Details

  • https?:// - http:// or https://
  • \S+? - 1 or more non-whitespace characters, as few as possible (the ? makes the quantifier lazy)
  • \.top - a .top substring (note the escaped dot, an unescaped dot matches any char other than a line break char in Python re)
  • \b - a word boundary (note that r prefix allows the use of a single backslash to define a regex escape, if you do not use r prefix, you would need to write it as \\b)
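Putting the pieces above together, here is a small sketch of how the pattern might be applied, with duplicates removed so the site is reported once. The `find_top_urls` helper and the `sample` bytes are illustrative assumptions, not part of the original code; `sample` only imitates the kind of content shown in the question's output.

```python
import re

# The pattern from the answer, compiled once as a bytes pattern.
URL_PATTERN = re.compile(rb"https?://\S+?\.top\b")


def find_top_urls(data):
    """Return the unique .top URLs found in data, in order of first appearance."""
    # dict.fromkeys drops duplicates while preserving insertion order.
    unique = dict.fromkeys(URL_PATTERN.findall(data))
    return [match.decode("ascii", errors="replace") for match in unique]


# Hypothetical stand-in for the file contents (not the real capture).
sample = (
    b"a.style.top=1;f=_w.top "
    b"Host: p27dokhpz2n7nvgr.1jw2lx.top\r\n"
    b"Referer: http://p27dokhpz2n7nvgr.1jw2lx.top/\r\n"
    b"Referer: http://p27dokhpz2n7nvgr.1jw2lx.top/\r\n"
)

urls = find_top_urls(sample)
print(urls)  # ['http://p27dokhpz2n7nvgr.1jw2lx.top']
```

For the real file, the same helper could be called on the bytes returned by reading 'CyberSecurity2019.pcap' opened in 'rb' mode. Requiring the http:// or https:// prefix is what filters out the JavaScript fragments such as a.style.top; bare hostnames without a scheme are skipped as well.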
Wiktor Stribiżew
  • @SNIPERATI0N Just [added the link](https://stackoverflow.com/questions/12871066/what-exactly-is-a-raw-string-regex-and-how-can-you-use-it) – Wiktor Stribiżew Mar 18 '19 at 11:02