What is the correct way of matching a string inside a csv file to log file?

Question

Given:

I have these strings in my sha1_vsdt.csv file, and I also have a trendx.log file.

These are the samples in my csv file:

--------------------SHA-1---------------|-----VSDT-----
3ecca1d4af42561676de09019ddc94a52b49efcc|MS Office 1-0,
3f99507159f62331af7dedafeaac9da47fd9338b|MS Office 1-0,
3fdd26300c7f86c1a24dd8b13e99d5d7abea0604|WIN32 EXE 7-2,
4016bf58ee14e73cc42d8de918c6547c6b3b8f42|MS Office 1-0,
0e13d281af08954102e7caf95864ef553c7277bd|WIN32 EXE 7-2,

And samples inside of my trendx.log file:

1537762040  0   1   1   1537733240  1537733240  1537733240  8224    98  88064   0e13d281af08954102e7caf95864ef553c7277bd    Troj.Win32.TRX.XXPE50FFF026 c:\users\administrator\desktop\downloader\download\     Troj.Win32.TRX.XXPE50FFF026    Administrator           0e13d281af08954102e7caf95864ef553c7277bd        ACIKwAgACIAIAQAAMQAAAAAAAABAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=

Task:

My task is to take each SHA-1 string in the SHA-1 column and look for its match in my trendx.log file. When it matches, the script should get the description and put it in the third column, like this:

--------------------SHA-1---------------|-----VSDT-----|-------MATCH--------
3ecca1d4af42561676de09019ddc94a52b49efcc|MS Office 1-0,|undetected
3f99507159f62331af7dedafeaac9da47fd9338b|MS Office 1-0,|undetected
3fdd26300c7f86c1a24dd8b13e99d5d7abea0604|WIN32 EXE 7-2,|undetected
4016bf58ee14e73cc42d8de918c6547c6b3b8f42|MS Office 1-0,|undetected
0e13d281af08954102e7caf95864ef553c7277bd|WIN32 EXE 7-2,|TRENDX  172.20.4.179

If it doesn't find a match, it should put undetected in the third column. I have no idea how to do this. I'm very new to Python, so any ideas would be very helpful.

Here are the full contents of my csv and log files:

sha1_vsdt.csv

trendx.log

Please show what you have implemented from your research around this problem. Presumably you've read both files in to some kind of structure? — roganjosh, Oct 02 '18 at 08:07
I can't find any research on matching strings in log files, so I'm asking here to get an idea of how to start my program — , Oct 02 '18 at 08:11
That is a specific detail of your problem. What you really want to do is read a csv file into a list, and read a text file into another list. Iterate through the lists, look for string matches at specific list indices with `==`, and append the result to a third list. Once you have that working, you can think of ways to make it more efficient with dictionaries. The point is, it's unlikely that a pre-made solution that fits your exact problem exists. You just need to break it down into individual steps, and there's tonnes of content and examples on how to read CSVs — roganjosh, Oct 02 '18 at 08:13
I see you finally figured out your actual problem. Perhaps this question should now be marked as a duplicate of https://stackoverflow.com/questions/52661863/how-to-delete-non-ascii-characters-in-a-text-file/52661986#52661986 or simply deleted. — tripleee, Oct 05 '18 at 09:07
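
The list-based approach from roganjosh's comment above could be sketched roughly as follows. This is only an illustration: the function name `match_rows` and the column positions (10 for the SHA-1, 11 for the description) are assumptions read off the single sample log line in the question, and fields are assumed to be separated by whitespace.

```python
import csv

def match_rows(csv_lines, log_lines, sha1_col=10, desc_col=11):
    """Return [sha1, vsdt, match] rows; match is 'undetected' when no log line has the hash.

    sha1_col and desc_col are assumed positions, taken from the sample log line.
    """
    # Read the log into a list of whitespace-split field lists.
    log_rows = [line.split() for line in log_lines]
    results = []
    for row in csv.reader(csv_lines, delimiter='|'):
        if len(row) < 2:
            continue  # skip blank or malformed CSV lines
        match = 'undetected'
        for fields in log_rows:
            # Compare the hash at a fixed index with ==, as the comment suggests.
            if len(fields) > desc_col and fields[sha1_col] == row[0]:
                match = fields[desc_col]
                break
        results.append([row[0], row[1], match])
    return results
```

Both arguments can be open file objects, since each yields one line at a time. The header row of the csv is treated like any other row here and would come out as undetected.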

score 0 · Answer 1 · answered Oct 02 '18 at 08:56

@jeremydevera, this should get you going. You need a loop that goes through the sha1_vsdt.csv file, and then a matching step: when a row's SHA-1 matches the string from the trendx log, use the value from the log.

See a mockup below:

import csv
import re

# One sample line from trendx.log; the raw string keeps the backslashes in the Windows path literal
trendx = r'1537762040  0   1   1   1537733240  1537733240  1537733240  8224    98  88064   0e13d281af08954102e7caf95864ef553c7277bd    Troj.Win32.TRX.XXPE50FFF026 c:\users\administrator\desktop\downloader\download\     Troj.Win32.TRX.XXPE50FFF026    Administrator           0e13d281af08954102e7caf95864ef553c7277bd        ACIKwAgACIAIAQAAMQAAAAAAAABAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA='
# Split on whitespace: index 10 is the SHA-1, index 11 is the detection name
textsearch = re.findall(r'\S+', trendx)

with open('sha1_vsdt.csv', 'rt', newline='') as f:
    reader = csv.reader(f, delimiter='|')
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        matched = 'undetected'
        if row[0] == textsearch[10]:
            matched = textsearch[11]
        print([row[0], row[1], matched])

Check your textsearch. The error seems to say that the result of the split does not have enough entries for textsearch[10]. — MEdwin, Oct 02 '18 at 09:11
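
One way to extend the mockup to the whole log file, while avoiding the index error discussed in the comment above, is to build a dictionary from the log first and skip lines that have too few fields. The following is a sketch under assumptions, not a tested solution for the real files: the names `build_lookup` and `annotate_files` are made up for illustration, the column positions 10 and 11 come from the sample line, and the log is assumed to be whitespace-separated text that may contain undecodable bytes.

```python
import csv

def build_lookup(log_lines, sha1_col=10, desc_col=11):
    """Map each SHA-1 found in the log to its description field (assumed columns)."""
    lookup = {}
    for line in log_lines:
        fields = line.split()
        if len(fields) <= max(sha1_col, desc_col):
            continue  # too few fields: indexing would raise IndexError
        # Keep the first description seen for a given hash.
        lookup.setdefault(fields[sha1_col], fields[desc_col])
    return lookup

def annotate_files(csv_path, log_path, out_path):
    """Write the csv rows to out_path with a third column: description or 'undetected'."""
    # errors='replace' keeps reading even if the log holds bytes that are not valid UTF-8.
    with open(log_path, encoding='utf-8', errors='replace') as log_file:
        lookup = build_lookup(log_file)
    with open(csv_path, newline='') as src, open(out_path, 'w', newline='') as dst:
        writer = csv.writer(dst, delimiter='|')
        for row in csv.reader(src, delimiter='|'):
            if len(row) < 2:
                continue  # skip blank or malformed CSV lines
            writer.writerow([row[0], row[1], lookup.get(row[0], 'undetected')])
```

A call such as `annotate_files('sha1_vsdt.csv', 'trendx.log', 'matched.csv')` would then produce the three-column file, where `matched.csv` is a hypothetical output name. The dictionary lookup replaces the inner loop over log lines, so each csv row costs a single lookup.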

score 0 · Answer 2 · answered Oct 02 '18 at 11:01

@jeremydevera, here is a simpler version. I have used pandas dataframes to load the csv and log files, and then merge to check whether there is any match.

import numpy as np
import pandas as pd

# Load the log data into a dataframe using genfromtxt
logdata = np.genfromtxt("trendx.log", delimiter="   ", invalid_raise=False, dtype=str, comments=None, usecols=np.arange(0, 24))
logframe = pd.DataFrame(logdata)
# Trim the dataframe to SHA1, PRG and IP only
df2 = logframe[[10, 14, 15]].rename(columns={10: 'SHA1', 14: 'PRG', 15: 'IP'})

# Load the sha1_vsdt data into a dataframe using read_csv
# (on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False)
df1 = pd.read_csv("sha1_vsdt.csv", delimiter="|", on_bad_lines='skip', engine='python', quoting=3)
# Left-merge on the SHA-1 column; rows without a match in the log get 'undetected'
df = pd.merge(df1, df2, left_on='--------------------SHA-1---------------', right_on='SHA1', how='left').fillna('undetected')
print(df[['--------------------SHA-1---------------', '-----VSDT-----', 'PRG', 'IP']])

.format(mask=objarr[mask])) KeyError: '[10 14 15] not in index' — , Oct 03 '18 at 02:43
And I get this for Line #1 through Line #113: (got 1 columns instead of 24) — , Oct 03 '18 at 02:43
@jeremydevera, all of these come from importing "trendx.log". It seems the file is not structured well enough to import into a dataframe. The message says that there are rows in the file where the import finds only 1 column, while it expects 24. That also causes the KeyError, because there is no column 10, 14 or 15 in which to find the SHA1, PRG and IP. Have a look at the file and see whether it needs cleaning or the source has some issues. — MEdwin, Oct 03 '18 at 09:40
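
To see which rows cause the "got 1 columns instead of 24" message, it may help to count the columns on each line before loading anything into a dataframe. The sketch below is an assumption-laden illustration: `column_count_report` is a made-up helper name, and the tab delimiter is a guess, so pass whatever separator the real trendx.log uses.

```python
from collections import Counter

def column_count_report(lines, delimiter='\t', expected=24):
    """Return (Counter of column counts, 1-based line numbers whose count is not `expected`)."""
    counts = Counter()
    bad_lines = []
    for number, line in enumerate(lines, start=1):
        # Strip only the line ending so that empty trailing fields still count.
        n_cols = len(line.rstrip('\r\n').split(delimiter))
        counts[n_cols] += 1
        if n_cols != expected:
            bad_lines.append(number)
    return counts, bad_lines
```

Calling it on an open file object, for example one opened with `errors='replace'`, gives a quick picture of how many lines have 24 columns and which line numbers need cleaning. If nearly every line reports a single column, the delimiter passed to the loader is probably not the one the file actually uses.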
