
Task:

I need to match the strings in the first column of my CSV file against a log file. If a string exists in the log, the matched data should go in the third column; otherwise the third column should contain "undetected".

Contents of my log file: trendx.log
Contents of my CSV file: sha1_vsdt.csv

Expected Output:

[Image: expected output]

Code:

So far I have tried this approach with pandas and NumPy, following somebody's advice:

import numpy as np
import pandas as pd
import csv

#Log data into dataframe using genfromtxt
logdata = np.genfromtxt("trendx.log", delimiter="   ",invalid_raise = False,dtype=str, comments=None,usecols=np.arange(0,24))
logframe = pd.DataFrame(logdata)
#Dataframe trimmed to use only SHA1, PRG and IP
df2=(logframe[[10,14,15]]).rename(columns={10:'SHA1', 14: 'PRG',15:'IP'})


#sha1_vsdt data into dataframe using read_csv
df1=pd.read_csv("sha1_vsdt.csv",delimiter=r"|",error_bad_lines=False,engine = 'python',quoting=3)
#Using merge to compare the two CSV
df = pd.merge(df1, df2, left_on='SHA-1', right_on='SHA1', how='left').replace(np.nan, 'undetected', regex=True)
print df[['SHA-1','VSDT','PRG','IP']]

Then I get this error:

Warning (from warnings module):
  File "C:\Users\Administrator\Desktop\OJT\match.py", line 6
    logdata = np.genfromtxt("trendx.log", delimiter="   ",invalid_raise = False,dtype=str, comments=None,usecols=np.arange(0,24))
ConversionWarning: Some errors were detected !

    Line #1 - #113 (got 1 columns instead of 24)

Traceback (most recent call last):
  File "C:\Users\Administrator\Desktop\OJT\match.py", line 9, in <module>
    df2=(logframe[[10,14,15]]).rename(columns={10:'SHA1', 14: 'PRG',15:'IP'})
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2682, in __getitem__
    return self._getitem_array(key)
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2726, in _getitem_array
    indexer = self.loc._convert_to_indexer(key, axis=1)
  File "C:\Python27\lib\site-packages\pandas\core\indexing.py", line 1327, in _convert_to_indexer
    .format(mask=objarr[mask]))
KeyError: '[10 14 15] not in index'
  • Could you try `logframe.iloc[: , [10,14,15]]`? – bitnahian Oct 03 '18 at 04:35
  • Still the same error –  Oct 03 '18 at 04:45
  • Check the dimensions of your `logframe` DataFrame. It may well be that those columns don't exist. Try `list(logframe)` to see the column headers and confirm whether those columns exist, or `logframe.shape` to check the dimensions. – bitnahian Oct 03 '18 at 04:48
  • I see you finally figured out your actual problem. Perhaps this question should now be marked as a duplicate of https://stackoverflow.com/questions/52661863/how-to-delete-non-ascii-characters-in-a-text-file/52661986#52661986 or simply deleted. – tripleee Oct 05 '18 at 09:07
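
The warning "got 1 columns instead of 24" suggests that the three-space delimiter never occurs in those lines, so each line is read as a single field. With invalid_raise=False such lines are skipped, which would presumably leave the DataFrame without columns 10, 14, and 15. A minimal sketch for checking this before calling np.genfromtxt, using only the standard library; the helper name `field_counts` and the sample lines are hypothetical:

```python
from collections import Counter


def field_counts(lines, delimiter=None):
    """Count how many lines split into each number of fields.

    delimiter=None splits on any run of whitespace, like str.split().
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        counts[len(line.split(delimiter))] += 1
    return counts


# Hypothetical sample: fields separated by single tabs, not three spaces.
sample = [
    "a\tb\tc\td",
    "e\tf\tg\th",
]

print(field_counts(sample, delimiter="   "))  # no line contains three spaces
print(field_counts(sample))                   # whitespace splitting finds the fields
```

On the real file, something like `field_counts(open("trendx.log"), delimiter="   ")` would show whether any line actually yields 24 fields with that delimiter.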

1 Answer


This code should work. You don't need to pass a delimiter to np.genfromtxt: by default it splits on any run of whitespace, which is probably what you want.

Also, the delimiter for pd.read_csv should be "," because it's a CSV file.

import numpy as np
import pandas as pd

# Load the log into a DataFrame; genfromtxt splits on whitespace by default
logdata = np.genfromtxt("trendx.log", invalid_raise=False, dtype=str,
                        comments=None, usecols=np.arange(0, 24))
logframe = pd.DataFrame(logdata)
# Keep only the SHA1, PRG and IP columns
df2 = logframe[[10, 14, 15]].rename(columns={10: 'SHA1', 14: 'PRG', 15: 'IP'})

# Load sha1_vsdt.csv into a DataFrame
# (error_bad_lines was replaced by on_bad_lines in pandas 1.3 and removed in 2.0)
df1 = pd.read_csv("sha1_vsdt.csv", delimiter=",", error_bad_lines=False,
                  engine='python', quoting=3)
# Left merge keeps every CSV row; rows without a match in the log become 'undetected'
df = pd.merge(df1, df2, left_on='SHA-1', right_on='SHA1',
              how='left').replace(np.nan, 'undetected', regex=True)
print(df[['SHA-1', 'VSDT', 'PRG', 'IP']])

This code produces the following output:

                                                 SHA-1      ...                   IP
0             0191a23ee122bdb0c69008971e365ec530bf03f5      ...           undetected
1             02b809d4edee752d9286677ea30e8a76114aa324      ...           undetected
2             0349e0101d8458b6d05860fbee2b4a6d7fa2038d      ...           undetected
3             035a7afca8b72cf1c05f6062814836ee31091559      ...           undetected
4             042065bec5a655f3daec1442addf5acb8f1aa824      ...           undetected
5             04939e040d9e85f84d2e2eb28343d94a50ed46ac      ...           undetected
6             04a1876724b53a016cd9e9c93735985938c91fa4      ...           undetected
7             06109df23f7d5deadf0b2c158af1f71c2997d245      ...           undetected
8             06194c240c12c51b55d2961ae287fd9628e05751      ...           undetected
9             0665de1ad83715cc6e68d00ed700c469944a5925      ...           undetected
10            067b448f4c9782489e5ff60c31c62b7059e500b2      ...           undetected
11            0688e6966b0e4a1f58d2f3de48f960fce5b42292      ...           undetected
12            0689f6f99d10dd8bf396f2d2c73ce9dcb6dcad23      ...           undetected
13            06a60c6018a42b1db22e3bf8620861711401c4bb      ...           undetected
14            0723a895a5f8b2d5d25b4303e9f04d16551791b6      ...           undetected
15            07344621cf4480c430f8931af2b2b056775af7e3      ...           undetected
16            07831df482f1a34310fc4f5a092c333eeaff4380      ...           undetected
17            08386105057cd5867480095696a5ca6701fdb8ad      ...           undetected
18            0ad5f62b4ec10397b7d13433a8dc794dc6d4f273      ...           undetected
19            0bed7d032d5c51f606befd2f10b94e5c75a6a1e3      ...        Administrator
20            0c3f8d2cce9e7a6e5604b8d0c9fbe1ff6fd5cebb      ...           undetected
21            0c793b4f4e0be7f24f93786d7d4a719a7a002a0d      ...           undetected
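
To make the intent of the how='left' merge concrete, here is a minimal sketch of the same idea in plain Python, using a dictionary lookup instead of pandas. It is an illustration, not the code that produced the output above: the function name `left_join` and the sample rows are hypothetical, and unlike pd.merge (which emits one row per matching log line) it keeps only the first log entry for each hash.

```python
def left_join(csv_rows, log_rows, missing="undetected"):
    """Attach PRG and IP from the log to each CSV row, matched on SHA-1."""
    # Index the log by hash; setdefault keeps the first entry seen for a hash.
    by_sha1 = {}
    for sha1, prg, ip in log_rows:
        by_sha1.setdefault(sha1, (prg, ip))

    joined = []
    for sha1, vsdt in csv_rows:
        # Every CSV row is kept; hashes absent from the log get the placeholder.
        prg, ip = by_sha1.get(sha1, (missing, missing))
        joined.append((sha1, vsdt, prg, ip))
    return joined


# Hypothetical data; real SHA-1 values are 40 hex characters.
csv_rows = [("aaa111", "PE"), ("bbb222", "DOC")]
log_rows = [("bbb222", "winword.exe", "10.0.0.5")]

for row in left_join(csv_rows, log_rows):
    print(row)
```

The first row has no match in the log and is filled with "undetected" in both added columns; the second row picks up the values from the log.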
bitnahian
  • Warning (from warnings module): File "C:\Users\Administrator\Desktop\OJT\match.py", line 6 logdata = np.genfromtxt("trendx.log",invalid_raise = False,dtype=str, comments=None,usecols=np.arange(0,24)) ConversionWarning: Some errors were detected ! Line #113 (got 1 columns instead of 24) –  Oct 03 '18 at 05:53
  • I'm really close to the solution, but unlike yours, my output does not show "Administrator". Where did I go wrong? –  Oct 03 '18 at 06:22
  • Might be the case that you're using python2 – bitnahian Oct 03 '18 at 06:27
  • I'm using Python 3: Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 17:54:52) [MSC v.1900 32 bit (Intel)] on win32, but it is still not working for some lines –  Oct 04 '18 at 06:31
  • What version do you use? –  Oct 04 '18 at 06:31
  • I used Python 3.7 – bitnahian Oct 04 '18 at 06:35
  • Sorry I don't remember :( – bitnahian Oct 04 '18 at 06:38
  • I've upgraded to Python 3.7, but why does it output "undetected" for every row? Please give a clear answer and I will mark it as accepted. –  Oct 04 '18 at 08:00
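
The thread does not show why every row came out as "undetected". One common cause of a merge that matches nothing is that the keys differ in ways that are invisible when printed: stray whitespace, different letter case, or non-ASCII characters such as a byte-order mark (an earlier comment points in that direction). A minimal sketch for checking this, assuming the keys are hexadecimal SHA-1 strings; `normalize_sha1` and the sample values are hypothetical:

```python
import string

HEX_DIGITS = set(string.hexdigits)


def normalize_sha1(raw):
    """Lowercase the value and drop every character that is not a hex digit."""
    return "".join(ch for ch in raw if ch in HEX_DIGITS).lower()


# Hypothetical keys: the same hash as it might appear in two different files.
from_csv = "\ufeff0BED7D03 \n"   # byte-order mark, upper case, trailing space
from_log = "0bed7d03"

print(repr(from_csv), repr(from_log))   # repr() exposes the hidden characters
print(from_csv == from_log)             # the raw values do not match
print(normalize_sha1(from_csv) == normalize_sha1(from_log))
```

Dropping characters can also hide genuinely corrupt data, so it is worth inspecting the repr() output first. In pandas, a comparable cleanup on a column would be along the lines of `df1['SHA-1'].str.strip().str.lower()` before merging.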