Hey, I'm really stuck and hope someone can help me with this. I'm trying to read the first 5000 lines of a CSV file, split each line on the tab delimiter, run a regex against every column of every line, and output the index of the column with the most regex matches. I'll provide an example to better explain what I mean.
test.csv
john smith 1132 Anywhere Lane Hoboken NJ 10.0.0.1 07030 Jan 4
erica meyers 1234 Smith Lane Hoboken NJ 127.0.0.1 07030 March 2
erica meyers 1234 Smith Lane Hoboken NJ 192.168.1.1 07030 april 5
This is where I am currently at (read the CSV, split into columns on the tab delimiter, print each row):
import csv
import re

NUM = 5000  # maximum number of lines to read

with open('test.csv', newline='', encoding='cp437', errors='ignore') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    for _ in range(NUM):
        try:
            line = next(reader)
        except StopIteration:  # file has fewer than NUM lines
            break
        print(line)
First few lines of current output:
['john smith', '1132 Anywhere Lane Hoboken NJ', '10.0.0.1', ' 07030', 'Jan 4']
['john smith', '1234 Smith Lane Hoboken NJ', '127.0.0.1', ' 07030', 'March 2']
['smith john', '1234 Smith Lane Hoboken NJ', '192.168.1.1', ' 07030', 'april 5']
Here is where I am stuck: I want to search the regex \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} against all columns on every line and output the index of the column that had the most regex matches.
For this example, \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} would match:
10.0.0.1
127.0.0.1
192.168.1.1
so my desired output would be:
2
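To show what I mean end to end, here is a minimal sketch of the counting logic I think I need: tally the number of regex matches per column index with a Counter, then take the column with the highest total. I'm using io.StringIO with the sample rows inlined as a stand-in for test.csv so it runs on its own; in the real script the reader would come from open('test.csv', ...) as above.

```python
import csv
import io
import re
from collections import Counter
from itertools import islice

# Stand-in for test.csv: three tab-delimited sample rows.
SAMPLE = (
    "john smith\t1132 Anywhere Lane Hoboken NJ\t10.0.0.1\t 07030\tJan 4\n"
    "john smith\t1234 Smith Lane Hoboken NJ\t127.0.0.1\t 07030\tMarch 2\n"
    "smith john\t1234 Smith Lane Hoboken NJ\t192.168.1.1\t 07030\tapril 5\n"
)

NUM = 5000  # only examine the first 5000 lines
ip_pattern = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

match_counts = Counter()  # column index -> total regex matches in that column
reader = csv.reader(io.StringIO(SAMPLE), delimiter='\t')
for row in islice(reader, NUM):
    for col_index, field in enumerate(row):
        match_counts[col_index] += len(ip_pattern.findall(field))

# Column index with the most matches overall.
best_column = match_counts.most_common(1)[0][0]
print(best_column)  # 2
```

On the sample data, column 2 accumulates three matches (the three IP addresses) and every other column gets zero, so this prints 2, which is my desired output.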