Hey, I'm really stuck and hope someone can help me with this. I'm trying to read the first 5000 lines of a CSV file, split each line on the tab delimiter, search every column of every line for a regex pattern, and output the index of the column with the most regex matches. Here is an example to better explain what I mean.

test.csv

john smith  1132 Anywhere Lane Hoboken NJ   10.0.0.1     07030  Jan 4
erica meyers    1234 Smith Lane Hoboken NJ  127.0.0.1    07030  March 2
erica meyers    1234 Smith Lane Hoboken NJ  192.168.1.1  07030  april 5

This is where I am currently at (read the CSV, split each line into columns on the tab delimiter, and print the lines):

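import csv
import re
from itertools import islice

Num = 5000  # maximum number of lines to read

with open('test.csv', newline='', encoding="cp437", errors='ignore') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    # islice stops cleanly if the file has fewer than Num lines
    for lines in islice(reader, Num):
        print(lines)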

First few lines of current output:

['john smith', '1132 Anywhere Lane Hoboken NJ', '10.0.0.1', ' 07030', 'Jan 4']
['erica meyers', '1234 Smith Lane Hoboken NJ', '127.0.0.1', ' 07030', 'March 2']
['erica meyers', '1234 Smith Lane Hoboken NJ', '192.168.1.1', ' 07030', 'april 5']

Here is where I am stuck...

I want to search every column on every line for the regex \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} and output the index of the column that had the most matches.

For this example, \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} would match:

10.0.0.1
127.0.0.1
192.168.1.1

So my desired output would be:

2
HannaWhite3

1 Answer

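You can do it with pandas like this:

import pandas as pd
# header=None because the sample file has no header row; dtype=str keeps every cell a string
df = pd.read_csv(path, nrows=5000, sep="\t", header=None, dtype=str)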

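Write a function to check whether the regex matches a cell.

import re

regex = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"  # the pattern from the question

def check_regex_matches(x):
    # str() guards against non-string cells such as NaN
    return bool(re.match(regex, str(x)))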

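Then you can apply it to a column to get a boolean Series marking the rows where the regex matches:

list_of_bools_where_regex_matches = df["some_col"].apply(check_regex_matches)

# Row labels of the matching rows in that column
df["some_col"][list_of_bools_where_regex_matches].index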
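The snippets above test one column at a time. To get the column index with the most matches, as the question asks, one option is to count matches per column and take the largest count. Here is a minimal sketch that swaps pandas for the standard library (csv, re, collections.Counter); the function name column_with_most_matches and the in-memory sample are only illustrative, and it assumes the file is tab-separated with no header row:

```python
import csv
import io
import re
from collections import Counter
from itertools import islice

IP_PATTERN = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")


def column_with_most_matches(lines, pattern=IP_PATTERN, max_rows=5000):
    """Return the 0-based index of the column with the most matching cells, or None."""
    counts = Counter()
    reader = csv.reader(lines, delimiter="\t")
    # islice stops quietly if the input has fewer than max_rows rows
    for row in islice(reader, max_rows):
        for index, cell in enumerate(row):
            if pattern.search(cell):
                counts[index] += 1
    if not counts:
        return None  # no cell matched anywhere
    return counts.most_common(1)[0][0]


# Sample rows from the question, tab-separated (stand-in for the real file)
sample = io.StringIO(
    "john smith\t1132 Anywhere Lane Hoboken NJ\t10.0.0.1\t 07030\tJan 4\n"
    "erica meyers\t1234 Smith Lane Hoboken NJ\t127.0.0.1\t 07030\tMarch 2\n"
    "erica meyers\t1234 Smith Lane Hoboken NJ\t192.168.1.1\t 07030\tapril 5\n"
)
print(column_with_most_matches(sample))  # prints 2
```

For a real file, you could pass the object returned by open('test.csv', newline='', encoding="cp437", errors='ignore') in place of the sample. This counts each matching cell once; if a cell can hold several addresses and each should count, len(pattern.findall(cell)) could be added instead.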

Note: Please check whether you need re.match (which only matches at the start of the string) or re.search (which matches anywhere in the string): https://stackoverflow.com/a/12595082/4213362
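To make the difference concrete, here is a small sketch comparing the two calls on the pattern from the question. The cell "host 10.0.0.1" is a made-up value, used only to show what happens when the address is not at the start of the cell:

```python
import re

pattern = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

exact_cell = "10.0.0.1"          # address is the whole cell
embedded_cell = "host 10.0.0.1"  # hypothetical cell with leading text

match_exact = bool(re.match(pattern, exact_cell))
match_embedded = bool(re.match(pattern, embedded_cell))
search_embedded = bool(re.search(pattern, embedded_cell))

print(match_exact, match_embedded, search_embedded)  # True False True
```

With the sample data either call works, because the address fills the whole column; re.search is the safer choice if cells may carry extra text before the address.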

Vishesh Mangla