
What do I intend to do?

To search for a list of alphabetic strings across a set of files on the Windows file system (around 25K files of varying sizes and extensions, primarily flat text files, the largest being no more than a few MB).

What did I do to achieve this?

import mmap
import re

# files (paths to scan) and file_write_handle (open output file) are assumed to be defined earlier
for each_file in files:
    with open(each_file, "rb") as file_read_handle:
        first_char = file_read_handle.read(1)  # empty files cannot be memory-mapped
        if first_char:
            file_read_content_mappd = mmap.mmap(file_read_handle.fileno(), 0, access=mmap.ACCESS_READ)
            if re.search(br'(?i)T_0008X_WEB', file_read_content_mappd):
                file_write_content = 'Text T_0008X_WEB found in {}'.format(each_file)
                file_write_handle.write(file_write_content)
                file_write_handle.write("\n")
            file_read_content_mappd.close()  # release the mapping before the file is closed
file_write_handle.close()

This code works fine for a hardcoded search text (T_0008X_WEB) across files opened in binary mode ("rb"), which I use to avoid the error UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 776: character maps to <undefined>.

However, when I try to search for a list of values by replacing the hardcoded value with a variable, like this: if re.search('br\'(?i)' + regex_search_str_byte + '\'', file_read_content_mappd):, I run into the following issues:

  1. With re.search('br\'(?i)' + regex_search_str + '\'', file_read_content_mappd), I get an error because the file content is bytes while the search pattern is a string.
  2. With re.search(regex_search_str_byte, file_read_content_mappd), no match is found, because the characters br'(?i) are treated as part of the byte-converted search text.

How can I run a regex search with byte-converted text, for a list of values, against a file opened in binary mode?

  • Looks like you need `if re.search(str.encode(regex_search_str), file_read_content_mappd)` – Wiktor Stribiżew Sep 13 '17 at 06:38
  • @WiktorStribiżew: In such case, how should we include the regex flag **br'(?i)** ? Have tried to do the same in case 2 mentioned in question, like, Tried to hold the _entire value including regex flags_ into the variable _regex_search_str_ and converted that string to byte and saved in _regex_search_str_byte_. I think you are suggesting the same with string encode to UTF-8 option, however, in this case it returned no match and I suppose the byte converted search text also considered the regex flags to be part of the search text. Suggestions specific to this would be more helpful. – Lakshmanan Chidambaram Sep 13 '17 at 06:46
  • `if re.search(str.encode(regex_search_str), file_read_content_mappd, flags=re.I)`. The flag can be passed as an argument to the `re.search` method. `br` are not necessary since they are used to modify a string literal, and you are using a variable. I assume `regex_search_str` is a UTF-8 string. See [this question](https://stackoverflow.com/questions/7585435/best-way-to-convert-string-to-bytes-in-python-3). – Wiktor Stribiżew Sep 13 '17 at 06:48
  • How did you create/assign `regex_search_str_byte`? – Wiktor Stribiżew Sep 13 '17 at 06:55
  • `regex_search_str_byte = bytes(each_string, 'utf-8')`. _each_string_ is an element of another python list of alphanumeric characters – Lakshmanan Chidambaram Sep 13 '17 at 06:59
  • Ok, does `re.search(regex_search_str_byte, file_read_content_mappd, flags=re.I)` find any matches? – Wiktor Stribiżew Sep 13 '17 at 07:04
  • Thanks for the `flags=re.I` option to replace **(?i)** in my search. Also, could you please let me know if there is a similar option for **br'** as well (to perform a binary text search)? – Lakshmanan Chidambaram Sep 13 '17 at 07:08
  • Working with this `re.search(regex_search_str_byte, file_read_content_mappd, flags=re.I)` now and will update in few mins. – Lakshmanan Chidambaram Sep 13 '17 at 07:09
  • I do not understand what you mean by *similar option for br'*: these prefixes are just string literal modifiers, they are not performing any search. – Wiktor Stribiżew Sep 13 '17 at 07:10
  • Thanks @WiktorStribiżew , your suggestion to use **flags=re.I** helped. I misunderstood on the string literal modifiers. – Lakshmanan Chidambaram Sep 13 '17 at 07:25
  • @WiktorStribiżew : You may please post your suggestion as an answer so that I could accept it. Thanks and have a good day! :) – Lakshmanan Chidambaram Sep 13 '17 at 07:26

1 Answer

Use

re.search(regex_search_str_byte, file_read_content_mappd, flags=re.I)

The re.I flag can be passed as an argument to the re.search method. The br prefixes are not necessary: they only modify a string literal, and you are using a variable. The variable itself must hold a bytes object, for example one created with bytes(each_string, 'utf-8').
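For a whole list of values, one possible approach is to combine them into a single bytes pattern and compile it once. The following is only a minimal sketch under some assumptions: the helper names build_pattern and file_contains are hypothetical (they are not from the question), the search terms are treated as literal text rather than as regular expressions, and UTF-8 is assumed as the encoding.

```python
import mmap
import os
import re


def build_pattern(search_terms):
    """Compile one case-insensitive bytes pattern that matches any of the terms."""
    # Encode each term to bytes, then escape it so characters like '.' stay literal.
    alternatives = [re.escape(term.encode("utf-8")) for term in search_terms]
    return re.compile(b"|".join(alternatives), flags=re.I)


def file_contains(path, pattern):
    """Return True if the compiled bytes pattern matches anywhere in the file."""
    with open(path, "rb") as handle:
        if os.fstat(handle.fileno()).st_size == 0:
            return False  # an empty file cannot be memory-mapped
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return pattern.search(mapped) is not None
```

Calling build_pattern once before the loop over the files avoids rebuilding the pattern for every file. Note that with a bytes pattern, re.I only folds the case of ASCII letters.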

Wiktor Stribiżew