
I would like to find a string of a certain length, for example 7 characters. The string must contain only uppercase letters and numbers. My idea so far is to read the file line by line...

I am unsure of the best practice here: should I read the whole file in one block, or read it line by line in a loop? Do you have to use a loop to read a file line by line?

# read lines in text file
filetoread = open("mytextfile.txt")

for lines in filetoread:  # right ?
    # just an example of a given string of text (not from the file)
    characters = "D123456"
    for x in characters:
        if x == "D":
            print("found letter", x)

But in my scenario I do not know which characters will be present in my 7-character string, so obviously I can't search for "D".

So my idea is that I need to read the file and check for a string of length 7. I am unsure how to handle content in the file like this:

line 1: My path = "7characters"

So basically I want to find even substrings that qualify: 7 characters containing only uppercase letters and digits.

I don't know; this seems simple, yet I don't think I understand the basic logic behind it.

Fu Hanxi
Imre
  • You can have your condition like first import this module: ```import string``` and then have a condition: ```if(x in string.ascii_uppercase or x in string.digits):``` – JenilDave Jun 16 '20 at 05:19
  • Are these letters and numbers in the ASCII alphabet only? Say `A-Z` plus `0-9`? – tdelaney Jun 16 '20 at 05:27
  • This link talks about an external module that could help: https://stackoverflow.com/questions/36187349/python-regex-for-unicode-capitalized-words – tdelaney Jun 16 '20 at 05:46
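
The character test suggested in the first comment can be turned into a scan that needs no regular expressions. What follows is only a sketch, assuming ASCII `A-Z` and `0-9`. The function name `find_runs` and the rule that a run must be exactly 7 characters long (so a run of 8 or more does not count) are assumptions made for illustration; the question does not state them.

```python
import string

# Characters that qualify: ASCII uppercase letters and digits.
ALLOWED = set(string.ascii_uppercase + string.digits)

def find_runs(text, length=7):
    """Return every maximal run of allowed characters that is exactly `length` long."""
    found = []
    run = []
    # The trailing space is a sentinel so a run at the very end is still checked.
    for ch in text + " ":
        if ch in ALLOWED:
            run.append(ch)
        else:
            if len(run) == length:
                found.append("".join(run))
            run = []  # any other character ends the current run
    return found

print(find_runs('My path = "AB12CD3" and TOOLONG123'))  # ['AB12CD3']
```

The loop keeps one running buffer and resets it whenever a non-qualifying character appears, which is the "basic logic" a regular expression would otherwise hide.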

3 Answers


Reading line by line is an option for a very large file that may not fit in memory. For normal files it is easier to read the whole file at once.

My code handles only ASCII characters, so it does not match letters such as Ë and Ô.

import re

with open("somefile.txt") as file:
    data = file.read()
    result = re.findall(r'\b[A-Z0-9]{7}\b', data)
    print(result)

The regular expression explained:

r'\b[A-Z0-9]{7}\b'
\b = word boundary (the beginning or end of a word)
[A-Z] letter range: any letter from capital A to capital Z
[0-9] number range: any digit from 0 to 9
{7} exactly 7 repetitions of what is specified in front of it, here [A-Z0-9]
\b = word boundary again, so the 7 characters cannot be part of a longer word
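
For a file too large to read at once, the same pattern can be applied one line at a time. This is a minimal sketch rather than part of the original answer: the helper name `find_in_lines` and the sample lines are made up for illustration, and a list of strings stands in for the open file.

```python
import re

# Compile once and reuse; this is the pattern explained above.
PATTERN = re.compile(r'\b[A-Z0-9]{7}\b')

def find_in_lines(lines):
    """Yield (line_number, match) pairs from any iterable of lines."""
    for number, line in enumerate(lines, start=1):
        for match in PATTERN.findall(line):
            yield number, match

# A list stands in for the file here; an open file object is also an
# iterable of lines, so it could be passed in the same way.
sample = ['My path = "AB12CD3"\n', 'nothing here\n', 'X9Y8Z7W and ABCDEFGH\n']
print(list(find_in_lines(sample)))  # [(1, 'AB12CD3'), (3, 'X9Y8Z7W')]
```

Because the function is a generator, only one line is held in memory at a time. `ABCDEFGH` is skipped because it is 8 characters long and the word boundaries require exactly 7.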
Gerrit Geeraerts

There are a lot of uppercase letters and numbers in the Unicode spec. This example normalizes each line of the file and then checks each character's Unicode category. If Unicode says it's uppercase, it counts. (I assume emoji won't have an uppercase version...).

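import unicodedata

def string_finder(filename, length=7):
    with open(filename) as fp:
        for line in fp:
            # NFC composes sequences such as "E" + combining diaeresis into "Ë"
            line = unicodedata.normalize("NFC", line.strip())
            return_chars = []  # current run of consecutive matching characters
            for c in line:
                category = unicodedata.category(c)
                # "Lu" is an uppercase letter; "Nd", "Nl" and "No" are the number categories
                if category == "Lu" or category.startswith("N"):
                    return_chars.append(c)
                    if len(return_chars) == length:
                        return "".join(return_chars)
                else:
                    return_chars = []  # a non-matching character breaks the run
    return None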
tdelaney
  • I wonder how to recursively search *.ini or wildcard files, say in folder D:\test-data\*.ini – Imre Sep 13 '22 at 12:30
  • @Imre - I don't see how that's related to my answer. Check out the glob module (https://docs.python.org/3/library/glob.html) for a way to do it. – tdelaney Sep 13 '22 at 21:29
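
The recursive search asked about in the comment above could be sketched as follows. It uses `pathlib.Path.rglob`, which does the same job as the `glob` module mentioned in the reply; the function name `search_tree` is hypothetical, and the pattern is the ASCII-only one from the first answer.

```python
from pathlib import Path
import re

PATTERN = re.compile(r'\b[A-Z0-9]{7}\b')

def search_tree(root, glob_pattern="*.ini"):
    """Map each matching file under `root` (searched recursively) to its matches."""
    results = {}
    for path in Path(root).rglob(glob_pattern):
        if path.is_file():
            # errors="replace" keeps the scan going on undecodable bytes
            text = path.read_text(errors="replace")
            matches = PATTERN.findall(text)
            if matches:
                results[path] = matches
    return results

# Example call with the folder from the comment (adjust to a path that exists):
# print(search_tree(r"D:\test-data"))
```

Files without any match are left out of the returned dictionary.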

In general, regular expressions (regex) are usually the most succinct way, and often the fastest, to search a file for strings that meet certain criteria. I recommend using the RegExr tool to develop the regular expression for each specific use case you might have. For your case (finding 7 consecutive uppercase or numeric characters in a file), I would do something like this:

import re

# with open("examplefile.txt") as f:
#     text = f.read()

# This is just an example, since I don't have your text file
text = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a G4LL3YS of type and scrambled it to make a type specimen book. 
It has survived not only five centuries, but also the leap into ELEC7R0NIC typesetting, remaining essentially unchanged.
It was popularised in the 19601970s with the release of LETRASET sheets containing Lorem Ipsum passages, 
and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
"""
# Searches for the pattern in the sample text
found_patterns = re.findall(r'([A-Z\d]{7})', text)
# Could also use below, if you only want the first match
# found_patterns = re.search(r'([A-Z\d]{7})', text).group()
print(found_patterns)
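
One thing to be aware of is that a pattern without word boundaries also matches the first 7 characters of a longer run. The sketch below compares the two forms on a shortened version of the sample text; the shortened string and the variable names `unbounded` and `bounded` are chosen here for illustration.

```python
import re

text = "a G4LL3YS of type, ELEC7R0NIC typesetting, the 19601970s, LETRASET sheets"

# Without boundaries, the first 7 characters of longer runs also match.
unbounded = re.findall(r'[A-Z\d]{7}', text)
print(unbounded)  # ['G4LL3YS', 'ELEC7R0', '1960197', 'LETRASE']

# With \b on both sides, only words that are exactly 7 characters long match.
bounded = re.findall(r'\b[A-Z\d]{7}\b', text)
print(bounded)  # ['G4LL3YS']
```

Which form is right depends on whether "a string of length 7" means a standalone word or any 7 consecutive qualifying characters.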

Sam Jett