How to check txt file for content

Question

I am trying to write a Python script that reads a text file and checks whether each line starts with a dot followed by two letters (.xx), which would tell me it is a country code. I have tried split and other methods but have not gotten it to work. Here is the code I have so far:

# Python program to
# demonstrate reading files
# using for loop
import re

file2 = open('contry.txt', 'w')
file3 = open('noncountry.txt', 'w')
# Opening file
file1 = open('myfile.txt', 'r')
count = 0
noncountrycount = 0
countrycounter = 0
# Using for loop
print("Using for loop")
for line in file1:
    count += 1
    
    pattern = re.compile(r'^\.\w{2}\s')
    if pattern.match(line):
        print(line)
        countrycounter += 1
    else:
        print("fail", line)

        noncountrycount += 1

print(noncountrycount)
print(countrycounter)
file1.close()
file2.close()
file3.close()

The txt file has this in it

.aaa    generic American Automobile Association, Inc.
.aarp   generic AARP
.abarth generic Fiat Chrysler Automobiles N.V.
.abb    generic ABB Ltd
.abbott generic Abbott Laboratories, Inc.
.abbvie generic AbbVie Inc.
.abc    generic Disney Enterprises, Inc.
.able   generic Able Inc.
.abogado    generic Minds + Machines Group Limited
.abudhabi   generic Abu Dhabi Systems and Information Centre
.ac country-code    Internet Computer Bureau Limited
.academy    generic Binky Moon, LLC
.accenture  generic Accenture plc
.accountant generic dot Accountant Limited
.accountants    generic Binky Moon, LLC
.aco    generic ACO Severin Ahlmann GmbH & Co. KG
.active generic Not assigned
.actor  generic United TLD Holdco Ltd.
.ad country-code    Andorra Telecom
.adac   generic Allgemeiner Deutscher Automobil-Club e.V. (ADAC)
.ads    generic Charleston Road Registry Inc.
.adult  generic ICM Registry AD LLC
.ae country-code    Telecommunication Regulatory Authority (TRA)
.aeg    generic Aktiebolaget Electrolux
.aero   sponsored   Societe Internationale de Telecommunications Aeronautique (SITA INC USA)

I am getting this error now:

File "C:/Users/tyler/Desktop/Python Class/findcountrycodes/Test.py", line 15, in <module>
    for line in file1:
File "C:\Users\tyler\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 8032: character maps to <undefined>

It would be much more helpful if you could include a relevant snippet of the contents of your test file as part of your question. — , Apr 21 '21 at 18:28
what is the point of `import re` if You don't use it anyways? — Matiiss, Apr 21 '21 at 18:30
More important: `not got it to work` ➡️ What is the __error or output__ you got. Please [edit] and post the full output (be it errors or unexpected print results). — hc_dev, Apr 21 '21 at 18:32
You are splitting the strings on the first place there are three spaces. The *country codes* only have one space after them so that *logic* doesn't work. — wwii, Apr 21 '21 at 18:39
Read [this article](https://ericlippert.com/2014/03/05/how-to-debug-small-programs/) for tips on debugging your code. — Code-Apprentice, Apr 21 '21 at 18:57

score 2 · Answer 1 · answered Apr 21 '21 at 18:36

Is this something You were looking for:

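# Read all lines of the file into a list
with open('lorem.txt') as file:
    data = file.readlines()

for line in data:
    parts = line.split()
    # Skip blank lines, which have no first token
    if not parts:
        continue
    temp = parts[0]
    # A dot plus two characters, e.g. '.ac'
    if len(temp) == 3:
        print(temp)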

In short:

file.readlines() returns a list of all lines in the file; in effect it splits the file on \n.

Each line is then split on whitespace. The code you need comes first in the line, so it is also the first item in the resulting list. Because your formatting looks consistent, a first item that is exactly 3 characters long (the dot plus two letters) is a country code.

hc_dev · Accepted Answer · 2021-04-21T21:34:36.300

It's usually not only an issue with the code, so we need all the context to reproduce, debug and solve.

Encoding error

The final hint was the console output (error, stacktrace) you pasted.

Read the stacktrace & research

This is how I read & analyze the error-output (Python's stacktrace):

... C:/Users/tyler/Desktop ...

... findcountrycodes/Test.py", line 15 ...

... Python36\lib\encodings\cp1252.py ...

... UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 8032:

From this output we can extract important contextual information to research & solve the issue:

you are using Windows
line 15 of your script Test.py (for line in file1:) is where the error surfaces, while reading the file opened by file1 = open('myfile.txt', 'r')
you are using Python 3.6, and the file was decoded with Windows-1252 (cp1252) because no encoding was passed to open()
the root cause is UnicodeDecodeError, a frequently occurring Python exception when reading files
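The error can be reproduced without the original file. The snippet below is a minimal sketch, assuming only that the file contains the byte 0x90 named in the error message: Python's cp1252 codec has no character assigned to that byte, so decoding it fails.

```python
# The byte reported in the traceback; cp1252 assigns no character to it.
raw = b"\x90"

try:
    raw.decode("cp1252")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    # prints: UnicodeDecodeError character maps to <undefined>
    print(type(exc).__name__, exc.reason)
```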

You can now:

search Stack Overflow and the web for this exception: UnicodeDecodeError
improve your question by adding this context (as keywords, tag, or dump as plain output)

Try a different encoding

One answer suggests using UTF-8, the most common encoding today: open(filename, encoding="utf8")
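Here is a small sketch of that suggestion. It assumes the input really is UTF-8, which the question does not confirm. The sample line and the name "Ðemo Registry" are made up for illustration; "Ð" was chosen because UTF-8 stores it as the bytes C3 90, so the file contains the same 0x90 byte as in the error message.

```python
import os
import tempfile

# Hypothetical sample line; "Ð" (U+00D0) is encoded in UTF-8 as C3 90.
text = ".ac country-code    Ðemo Registry\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myfile.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Passing the encoding explicitly avoids relying on the platform default.
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()

    # Decoding the same bytes as cp1252 fails on the 0x90 byte.
    try:
        with open(path, "r", encoding="cp1252") as f:
            f.read()
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

print(lines, cp1252_failed)
```

If the file turns out not to be UTF-8, this call raises its own UnicodeDecodeError, which is why checking the actual encoding first is the safer route.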

Detect the file encoding

A methodical approach would be:

check the file's encoding or charset, e.g. with an editor such as Notepad or Notepad++ on Windows
open the file in your Python code with the proper encoding
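If no editor is at hand, trial decoding can narrow the choice down. The sketch below is not true detection: the helper guess_encoding and its candidate list are assumptions for illustration, and it only reports the first candidate that decodes the bytes without an error.

```python
def guess_encoding(raw, candidates=("utf-8", "cp1252")):
    """Return the first candidate that decodes `raw` without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None

# For a real file, read the bytes first: raw = open(path, "rb").read()
utf8_guess = guess_encoding("Ð".encode("utf-8"))  # bytes C3 90
cp1252_guess = guess_encoding(b"\xe9")            # "é" in cp1252, invalid as UTF-8
no_guess = guess_encoding(b"\x90")                # invalid in both candidates
print(utf8_guess, cp1252_guess, no_guess)
```

A successful decode does not prove the encoding is right. Encodings such as latin-1 accept every byte, so they are left out of the candidate list.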

Filtering lines for country-codes

So you want only the lines with country-codes.

Filtering expected

These 3 lines of your input file are then expected to pass the filter:

.ad country-code    Andorra Telecom
.ac country-code    Internet Computer Bureau Limited
.ae country-code    Telecommunication Regulatory Authority (TRA)

Solution using regex

As you already did, test each line of the file. Test whether the line starts with these 4 characters: a dot, two letters (.xx, where xx can be any ASCII letters), and one whitespace character.

Regex explained

This regular expression tests for a valid two-letter country code:

^\.\w{2}\s

^ from the start of the string (line)
\. the first character must be a literal dot
\w{2} (followed by) any two word characters (⚠️ \w also matches digits and the underscore, e.g. _0)
\s (followed by) a single whitespace character (blank, tab, etc.)
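To show the effect of that warning, the sketch below compares the pattern above with a stricter variant. The stricter pattern, which uses [a-z]{2} instead of \w{2}, is a suggestion of this example and assumes the codes in the file are lowercase ASCII letters, as in the sample data.

```python
import re

loose = re.compile(r'^\.\w{2}\s')       # pattern from the answer
strict = re.compile(r'^\.[a-z]{2}\s')   # assumed variant: lowercase ASCII letters only

real_line = ".ad country-code    Andorra Telecom"
odd_line = "._0 not a country code"     # made-up line to trigger the \w caveat

results = {
    "loose_real": bool(loose.match(real_line)),
    "strict_real": bool(strict.match(real_line)),
    "loose_odd": bool(loose.match(odd_line)),
    "strict_odd": bool(strict.match(odd_line)),
}
print(results)
```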

Python code

This is done in your code as follows (assuming line holds one line read from the file):

import re

line = '.ad '
pattern = re.compile(r'^\.\w{2}\s')
if pattern.match(line):
    print('found country-code')

Here is a runnable demo on IDEone
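Since the question opens two output files, one way to finish the job is to sort each line into one of two lists. This is only a sketch: the helper name partition_lines is made up, and the sample lines are copied from the question.

```python
import re

COUNTRY_CODE = re.compile(r'^\.\w{2}\s')

def partition_lines(lines):
    """Split lines into (country-code lines, all other lines)."""
    country, other = [], []
    for line in lines:
        if COUNTRY_CODE.match(line):
            country.append(line)
        else:
            other.append(line)
    return country, other

sample = [
    ".abb    generic ABB Ltd\n",
    ".ac country-code    Internet Computer Bureau Limited\n",
    ".academy    generic Binky Moon, LLC\n",
    ".ad country-code    Andorra Telecom\n",
]
country, other = partition_lines(sample)
print(len(country), len(other))
```

With real files, the two lists could then be passed to writelines() on the output files opened in the question.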

Further Readings

Filter list with regex
Python 3 documentation: Regular Expression HOWTO
Bharath Sivakumar, on Medium (2020): Extracting Words from a string in Python using the “re” module
koenwoortman's blog (2020): Remove None values from a list in Python

did You miss the part where country code is "ae" and "ac"? also this doesn't return the line does it (although that wasn't asked for)? — Matiiss, Apr 21 '21 at 18:40
@Matiiss yes, thanks for reminding me: replaced the duplicate with the missed ones. — hc_dev, Apr 21 '21 at 18:45
when i try this it gives me an error Traceback (most recent call last): File "C:/Users/tyler/Desktop/Python Class/findcountrycodes/Test.py", line 15, in for line in file1: File "C:\Users\tyler\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0] UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 8032: character maps to — ShadowGunn, Apr 21 '21 at 19:49
@ShadowGunn This is exactly the __error output we need__. Add it to your question and we can help you _debugging_. — hc_dev, Apr 21 '21 at 19:57

score 1 · Answer 3 · answered Apr 21 '21 at 18:50

You are splitting on three spaces, but the country codes are followed by only one space, so that logic cannot isolate them.

>>> s = '.ac country-code    Internet Computer Bureau Limited'
>>> s.strip().split('   ')
['.ac country-code', ' Internet Computer Bureau Limited']
>>>

Check if the third character is not a space and the fourth character is a space.

>>> if s[2] != ' ' and s[3] == ' ':
...     print(f'country code: {s[:3]}')
... else: print('NO')
...
country code: .ac
>>> s = '.abogado    generic Minds + Machines Group Limited'
>>> if s[2] != ' ' and s[3] == ' ':
...     print(f'country code: {s[:3]}')
... else: print('NO')
...
NO
>>>
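The same check can be wrapped in a function for use inside a loop over the file. This is a sketch with a made-up name, leading_country_code. It adds a length guard, because indexing s[3] on a line shorter than four characters raises IndexError, and it uses isspace() so that a tab after the code is accepted as well.

```python
def leading_country_code(line):
    """Return the leading '.xx' token if the line starts with one, else None."""
    if (
        len(line) >= 4
        and line[0] == '.'
        and not line[1].isspace()
        and not line[2].isspace()
        and line[3].isspace()
    ):
        return line[:3]
    return None

code_found = leading_country_code('.ac country-code    Internet Computer Bureau Limited')
code_missing = leading_country_code('.abogado    generic Minds + Machines Group Limited')
code_blank = leading_country_code('')
print(code_found, code_missing, code_blank)
```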