
I have a list of 77 items. I have placed all 77 items in a text file (one per line).

I am trying to read this into my Python script, where I will then compare each item in the list to another list pulled via an API.

Problem: for some reason, 2 of the 77 items contain unexpected characters, shown as "u00c2" and "u00a2", so they do not compare equal and are missed. I have no idea why these 2 are affected while the other 75 are fine, and I don't know how to get rid of the characters in Python.

Question:

  • In Python, how can I strip these characters so that none of the items contain special/weird characters and all are just plain text?
  • Is there a method I can use to do this upon reading the file in?

Here is how I am reading the text file into Python:

with open("name_list2.txt", "r") as myfile:
        policy_match_list = myfile.readlines()

policy_match_list = [x.strip() for x in policy_match_list]

Note - "policy_match_list" is the list of 77 policies read in from the text file.

Here is how I am comparing my two lists:

    for policy_name in policy_match_list:
        for us_policy in us_policies:
            if policy_name == us_policy["name"]:
                print(f"Match #{match} | {policy_name}")
                match += 1

Note - "us_policies" is another list of thousands of policies, pulled via API that I am comparing to

This results in 75 of the 77 expected matches, because the other 2 policies end up comparing e.g. "text12 - text" to "text12u00c2-u00a2text" rather than "text12 - text" to "text12 - text".

I hope this makes sense, let me know if I can add any further info

Cheers!

Ryan Brown
  • Have you tried decoding all lines read from the file? – Abhinav Mathur Oct 14 '20 at 11:51
  • Please can you provide some example code for this? I tried this, but it ended up removing about 30 items, which I didn't understand why or how either! – Ryan Brown Oct 14 '20 at 12:07
  • If you can, upload the file somewhere where we can replicate the issue in an attempt to solve it – Abhinav Mathur Oct 14 '20 at 12:31
  • I can't upload the file due to the sensitivity of the data, but it is literally a plain text file in which I have copy and pasted 77 items from excel (all cells text format) into an empty text file. – Ryan Brown Oct 14 '20 at 12:47
  • I have just replicated exactly that again (Excel -> Notepad -> new plain text file), and it has fixed the issue. I have no idea why this has happened, and why it broke the first time, but thank you for looking into this – Ryan Brown Oct 14 '20 at 12:50
  • Posted an answer for the same, check it out – Abhinav Mathur Oct 14 '20 at 12:54

2 Answers

1

Did you try opening the file with an explicit UTF-8 encoding? Since I can't see the file, I can't tell whether this is the problem, but the file might contain characters that the default encoding can't decode correctly (the default is platform-dependent; on Windows it is usually cp1252 rather than UTF-8). Try:

with open("name_list2.txt", "r", encoding="utf-8") as myfile:
    policy_match_list = [line.strip() for line in myfile]

Also, see this question about how to handle control characters: Python - how to delete hidden signs from string?
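To see which lines actually contain unexpected characters, a small diagnostic along these lines may help. This is only a sketch: `find_non_ascii` and `sample_lines` are hypothetical names, and the sample data stands in for the real file, which was not shared.

```python
import unicodedata


def find_non_ascii(lines):
    """Return (line_number, character, Unicode name) for each non-ASCII character."""
    findings = []
    for line_number, line in enumerate(lines, start=1):
        for char in line:
            if ord(char) > 127:
                findings.append(
                    (line_number, char, unicodedata.name(char, "UNKNOWN"))
                )
    return findings


# Stand-in data (an assumption); in practice the lines would come from
# open("name_list2.txt", "r", encoding="utf-8").
sample_lines = ["text12 - text", "text12\u00c2-\u00a2text"]

for line_number, char, name in find_non_ascii(sample_lines):
    print(f"line {line_number}: {char!r} U+{ord(char):04X} {name}")
```

Printing `repr(line)` for a suspect line is a quicker alternative, since it shows characters that an editor may hide.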

Sorry about not posting it as a comment (as I really don't know if this is the solution), I don't have enough reputation for that.

Nimrod Rappaport
  • Thanks for this, yeah I have tried this and it didn't solve the issue sadly – Ryan Brown Oct 14 '20 at 12:07
  • This isn't a straightforward encoding issue, so this wouldn't solve the issue at hand – Abhinav Mathur Oct 14 '20 at 12:08
  • I just looked up the code point from your comparison example: it is "LATIN CAPITAL LETTER A WITH CIRCUMFLEX", so you might want a script that just strips accents. If you want a hacky fix, you could use a regular expression to remove it, but I don't have one off the top of my head that is robust enough (it's not easy, because the text gives no way to distinguish the "u" of a Unicode escape from a regular "u"), so I wouldn't recommend it. – Nimrod Rappaport Oct 14 '20 at 12:12
  • So this is interesting. When I tried decoding earlier as suggested in Abhinav's comment, I lose about 30 odd items from the list, but it does add in that character you describe randomly across other items. However, again, I don't know where that character comes from, since it's not in the text file. – Ryan Brown Oct 14 '20 at 12:16
  • that's weird. perhaps try to paste your file to a website such as this one: https://www.soscisurvey.de/tools/view-chars.php that detects invisible characters. – Nimrod Rappaport Oct 14 '20 at 12:18
  • Okay, I just pasted the contents of my text file into that site, and the only hidden stuff is CR LF at the end of each line – Ryan Brown Oct 14 '20 at 12:24
  • I have just replicated file creation again (Excel -> Notepad -> new plain text file), and it has fixed the issue. I have no idea why this has happened, and why it broke the first time, but thank you for looking into this – Ryan Brown Oct 14 '20 at 12:50
  • No problem! Sorry I couldn't get it fixed, and happy for you that it worked anyway! – Nimrod Rappaport Oct 14 '20 at 13:02
0

Certain Unicode characters aren't properly decoded in some cases. In your case, the characters \u00c2 (Â) and \u00a2 (¢) caused the issue. As of now, I see two fixes:

  1. Try to resolve the encoding by replacing the characters (refer to https://stackoverflow.com/a/56967370)
  2. Copy the text to a new plain text file (if possible) and save it. These extra characters tend to get ignored in that case and consequently removed.
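
The first fix could be sketched roughly as follows. This is not the code from the linked answer: `clean_name`, `file_names` and the sample `us_policies` are hypothetical, and the sketch assumes the stray characters stand where spaces should be, as the "text12 - text" example in the question suggests.

```python
import re


def clean_name(name):
    """Replace non-ASCII characters with spaces, then collapse whitespace."""
    ascii_only = re.sub(r"[^\x00-\x7F]", " ", name)
    return " ".join(ascii_only.split())


# Stand-in data (an assumption); the real lists come from the text file
# and from the API.
file_names = ["text12\u00c2-\u00a2text", "other policy"]
us_policies = [
    {"name": "text12 - text"},
    {"name": "other policy"},
    {"name": "unrelated"},
]

# Clean both sides, and use a set so each lookup is a single hash check
# instead of a scan over thousands of policies.
api_names = {clean_name(policy["name"]) for policy in us_policies}
matches = [name for name in file_names if clean_name(name) in api_names]
print(f"{len(matches)} of {len(file_names)} names matched")
```

Note that this also removes legitimate non-ASCII characters (accented letters, for example), so it only suits names that are expected to be plain ASCII.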
Abhinav Mathur