I have the following script to import and export arbitrary TXT/CSV files from the CLI. Every line that passes through has to be unique, compared case-insensitively, and the output has to be UTF-8. Can I accomplish this with a set variable? I'm quite new to Python, so every comment or suggestion is welcome!

This is my current script:

import hashlib
import sys

# Expect two arguments: the input file and the output file.
if len(sys.argv) < 3:
    print("Wrong parameter; script | inputfile | outputfile")
    sys.exit(1)

input_file_path = sys.argv[1]
output_file_path = sys.argv[2]

# Hashes of the lines already written to the output file.
completed_lines_hash = set()

output_file = open(output_file_path, "w")

for line in open(input_file_path, "r"):
    # Hash the line without its trailing whitespace.
    hashValue = hashlib.md5(line.rstrip().encode('utf-8')).hexdigest()

    # Write the line only the first time its hash is seen.
    if hashValue not in completed_lines_hash:
        output_file.write(line)
        completed_lines_hash.add(hashValue)

output_file.close()
Caloure
  • You'd just lowercase or uppercase your values, both when testing and storing. – Martijn Pieters Dec 14 '18 at 13:15
  • Could you provide a syntax example? I tried adding `map(str.lower` to my set value, but that's not working. – Caloure Dec 14 '18 at 13:52
  • `if hashValue.lower() not in completed_lines_hash:` and `completed_lines_hash.add(hashValue.lower())` – Martijn Pieters Dec 14 '18 at 14:00
  • However, I note that `.hexdigest()` already produces lower-case hex digits *only*, so there is no point in lowercasing these some more. Are you perhaps looking to lowercase your *inputs*, so the line you read from the file? – Martijn Pieters Dec 14 '18 at 14:16
  • I'd also not open the file as text, then encode again. Just open as binary, you can lowercase `bytes` objects too (limited to ASCII letters). I'm also not sure why you use hashing here, you could just store the stripped lines straight in the set, hashing adds no additional value. – Martijn Pieters Dec 14 '18 at 14:17
  • If your file contents are ASCII only, then the following would work: `with open(input_file_path, 'rb') as inf, open(output_file_path, 'wb') as outf:`, `seen = set()`, `for line in inf:` `test_line = line.rstrip().lower()`, `if test_line not in seen:` `outf.write(line)`, `seen.add(test_line)`. – Martijn Pieters Dec 14 '18 at 14:19
  • There's no special need for hashing; I thought it could be a good way to ensure uniqueness. The files are all ASCII. – Caloure Dec 14 '18 at 14:26
  • When I use your suggested `if hashValue.lower() not in completed_lines_hash` and `completed_lines_hash.add(hashValue.lower())`, I get a syntax error. – Caloure Dec 14 '18 at 14:27
  • Let me emphasise once more: there is no point in lowercasing the hex digest of the hash. It is already lowercase. The hash of `"This is a line"` is not going to change when lowercased, and won't match the hash for `"this is a line"`, because you didn't lowercase `"This is a line"`. – Martijn Pieters Dec 14 '18 at 14:28
  • If you need to match your *lines* case-insensitively, then lowercase the line, not the hash digest. – Martijn Pieters Dec 14 '18 at 14:29
  • I would hope and expect that you pay enough attention to the Python syntax to know what those pieces of code achieve, and avoid syntax errors. If you used the code in your comment literally, then you are missing a `:` in the `if` statement, which is present in my version. – Martijn Pieters Dec 14 '18 at 14:30
  • I did, and I used the `:` in the `if` statement as well as the `with open` you provided, and with both I'm getting syntax errors. I'm assuming that what I want to achieve is too difficult for a beginner. Thanks for your help. – Caloure Dec 14 '18 at 15:05
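
For reference, the approach suggested in the comments above (store the stripped, case-normalised line in the set and skip the hashing) could look roughly like the sketch below. It is not code posted in the thread: the function names `unique_lines` and `dedupe_file` are hypothetical, it opens the files as UTF-8 text instead of binary, and it uses `str.casefold()` where the comments use `.lower()`, which should behave the same for ASCII-only input.

```python
def unique_lines(lines):
    """Yield each line whose stripped, case-folded form has not been seen yet."""
    seen = set()
    for line in lines:
        # Normalise only the lookup key: drop trailing whitespace and ignore case.
        key = line.rstrip().casefold()
        if key not in seen:
            seen.add(key)
            # Emit the original line, so the first spelling encountered is kept.
            yield line


def dedupe_file(input_path, output_path):
    """Copy input_path to output_path, skipping case-insensitive duplicate lines."""
    # Assumption: the input is valid UTF-8 (ASCII is a subset of UTF-8).
    with open(input_path, "r", encoding="utf-8") as inf, \
         open(output_path, "w", encoding="utf-8") as outf:
        outf.writelines(unique_lines(inf))
```

From the command line this would be called as `dedupe_file(sys.argv[1], sys.argv[2])` after the argument check in the question. Keeping the set logic in a separate generator makes it easy to try on a plain list of strings before touching any files.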

0 Answers