Most efficient way to strip forbidden characters in file name from Unicode string

Question

I have a string that contains some data I parse from the web, and I create a file named after this data.

string = urllib.urlopen("http://example.com").read()
f = open(path + "/" + string + ".txt", "w")  # "w" is needed in order to write
f.write("abcdefg")
f.close()

The problem is that the string may include one of these characters: \ / * ? : " < > |. I'm using Windows, which forbids those characters in a filename. Also, the string is in Unicode format, which makes most of the solutions I found useless.

So, my question is: what is the most efficient / pythonic way to strip those characters? Thanks in advance!

Edit: the filename is in Unicode format, not str!

http://stackoverflow.com/questions/1033424/how-to-remove-bad-path-characters-in-python – NPE Dec 25 '14 at 12:25
@NPE Sorry! I googled before but found nothing. Anyway, maybe there are better solutions, so I'll keep the question up. – ohad987 Dec 25 '14 at 12:27

score 16 · Answer 1 · answered Dec 25 '14 at 12:27


We don't know what your data looks like, but you can use re.sub:

import re
# Pass the variable itself, not the literal text "your_string".
your_string = re.sub(r'[\\/*?:"<>|]', "", your_string)
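
As a minimal sketch of the same substitution, written in Python 3 syntax (where every str is Unicode text), the pattern can be compiled once and wrapped in a helper. The name `strip_forbidden` and the sample title are illustrative assumptions, not part of the original answer.

```python
import re

# The characters Windows forbids in file names, as listed in the question.
# The raw string keeps the backslash escaped inside the character class.
FORBIDDEN = re.compile(r'[\\/*?:"<>|]')

def strip_forbidden(name):
    """Return name with every forbidden character removed."""
    return FORBIDDEN.sub("", name)

# A sample title containing Hebrew letters ("shalom") plus forbidden characters.
title = '\u05e9\u05dc\u05d5\u05dd: "a/b"? <c>|d*\\e'
print(strip_forbidden(title))
```

Non-ASCII characters pass through untouched because the character class only matches the nine listed characters.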

answered Dec 25 '14 at 12:27

Hackaholic


The backslash (`\ `) should be escaped. (`'[\\\\/*?:"<>|]'` or `r'[\\/*?:"<>|]'`). Otherwise backslashes will not be removed. – falsetru Dec 25 '14 at 12:29
yes it is, thanks for notifying me – Hackaholic Dec 25 '14 at 12:31
I forgot to mention I'm using Unicode, and this solution doesn't work with Unicode strings. Thanks anyway! – ohad987 Dec 25 '14 at 12:43

Vishnu Upadhyay · Accepted Answer · 2018-03-05T10:39:13.323

11

The fastest way to do this is to use unicode.translate (see the unicode.translate documentation).

In [31]: _unistr = u'sdfjkh,/.,we/.,132?.?.23490/,/'  # any random string

In [48]: remove_punctuation_map = dict((ord(char), None) for char in '\\/*?:"<>|')

In [49]: _unistr.translate(remove_punctuation_map)
Out[49]: u'sdfjkh,.,we.,132..23490,'

To remove all punctuation:

In [45]: import string

In [46]: remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

In [47]: _unistr.translate(remove_punctuation_map)
Out[47]: u'sdfjkhwe13223490'
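
For readers on Python 3, where `str` is already Unicode, the same translate-based idea can be sketched as follows. This is only an illustration under that assumption: the names `FORBIDDEN_CHARS`, `REMOVE_MAP`, and `sanitize_filename`, as well as the sample input, are hypothetical and not from the original answer.

```python
# Characters Windows forbids in file names, as listed in the question.
FORBIDDEN_CHARS = '\\/*?:"<>|'

# Map each forbidden code point to None so that translate() deletes it.
REMOVE_MAP = {ord(ch): None for ch in FORBIDDEN_CHARS}

def sanitize_filename(name):
    """Return name with the forbidden characters removed."""
    return name.translate(REMOVE_MAP)

# Sample input mixing Hebrew letters ("shalom") with forbidden characters.
raw = '\u05e9\u05dc\u05d5\u05dd/world?.txt'
safe = sanitize_filename(raw)
print(safe)
```

Building the mapping once and reusing it avoids recreating the dictionary for every file name.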

edited Mar 05 '18 at 10:39

answered Dec 25 '14 at 12:27

Vishnu Upadhyay


It's probably the most efficient solution out there. Thanks! – ohad987 Dec 25 '14 at 12:33
@ohad987 Please mark it as accepted if you find it correct and helpful. – Vishnu Upadhyay Dec 25 '14 at 12:34
There's a problem: I forgot to mention that I'm using Unicode, which makes this solution useless for me. But it's really great and could help others who use `str` instead. – ohad987 Dec 25 '14 at 12:42
@ohad987 Then use the `str()` builtin to change unicode to string. – Vishnu Upadhyay Dec 25 '14 at 12:45
This string may contain Hebrew/Arabic/etc. characters, and using `str()` will throw an exception. – ohad987 Dec 25 '14 at 12:47
@ohad987 I think I solved your Unicode problem. Check the updated solution. – Vishnu Upadhyay Dec 25 '14 at 13:07
