0

I have this large SQL file with about 1 milllion inserts in it, some of the inserts are corrupted (about 6000) with weird characters that i need to remove so i can insert them into my DB.

Ex: INSERT INTO BX-Books VALUES ('2268032019','Petite histoire de la d�©sinformation','Vladimir Volkoff',1999,'Editions du Rocher','http://images.amazon.com/images/P/2268032019.01.THUMBZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.MZZZZZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.LZZZZZZZ.jpg');

i want to remove only the weird characters and leave all of the normal ones

I tried using the following code to do so:

import fileinput
import string

fileOld = open('text1.txt', 'r+')
file = open("newfile.txt", "w")

for line in fileOld: #in fileinput.input(['C:\Users\Vashista\Desktop\BX-SQL-Dump\test1.txt']):
    print(line)
    s = line
    printable = set(string.printable)
    filter(lambda x: x in printable, s)
    print(s)
    file.write(s)

but it doesnt seem to be working, when i print s it is the same as what is printed during line and whats stranger is that nothing gets written to the file.

Any advice or tips on how to solve this would be useful

GLHF
  • 3,835
  • 10
  • 38
  • 83
Big_VAA
  • 754
  • 1
  • 6
  • 11
  • those are not `weird characters`. try reading the file using an encoding like this: `fileinput.input(filename, openhook=fileinput.hook_encoded("utf-8"))` or `open(filename, 'r', encoding="utf-8")` – muratgu May 27 '16 at 02:21
  • Which version of Python are you using: 2 or 3? They handle character-set conversion (and illegal bytes in encoded character data) differently. As muratgu pointed out, you're probably making an assumption about the character encoding of `text1.txt` that isn't true. – Kevin J. Chase May 27 '16 at 03:33

2 Answers2

2
    import string

strg = "'2268032019', Petite histoire de la d�©sinformation','Vladimir Volkoff',1999,'Editions du Rocher','http://images.amazon.com/images/P/2268032019.01.THUMBZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.MZZZZZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.LZZZZZZZ.jpg');"
newstrg = ""
acc = """ '",{}[].`;:  """
for x in strg:
    if x in string.ascii_letters or x in string.digits or x in acc:
        newstrg += x
print (newstrg)

Output;

'2268032019', Petite histoire de la dsinformation','Vladimir Volkoff',1999,'Editions du Rocher','http:images.amazon.comimagesP2268032019.01.THUMBZZZ.jpg','http:images.amazon.comimagesP2268032019.01.MZZZZZZZ.jpg','http:images.amazon.comimagesP2268032019.01.LZZZZZZZ.jpg';
>>>

You can check if the element of the string is in ASCII letters and then create a new string without non-ASCII letters.

Also it depends on your variable type. If you work with lists, you don't have to define a new variable. Just del mylist[x] will work.

GLHF
  • 3,835
  • 10
  • 38
  • 83
  • Hey i tried that, however it removes all of the other formatting like periods, and quotes and so on, is there any way to handle that? – Big_VAA May 27 '16 at 02:23
  • @VashistaAmarjit You said you split them, so if you split them you could work with specific elements. What's your input type? List? Str? – GLHF May 27 '16 at 02:25
  • imput type is string – Big_VAA May 27 '16 at 02:39
  • Tell me what you want to keep excatly (periods quotes etc.) Tell me like : "%&'@ then I'll edit my answer. – GLHF May 27 '16 at 02:47
  • I want to keep single and double quotes, commas, colons, semi colons , and spaces, periods, and this character `. Thanks a lot i appreciate it a lot! – Big_VAA May 27 '16 at 02:54
  • @VashistaAmarjit Edited, if it's solved your problem please don't forget to check it as answer. – GLHF May 27 '16 at 02:57
-1

You can use regular expressions sub() to do simple string replacements. https://docs.python.org/2/library/re.html#re.sub

# -*- coding: utf-8 -*-

import re

dirty_string = u'©sinformation'
# in first param, put a regex to screen for, in this case I negated the desired characters.
clean_string = re.sub(r'[^a-zA-Z0-9./]', r'', dirty_string)

print clean_string
# Outputs
>>> sinformation
David Lai
  • 814
  • 7
  • 13
  • how do i do the same but include commas, quotes , and spaces? – Big_VAA May 27 '16 at 02:45
  • just add those characters in the regular expression. Find the set of characters you want to accept, than use [^...] to negate that in the replacement expression. – David Lai May 27 '16 at 02:58
  • it seems like when i add the quotes it gives an error, should i add another carat before it? – Big_VAA May 27 '16 at 03:01
  • if you ad quotes, you need to escape it. Be careful though, you may need to escape certain characters 2x (1 in the python string, and in the regex expression) http://stackoverflow.com/questions/280435/escaping-regex-string-in-python – David Lai May 27 '16 at 03:06