I have a collection of text files, and I want to replace certain characters within them with other characters using Python. The characters I wish to replace are Welsh digraphs: two separate characters that together form a single letter.

Here are some Welsh digraphs and some characters they could be replaced by:

ch - ƒ (ASCII code 131)
dd - Œ (ASCII code 140)
ff - ¤ (ASCII code 164)
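
As the comments below point out, codes 131, 140 and 164 lie outside ASCII (which stops at 127). A quick check, assuming the codes come from Windows code page 1252 as one commenter suggests, shows which characters they map to and why `chr(131)` would not give the same result. The helper name `cp1252_char` is made up for this illustration.

```python
def cp1252_char(code):
    """Decode a single byte value as a CP-1252 character (hypothetical helper)."""
    return bytes([code]).decode("cp1252")

# Codes taken from the list above.
codes = {"ch": 131, "dd": 140, "ff": 164}

for digraph, code in codes.items():
    char = cp1252_char(code)
    # ord(char) is the Unicode code point, which differs from the CP-1252 byte
    # value for 131 and 140.
    print(digraph, code, char, hex(ord(char)))

# chr() works on Unicode code points, so chr(131) is a control character,
# not the character that CP-1252 byte 131 decodes to.
print(chr(131) == cp1252_char(131))
```

In practice it is safer to write the replacement characters as literals (or Unicode escapes) in the source than to build them from byte values.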

The text files I will be working with may be fairly large (a few GB), and there are 8 digraphs; in total, 24 replacement characters will be needed to cover all forms (ch, Ch, CH). What would be an effective and efficient way to implement these replacements?

UPDATE:

I have a working (so far) version of a program which was based on an answer from this question:

replacing text in a file with Python

Here is my code:

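# Digraph -> single-character stand-in (partial mapping; not all 24 forms yet)
replacements = {'ch':'ƒ', 'Ch':'†', 'ff':'¤', 'FF':'¦', 'Dd':'•', 'll':'º', 'Ll':'¿'}
print("Input file location: ")
inLoc = input("> ")
print("Output file location: ")
outLoc = input("> ")

# Explicit encoding (assumes UTF-8 files) so the non-ASCII stand-ins
# do not depend on the platform default
with open(inLoc, "r", encoding="utf-8") as infile, open(outLoc, "w", encoding="utf-8") as outfile:
    for line in infile:  # stream line by line; the whole file is never in memory
        for src, target in replacements.items():
            line = line.replace(src, target)
        outfile.write(line)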

Input text:

Ydych Chi'n hoffi COFFI?

Dda de. Lle why ti llywelyn?

Output text:

Ydyƒ †i'n ho¤i CO¦I?

•a de. ¿e why ti ºywelyn?
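
A possible alternative to chaining one `.replace()` per digraph is a single compiled regular expression that handles every digraph in one left-to-right pass over each line. This is only a sketch under the assumption that the mapping is the partial one from the question; the function name `encode_line` is made up here, and whether it is actually faster on multi-GB files would need measuring.

```python
import re

# Partial mapping copied from the question (not all 24 forms are present).
replacements = {'ch': 'ƒ', 'Ch': '†', 'ff': '¤', 'FF': '¦',
                'Dd': '•', 'll': 'º', 'Ll': '¿'}

# One alternation of all digraphs; re.escape guards against keys that
# might contain regex metacharacters.
pattern = re.compile("|".join(re.escape(src) for src in replacements))

def encode_line(line):
    """Replace every known digraph in `line` in a single pass (hypothetical helper)."""
    return pattern.sub(lambda match: replacements[match.group(0)], line)

print(encode_line("Ydych Chi'n hoffi COFFI?"))
print(encode_line("Dda de. Lle why ti llywelyn?"))
```

Because each position in the line is matched at most once, a replacement character can never be picked up again by a later substitution, which removes any dependence on the order of the dictionary.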
hjalpmig
  • What's your python version? And have you tried anything yet? – Mazdak Apr 11 '16 at 15:13
  • When you say there are digraphs, do you mean that your text is encoded in UTF-8 and those characters take two octets? – Luc DUZAN Apr 11 '16 at 15:22
  • The character codes listed in the question (131, 140, 164) are not ASCII. ASCII only uses codes up to 127. Neither are they Unicode (e.g., code 131 in Unicode is a special thing meaning "please don't put a break here"). Nor are they ISO 8859-1 nor ISO 8859-15 (which have commonly used European characters). Nor ISO 8859-14 (which is maybe the ISO 8859 set best suited for Welsh). So what are they? – Gareth McCaughan Apr 11 '16 at 15:25
  • @Kasramvd My Python version is 3.4.2. I'm trying a couple of different things; not sure which is the best option for efficiency – hjalpmig Apr 11 '16 at 15:25
  • @GarethMcCaughan I used them as reference from this website http://www.ascii-code.com/ - under extended ASCII codes – hjalpmig Apr 11 '16 at 15:27
  • Luc, I'm pretty sure the meaning is that some pairs of (plain ol' ASCII) characters denote a single Welsh letter. E.g., two "d"s in a row in Welsh denote a sound like the one at the start of the English word "the". The pairs of ordinary Latin letters are the usual way to represent these in Welsh text, so I'm not quite sure why hjalpmig wants to do this -- perhaps just for convenience in future processing (with, presumably, a translation back to the usual digraphs for human consumption at the end). – Gareth McCaughan Apr 11 '16 at 15:28
  • @LucDUZAN I mean that, as a part of the Welsh alphabet, Ch is one letter represented by two characters (C and h). However, in my text files these letters are represented by two separate characters, as if you were writing them in English. The purpose of this question is so that I may merge these two characters into a single one. – hjalpmig Apr 11 '16 at 15:29
  • OK. So I'm afraid that website's terminology is a bit strange. Your characters are actually from something called CP-1252, which is a Microsoft-only thing similar (but not identical) to the ISO standard ISO 8859-1. Neither of them is the same thing as ASCII. – Gareth McCaughan Apr 11 '16 at 15:30
  • did you try `s.replace('\x8c', 'dd')`? – tobilocker Apr 11 '16 at 15:30
  • tobilocker, hjalpmig is looking for the opposite of that. – Gareth McCaughan Apr 11 '16 at 15:30
  • did you try `re.sub("[c,C][h,H]",chr(131),text)`? – Tadhg McDonald-Jensen Apr 11 '16 at 15:32
  • hjalpmig, one other question: Is this a thing that's going to be done many times, or just occasionally? Because you may save more time by just doing the easiest thing than you would by looking hard for an extra-efficient solution. (In this case the easiest thing is probably something like this: read the file in line by line; for each line do 24 `.replace()`s and then write it out.) – Gareth McCaughan Apr 11 '16 at 15:32
  • Oh, another thing. Are you sure you want to do these replacements absolutely everywhere possible? Or might your files contain snippets of other languages, computer code, etc., that shouldn't be messed with? – Gareth McCaughan Apr 11 '16 at 15:34
  • @GarethMcCaughan Yes, this is what I was initially thinking (having 24 replace statements). I thought there might be a better way of doing it, but if it isn't too slow it may be what I need. Also, a slightly unrelated note: do you have a good reference for any character codes I can use to replace? I thought the website I used before would have been reliable, but it seems not. – hjalpmig Apr 11 '16 at 15:34
  • What kind of reliability do you actually care about here? I mean, it looks to me as if your replacement characters are pretty much arbitrary -- they don't have any connection with the Welsh letters you're using them to represent. So why does it matter which character codes you use or what else they're used for? (I don't think that website is unreliable, exactly; it's just that it doesn't make the distinction between ASCII, ISO 8859-1, and CP1252 as clear as it could.) – Gareth McCaughan Apr 11 '16 at 15:36
  • @GarethMcCaughan I believe they should be everywhere. As some more background info - I'm attempting this simply as a pre-compression step and so all changes made will be reverted after compression. – hjalpmig Apr 11 '16 at 15:37
  • Maybe it's me that is stupid, but I would be surprised if there is something more efficient than str.replace. – Luc DUZAN Apr 11 '16 at 15:40
  • I'd probably iterate through all the characters doing if blocks on each, I'd imagine that'd be faster than a whole bunch of replaces (as this would only do one loop). – user161778 Apr 11 '16 at 15:43
  • If you're going to be compressing these, then I would strongly suggest just not bothering with this transformation at all. Something like gzip will do a much better job of finding short strings that can profitably be replaced with something smaller than you will do by hand. – Gareth McCaughan Apr 11 '16 at 15:43
  • @GarethMcCaughan It's more for research purposes. I want to see if a large text file compressed by gzip with this pre-compression step will provide any better results than if it were just compressed by gzip. I've got a working version now; check my update – hjalpmig Apr 11 '16 at 15:47
  • @hjalpmig I think you want the answer from [How can I do multiple substitutions using regex in python?](http://stackoverflow.com/questions/15175142/how-can-i-do-multiple-substitutions-using-regex-in-python) – Tadhg McDonald-Jensen Apr 11 '16 at 15:59
  • If this is for research purposes then I think you should do it the simplest, easiest to implement, hardest to get wrong, way. Then if it turns out to be helpful you can make a fast implementation (which quite likely will not be in Python, and whose structure may be somewhat unlike anything you'd want to do in Python) for actual use in whatever your compression application is. – Gareth McCaughan Apr 11 '16 at 16:52
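
For the research goal discussed in these comments (does the substitution help gzip, and can it be undone afterwards?), a small experiment could look like the sketch below. It assumes the partial mapping from the question and that none of the replacement characters already occur in the input, since the reverse step is otherwise ambiguous. The function names `encode`, `decode` and `gzip_size` are made up for this illustration.

```python
import gzip

# Partial mapping copied from the question.
replacements = {'ch': 'ƒ', 'Ch': '†', 'ff': '¤', 'FF': '¦',
                'Dd': '•', 'll': 'º', 'Ll': '¿'}
# Reverse mapping for restoring the digraphs after decompression.
inverse = {target: src for src, target in replacements.items()}

def encode(text):
    """Apply every digraph replacement, as in the question's loop."""
    for src, target in replacements.items():
        text = text.replace(src, target)
    return text

def decode(text):
    """Undo encode(); only safe if the input never contained the stand-ins."""
    for target, src in inverse.items():
        text = text.replace(target, src)
    return text

def gzip_size(text, encoding="utf-8"):
    """Return the gzip-compressed size of `text` in bytes."""
    return len(gzip.compress(text.encode(encoding)))

sample = "Ydych Chi'n hoffi COFFI?\nDda de. Lle why ti llywelyn?\n"
encoded = encode(sample)

assert decode(encoded) == sample  # the round trip restores the original
print("plain:", gzip_size(sample), "encoded:", gzip_size(encoded))
```

One thing worth checking in such an experiment is the output encoding: in UTF-8 each of these replacement characters takes two or three bytes, so the substitution does not shrink the raw text at all, whereas in CP-1252 each takes one byte. Passing `encoding="cp1252"` to `gzip_size` lets both cases be compared, and a sample this small says nothing about multi-GB files.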

0 Answers