0

I'm attempting to replace certain words (essentially the 2nd word of each line) in a text file, and then write them back out to either a new file, or overwrite the existing file.

I thought I was making progress, but when I went to write to a new file, I got an error saying I couldn't write a list to a text file. I can't simply replace a word for another word, because I have an 'else' clause that covers any word that doesn't match the others I need to replace.

Here's an example of the text I'm trying to modify, this text is contained in a .txt file:

id int,
organization_id int,
billing_month date,
fee_type varchar(100),
rate float,
price float,
uom varchar(25),
amount float,
currency_code_id float,
process_ts timestamptz NOT NULL DEFAULT (now())::timestamptz(6)

I'd like to change:

'int' --> 'BIGINT'
'numeric' --> 'DOUBLE'
'float' --> 'DOUBLE'
ELSE other data type --> 'STRING' .  

In the original data, notice that some have other characters, such as "varchar(100)" - I'd like to replace those with "STRING" and eliminate the '(100)' piece as well.

And then either overwrite or create a new text file. So the above example output if replaced properly would be:

id BIGINT,
organization_id BIGINT,
billing_month STRING,
fee_type STRING,
rate DOUBLE,
price DOUBLE,
uom STRING,
amount DOUBLE,
currency_code_id DOUBLE,
process_ts STRING

I'm having trouble knowing if I should be creating lists, and then modifying them, and then writing those lists to the text file, or dictionaries, or some other method I'm not thinking of. I'm very much a beginner so apologies if this isn't very clear.

phenderbender
  • 625
  • 2
  • 8
  • 18
  • Does the solution need to be in python? Perl could do the trick as well; it's more suited to string replacement. – Nick Reed Oct 22 '19 at 15:31
  • yes I am 100% unfamiliar with Perl unfortunately so if this is possible in Python I'd like to stick to that – phenderbender Oct 22 '19 at 15:33
  • I just added the 'process_ts' to my list of strings to replace as well, because this is another edge case I need to consider. Not all of my string lines are simply 2 words long, if that matters – phenderbender Oct 22 '19 at 15:41

3 Answers3

1

Contents of txt.txt:

id int,
organization_id int,
billing_month date,
fee_type varchar(100),
rate float,
price float,
uom varchar(25),
amount float,
currency_code_id float,
process_ts timestamptz NOT NULL DEFAULT (now())::timestamptz(6)

Code:

with open('txt.txt', 'r') as f:
    text = f.read().splitlines()

mapping = {'int':'BIGINT',
           'numeric':'DOUBLE',
           'float':'DOUBLE'}

replaced_text = []
for line in text:
    # temporarily remove comma
    line = line.rstrip(',')
    split_line = line.split()
    other_text, dtype = split_line[0], split_line[1:]
    new_dtype = mapping.get(' '.join(dtype), 'STRING')
    new_line = '{} {},\n'.format(other_text, new_dtype)
    replaced_text.append(new_line)


with open('txt_replaced.txt', 'w') as f:
    f.writelines(replaced_text)

Contents of txt_replaced.txt:

id BIGINT,
organization_id BIGINT,
billing_month STRING,
fee_type STRING,
rate DOUBLE,
price DOUBLE,
uom STRING,
amount DOUBLE,
currency_code_id DOUBLE,
process_ts STRING,
Dan
  • 1,575
  • 1
  • 11
  • 17
  • thanks, I just edited my original post to include "process_ts timestamptz NOT NULL DEFAULT (now())::timestamptz(6)" as a line I need to edit. Will this change your solution at all knowing that some lines of strings are more than 2 strings long? The fact that not all lines I need to edit are 2 strings long is giving me the most trouble – phenderbender Oct 22 '19 at 15:41
  • Thanks for your help - I'm still having trouble with the 'process_ts' piece though, I need that to read: "process_ts STRING", without all of the other stuff in between (this stuff: "timestamptz NOT NULL DEFAULT (now())::timestamptz(6)") - is there a way to eliminate that? – phenderbender Oct 22 '19 at 15:50
  • i see, so the field name is always one string with no spaces, but the data type might have any number of spaces? – Dan Oct 22 '19 at 15:52
  • @phenderbender i've updated the code to handle the edge case – Dan Oct 22 '19 at 15:58
  • thank you so much, confirming this looks to be working as intended. I'm a bit lost on the logic behind the new_dtype being able to handle a data type not included in your 'mapping' dictionary. Is there a simple way to explain that piece to a noob like myself? – phenderbender Oct 22 '19 at 16:06
  • the important bit of logic is the `get` method, see here: https://stackoverflow.com/questions/11041405/why-dict-getkey-instead-of-dictkey – Dan Oct 22 '19 at 16:08
  • essentially, if a key isn't present in the dict (such as `varchar(25)`), looking up that key with `get` will return the default value (which is set to `STRING`) – Dan Oct 22 '19 at 16:09
  • thanks so much for that link for a further explanation, makes sense now – phenderbender Oct 22 '19 at 16:56
1

You can iterate through each line and use a dictionary to replace the values in the second position of each line. This works for lines of any length, as long as the text to replace is the second word.

#vals to replace
replace_vals = {'int':'BIGINT', 'numeric':'DOUBLE', 'float':'DOUBLE'}

#file we write to
with open('out.txt', 'w') as outfile:
  #file we read from
  with open ("in.txt", 'r') as infile:
    #check each line
    for line in infile:
      #split line into words
      words = line.split()
      #get the first word and then replace the second word, defaulting to STRING
      w = words[0] + " " + replace_vals.get(words[1], 'STRING')
      #add a final newline
      w += "\n"
      #print to file
      outfile.write(w)

Demo

Nick Reed
  • 4,989
  • 4
  • 17
  • 37
  • thanks for the reply but unfortunately the word to replace isn't always the 2nd word in this example, (particularly the process_ts piece). – phenderbender Oct 22 '19 at 16:15
  • Do you mean the word to replace might occur in, say, position 3, or do you mean that if there's more than two words, replace them *all*? i.e. will I ever get `some var int` – Nick Reed Oct 22 '19 at 16:17
  • for cases (such as: "process_ts timestamptz NOT NULL DEFAULT (now())::timestamptz(6)" the 2nd word will let me know what I need to replace it with, and then I need to not only replace the 2nd word, but also any trailing text after that 2nd word as well. so, "process_ts timestamptz NOT NULL DEFAULT (now())::timestamptz(6)" would need to be "process_ts STRING" after being converted – phenderbender Oct 22 '19 at 16:18
  • Glad to help - please consider upvoting and accepting the answer if it addresses your question. – Nick Reed Oct 22 '19 at 16:24
  • upvoted but I'm still below the threshold for it counting since I'm new on the site. Hopefully that still helps! – phenderbender Oct 22 '19 at 16:56
0

Maybe something like this may help you:

import os, sys
from re import match as regexSearch

path = os.path.dirname(__file__)
myFile = open(os.path.join(path, "filename.txt"), "r")

regExpr = r"[\w_]+ ([\w\(\)\d:]+)(,|\s)"


with open(os.path.join(path, "newFile.txt"), "w") as f:
    for line in myFile.readlines():
        match = regexSearch(regExpr, line)
        if match:
            result = match.group(1)
            if "int" in result:
                f.write(line.replace(result, "BIGINT"))
            elif result in ["numeric", "float"]:
                f.write(line.replace(result, "DOUBLE"))
            else:
                f.write(line.replace(result, "STRING"))
        else:
            print("couldn't find something in line:\n", line)
    f.close()

Sofien
  • 478
  • 3
  • 12