
I have a big file (50 GB) and I use this script to remove every dot except the dots after the @. Example file.tsv:

 a.a.aabcd@mail.com
 bbbb.ccc.c@mail.com
 abdc@mail.com

My script:

import codecs
import sys

# Reads the whole file into memory, then strips every '.', including those after '@'
contents = codecs.open('file.tsv', encoding='utf-8').read()
sys.stdout = open("newFile.tsv", "w")
print(contents.replace('.', ''))
sys.stdout.close()

Output:
 aaaabcd@mailcom
 bbbbcccc@mailcom
 abdc@mailcom

I want to get:

 aaaabcd@mail.com
 bbbbcccc@mail.com
 abdc@mail.com

That is, remove every dot except the one in mail.com.

For now I use a Linux command to fix it afterwards:

os.system('time sed -i \'s/@mailcom/@mail.com/g\' newFile.tsv')
Younes Zaidi
  • There are several ways to do this, a usual way would use a for loop to go over each line of the output and to process one line at a time. On each line there are several ways to figure out how to remove every period but the last one. Please have a go and then if you get stuck provide what you tried. – Andrew Allaire Oct 22 '21 at 16:06
  • The file is 50 GB with millions of lines; I can't use a loop. – Younes Zaidi Oct 22 '21 at 16:08
  • Do you really mean "except the last one"? What about `a.b.c@foo.co.uk`? You want to change that to `abc@fooco.uk`? – Barmar Oct 22 '21 at 16:11
  • *The file is 50 GB with millions of lines; I can't use a loop* - of course you will use a loop. Also, as the comments and the partial answer suggest, you may be better off splitting on the @ character and removing dots from the first half while keeping the second half unchanged. – tevemadar Oct 22 '21 at 16:12
  • Going through one line at a time should be less of a problem with large files than loading the entire contents like you do now. – Andrew Allaire Oct 22 '21 at 16:13
  • @Barmar You are right; in that case we need to keep every dot after the @. – Younes Zaidi Oct 22 '21 at 16:17
  • A loop is a bad solution for me because it takes a long time. I use the sed command os.system('time sed -i \'s/@mailcom/@mail.com/g\' newFile.tsv'), but I want to do it in Python while the file is open. – Younes Zaidi Oct 22 '21 at 16:18
  • Load it into a pandas dataframe, using `@` as the column delimiter. Then you can replace all the `.` in the first column. – Barmar Oct 22 '21 at 16:19
  • Thank you, checking. – Younes Zaidi Oct 22 '21 at 16:23
  • You really have no choice but to process the file line by line. This will happen even if you use pandas, a regular expression, or something else to do the processing; it just might not be explicit. – martineau Oct 22 '21 at 17:21
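
The approach described in the comments above (read one line at a time, split on the @, strip dots only from the part before it) could look roughly like the sketch below. It is not from the original thread: the names `strip_local_dots` and `process_file` are hypothetical, and it assumes one address per line, UTF-8 text, and that lines without an @ should pass through unchanged.

```python
def strip_local_dots(line):
    """Remove '.' from the part before the last '@'; leave the rest untouched."""
    local, at, domain = line.rpartition("@")
    if not at:
        # Assumption: a line with no '@' is copied through unchanged.
        return line
    return local.replace(".", "") + at + domain


def process_file(src_path, dst_path):
    """Stream src_path to dst_path one line at a time.

    Only one line is held in memory at once, so the 50 GB file
    is never loaded whole.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(strip_local_dots(line))
```

Calling `process_file("file.tsv", "newFile.tsv")` would then replace both the original script and the follow-up sed call, since the dots after the @ are never removed in the first place.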

1 Answer


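You can split the address on the @ and use a regular expression to remove the dots from the first part only:

import re

mail = "a.a.aabcd@mail.com"
mail_split = mail.split("@")
# Strip dots from the part before the '@', then reattach the untouched domain
newmail = re.sub(r"\.", "", mail_split[0]) + f"@{mail_split[-1]}"
print(newmail)
# prints: aaaabcd@mail.com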
Surya Tej
  • How to apply this to the entire file? – Barmar Oct 22 '21 at 16:09
  • @Barmar Because it is a large file, you might have luck applying the replacement line by line so as not to run out of memory: https://stackoverflow.com/questions/6475328/how-can-i-read-large-text-files-in-python-line-by-line-without-loading-it-into – Jeremy Bailey Oct 22 '21 at 16:50
  • The OP claims that looping over the file will be too slow. – Barmar Oct 22 '21 at 16:51
  • My mistake, didn't read through all the comments. It might be possible to use threading to speed it up if doing it line by line is too slow. Other than that or splitting the file into smaller files, I can't think of any more options. https://stackoverflow.com/questions/11196367/processing-single-file-from-multiple-processes – Jeremy Bailey Oct 22 '21 at 17:34
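
If a per-line Python loop does turn out to be too slow, one option to try before threads or splitting the file is to work on large binary blocks and let a single regular expression do the per-line work. The following is only a sketch under assumptions, not something proposed in the thread: `process_in_blocks` and the 64 MiB `block_size` are hypothetical, the file is assumed to be UTF-8 (or another ASCII-compatible encoding) with at most one @ per line, and no timing has been measured on a 50 GB input.

```python
import re

# A '.' is removed only if an '@' follows later on the same line.
# Assumption: at most one '@' per line, as in the sample data.
DOT_BEFORE_AT = re.compile(rb"\.(?=[^@\n]*@)")


def process_in_blocks(src_path, dst_path, block_size=64 * 1024 * 1024):
    """Rewrite src_path to dst_path in large binary blocks cut at line ends."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        carry = b""  # incomplete last line held over from the previous block
        while True:
            block = src.read(block_size)
            if not block:
                break
            data = carry + block
            # Index just past the last newline; 0 if the data holds none.
            cut = data.rfind(b"\n") + 1
            dst.write(DOT_BEFORE_AT.sub(b"", data[:cut]))
            carry = data[cut:]
        # Final line that has no trailing newline, if any.
        dst.write(DOT_BEFORE_AT.sub(b"", carry))
```

The file is still processed line by line in effect, because the pattern never looks past a newline; the loop simply runs over a few hundred blocks instead of millions of lines. Disk throughput may well remain the limiting factor.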