
I have a 40 GB text file containing lines like the following:

55655653:foo

6654641:balh2

I've written a batch script to find and remove :foo, keeping only the number before it.

Batch script :

 @echo on

 ((for /f "tokens=1 delims=:" %%b in (C:\data.txt) do ( echo %%b)) >C:\dataFinal.txt
 )
pause

The problem is that the batch script is not able to read the big 40 GB file.

So I decided to write Python code to do the same:

f1 = open('data.txt', 'r')
f2 = open('dataFinal.txt', 'w')
for line in f1:
    f2.write(line.replace(':', ''))
f1.close()
f2.close()

What I'm missing here is how to specify that the text after the : should also be removed; in the batch file this is done with tokens=1 delims=:

Please note the file size.

I've generated the 40GB file using Java code (maybe this info can help us with something):

BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF8"));
while (in.ready()) {
   String line = in.readLine();
   PrintStream out = new PrintStream(System.out, true, "UTF-8");
   out.println(initializeKeyPair(line).toString() + ":" + line );
}
Phoenix
xhxx
  • 1
    possible duplicate of [Python string.replace regular expression](http://stackoverflow.com/questions/16720541/python-string-replace-regular-expression) – Paco Abato Jan 30 '15 at 07:28
  • Are you talking about Windows? –  Jan 30 '15 at 07:58
  • If you created it via Java code - why don't you just re-run it and remove the `+ ": " + line` ? – Jon Clements Jan 30 '15 at 08:21
  • @Jon Clements I needed two copies of the file, one with Number:Text and one with only Number; generating this file took around 4 days. I just realized that I could have added another PrintStream to save only initializeKeyPair(line).toString() (the Number), so it would write both files, but it's too late now. – xhxx Jan 30 '15 at 08:39

3 Answers


You can use str.partition to keep only the text before the first colon:

with open('data.txt') as fin, open('dataFinal.txt', 'w') as fout:
    fout.writelines(line.partition(':')[0] + '\n' for line in fin)

Note that we're using with here so the files are closed automatically, and a generator expression that loops over fin, splits each line, takes everything up to the first :, and writes it to fout with a newline appended.
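As a small illustration of what str.partition returns, here is a sketch using made-up sample strings in the question's number:text format (key_of is a hypothetical helper, not part of the answer). It always yields a 3-tuple and never raises, even when the separator is missing:

```python
# Illustrative sample lines; the values are assumptions, not real data.
samples = ['55655653:foo\n', '6654641:a:b\n', 'no-separator\n']

for line in samples:
    # partition splits at the FIRST ':' only and returns (head, sep, tail).
    head, sep, tail = line.partition(':')
    print(repr(head), repr(sep), repr(tail))


def key_of(line):
    """Hypothetical helper: text before the first ':', without a line ending."""
    return line.partition(':')[0].rstrip('\n')
```

One detail worth noticing: a line with no : comes back whole, including its own newline, so the one-liner above would write an extra blank line after it. Stripping the newline first, as key_of does, avoids that.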

You may wish to specify the encoding:

import io

with io.open('data.txt', encoding='utf-8') as fin, io.open('dataFinal.txt', 'w', encoding='utf-8') as fout:
    fout.writelines(line.partition(':')[0] + '\n' for line in fin)
Jon Clements
  • Thank you, this code works; however, it only wrote 140 KB from the original 40 GB file. Maybe it can't read a big text file? – xhxx Jan 30 '15 at 07:42
  • @xhxx do you have any binary data in the file? Like an EOF marker? You might want to try `open('data.txt', 'rb')` and see what happens... – Jon Clements Jan 30 '15 at 07:43
  • open('data.txt', 'rb') gave the same result. The file contains many different characters (Chinese, Korean, French letters and more, plus every keyboard symbol you can think of); the text file is UTF-8. – xhxx Jan 30 '15 at 07:56
  • @xhxx that's quite a significant bit of information to omit from your question (can you [edit](http://stackoverflow.com/posts/28231175/edit) your post to include that) and also state which version of Python you're using? – Jon Clements Jan 30 '15 at 08:14
  • Is there a way I can dump the first 7000 lines of the 40 GB file to a new text file, so I can see whether something causes it to stop at line 6700? I can't open the 40 GB file with Notepad, Notepad++, or vim (already tried). – xhxx Jan 30 '15 at 08:16
  • 1
    @xhxx `from itertools import islice` then change the line above to end with `for line in islice(fin, 7000)` – Jon Clements Jan 30 '15 at 08:17
  • Do you mean like this: `from itertools import islice with open('RealuniqFULL.txt') as fin, open('dataFinal.txt', 'w') as fout: fout.writelines(for line in islice(fin, 7000))` The error says: ` fout.writelines(for line in islice(fin, 7000)) ^ SyntaxError: invalid syntax` – xhxx Jan 30 '15 at 08:32
  • @xhxx no: `fout.writelines(line.partition(':')[0] + '\n' for line in islice(fin, 7000))` – Jon Clements Jan 30 '15 at 08:35
  • @xhxx anyway... I've edited the answer to include a version that explicitly specifies the encoding to use... try that – Jon Clements Jan 30 '15 at 08:43
  • Same issue, just 6700 lines (140 KB). What I meant by dumping the first 7000 lines is the original number:word lines, like copy-pasting the first 7000 lines. – xhxx Jan 30 '15 at 08:44
  • 2
    @xhxx try the version with the encoding – Jon Clements Jan 30 '15 at 08:50
  • Thank you very much, the encoding='utf-8' solved the problem! Thanks, mate. – xhxx Jan 30 '15 at 08:55
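For the request in the comments to copy the first 7000 original lines unchanged, a minimal sketch could look like the following. The function name dump_head and the output file name head.txt are assumptions for illustration. It reads in binary mode so no decoding happens and the bytes are preserved exactly for inspection; whether it would have got past line 6700 on the asker's system is not something the thread settles.

```python
from itertools import islice


def dump_head(src_path, dst_path, n_lines=7000):
    """Copy the first n_lines of src_path to dst_path byte-for-byte.

    Hypothetical helper: binary mode means lines are split on b'\\n' only
    and nothing is decoded, so unusual bytes survive unchanged.
    """
    with open(src_path, 'rb') as fin, open(dst_path, 'wb') as fout:
        # islice stops after n_lines, so the rest of the file is never read.
        fout.writelines(islice(fin, n_lines))
```

It would be called as dump_head('data.txt', 'head.txt'), after which head.txt is small enough to open in an ordinary editor.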

You may easily process a data file of any size via a Batch file with this method:

@echo off

rem Use a subroutine to read from C:\data.txt and write to C:\dataFinal.txt
rem the subroutine must be in a separate .bat file and must be called via CMD.EXE

cmd /C call ProcessFile.bat  < C:\data.txt  > C:\dataFinal.txt
pause

This is ProcessFile.bat:

@echo off
setlocal EnableDelayedExpansion

rem Process lines of input file in an endless loop
for /L %%i in ( ) do (

   rem Read next line and check for EOF
   set "line="
   set /P "line="
   if not defined line exit /B

   rem Process line read
   for /F "delims=:" %%b in ("!line!") do echo %%b

)

Note that this method stops reading the input file at the first empty line; this can be fixed if needed.

Aacini

You should use line.split():

>>> line = '55655653:foo'
>>> line, _ = line.split(':', 1)
>>> print(line)
55655653

Note that this also drops the trailing '\n' (it stays with the discarded part), so you should add it back manually (or use print). Also, line, _ = line.split(':', 1) raises a ValueError if : is not in the line.

So your code would look something like this:
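To make the caveat about a missing : concrete, here is a small sketch (the sample string and the helper name first_field are made up for illustration). Unpacking into two names fails when split returns a single item, while indexing the first item always works:

```python
line = 'no separator here\n'

try:
    # split returns a one-item list here, so two-name unpacking fails.
    number, _ = line.split(':', 1)
except ValueError:
    number = None  # the line had no ':' at all


def first_field(line):
    """Hypothetical helper: element [0] of split() always exists."""
    return line.split(':', 1)[0].rstrip('\n')
```

Using first_field(line) + '\n' inside the loop would avoid the exception, at the cost of silently passing through lines that do not match the expected format.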

f1 = open('data.txt', 'r')
f2 = open('dataFinal.txt', 'w')
for line in f1:
    line, _ = line.split(':', 1)
    f2.write(line + '\n')
f1.close()
f2.close()

(Note that Jon Clements provided a prettier way to work with files.)

myaut
  • Thank you. How can I edit the code to read from a file? I replaced f2.write(line.replace(':', '')) with f2.write(line, _ = line.split(':', 1)), but the error says: f2.write(line, _ = line.split(':', 1)) TypeError: write() takes no keyword arguments – xhxx Jan 30 '15 at 07:47
  • This code also worked, but as with the code Jon Clements provided, it only wrote 140 KB of the file. – xhxx Jan 30 '15 at 08:08