I have a 40 GB text file containing lines like the following:
55655653:foo
6654641:balh2
I've written a batch script that removes the colon and everything after it (e.g. :foo), keeping only the number before the colon.
Batch script:
@echo on
((for /f "tokens=1 delims=:" %%b in (C:\data.txt) do ( echo %%b)) >C:\dataFinal.txt
)
pause
The problem with the batch script is that it cannot read the large 40 GB file.
So I decided to write Python code to do the same thing:
f1 = open('data.txt', 'r')
f2 = open('dataFinal.txt', 'w')
for line in f1:
    f2.write(line.replace(':', ''))
f1.close()
f2.close()
This only removes the colon itself. What I'm missing is how to also remove the text after the colon; in the batch file this is done with tokens=1 delims=:
Please keep the file size in mind: the file is too large to load into memory at once.
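For reference, one way to get roughly the same effect as tokens=1 delims=: in Python is `str.partition`, which splits a string at the first occurrence of a separator. The following is a minimal sketch, not a definitive solution: the function name `strip_after_colon` is hypothetical, the file names are the ones used above, and UTF-8 encoding is an assumption. Lines without a colon are written out unchanged.

```python
def strip_after_colon(src_path, dst_path):
    """Copy src_path to dst_path, keeping only the text before the first ':' on each line."""
    # Iterating over the file object reads one line at a time,
    # so memory use stays small even for a very large input file.
    # Assumption: both files are UTF-8 encoded.
    with open(src_path, 'r', encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            # partition(':') returns (before, separator, after); keep only `before`.
            key = line.rstrip('\n').partition(':')[0]
            dst.write(key + '\n')


# Example call with the file names from the question:
# strip_after_colon('data.txt', 'dataFinal.txt')
```

Because the loop handles one line at a time and writes it out immediately, the 40 GB input never has to fit in memory.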
I generated the 40 GB file using Java code (in case this information helps):
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF8"));
// Create the output stream once instead of once per line
PrintStream out = new PrintStream(System.out, true, "UTF-8");
String line;
// readLine() returns null at end of file
while ((line = in.readLine()) != null) {
    out.println(initializeKeyPair(line).toString() + ":" + line);
}
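The Java code writes UTF-8, and in UTF-8 the colon is a single ASCII byte that never appears inside a multi-byte character. That means the split can also be done on raw bytes, without decoding the text at all. Below is a rough sketch of that idea; the function name `strip_after_colon_bytes` and the 1 MiB buffer size are assumptions chosen for illustration, and whether this is actually faster than text mode on a given machine is worth measuring rather than assuming.

```python
def strip_after_colon_bytes(src_path, dst_path, buffer_size=1024 * 1024):
    """Byte-level variant: keep only the bytes before the first b':' on each line."""
    # Binary mode skips UTF-8 decoding and encoding entirely.
    # buffer_size is an illustrative value, not a tuned one.
    with open(src_path, 'rb', buffering=buffer_size) as src, \
         open(dst_path, 'wb', buffering=buffer_size) as dst:
        for line in src:
            # rstrip removes a trailing \n or \r\n, whichever the writer produced.
            key = line.rstrip(b'\r\n').partition(b':')[0]
            dst.write(key + b'\n')


# Example call with the file names from the question:
# strip_after_colon_bytes('data.txt', 'dataFinal.txt')
```

Note that this version always writes `\n` line endings, regardless of what the input file used.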