
I have several large text files that all have the same structure. I want to delete the first 3 lines and then remove illegal characters from the 4th line. I don't want to read the entire dataset into memory and then modify it, as each file is over 100 MB with over 4 million records.

Range   150.0dB -64.9dBm
Mobile unit 1   Base    -17.19968    145.40369  999.8
Fixed unit  2   Mobile  -17.20180    145.29514  533.0
Latitude    Longitude   Rx(dB)  Best unit
-17.06694    145.23158  -050.5  2
-17.06695    145.23297  -044.1  2

So lines 1, 2 and 3 should be deleted, and in line 4, "Rx(dB)" should become just "Rx" and "Best unit" should be changed to "Best_Unit". Then I can use my other scripts to geocode the data.

I can't use a simple command-line search-and-replace like grep (as in this question), because the first 3 lines are not identical across files: the numbers (such as 150.0dB, -64*) change in each file. The whole of lines 1-3 has to be deleted outright, and then grep or similar can do the search-replace on line 4.

Thanks guys,

=== EDIT: new pythonic way to handle larger files, adapted from @heltonbiker's answer. It gives an error:

import os, re
##infile = arcpy.GetParameter(0)
##chunk_size = arcpy.GetParameter(1) # number of records in each dataset

infile='trc_emerald.txt'
fc= open(infile)
Name = infile[:infile.rfind('.')]
outfile = Name+'_db.txt'

line4 = fc.readlines(100)[3]
line4 = re.sub('\([^\)].*?\)', '', line4)
line4 = re.sub('Best(\s.*?)', 'Best_', line4)
newfilestring = ''.join(line4 + [line for line in fc.readlines[4:]])
fc.close()
newfile = open(outfile, 'w')
newfile.write(newfilestring)
newfile.close()

del lines
del outfile
del Name
#return chunk_size, fl
#arcpy.SetParameterAsText(2, fl)
print "Completed"

Traceback (most recent call last):
  File "P:\2012\Job_044_DM_Radio_Propogation\Working\FinalPropogation\TRC_Emerald\working\clean_file_1c.py", line 13, in <module>
    newfilestring = ''.join(line4 + [line for line in fc.readlines[4:]])
TypeError: 'builtin_function_or_method' object is unsubscriptable

GeorgeC

  • Python is overkill. Use `sed`. – wim Feb 27 '12 at 23:18
  • Why is Python a requirement here. This seems like a great job for UNIX commands. Not sure what environment you're on, but yeah. – Brian Feb 27 '12 at 23:19
  • I am using arcpy in the follow-up steps to geocode/buffer etc. the dataset (see http://gis.stackexchange.com/questions/20892/buffer-point-dataset-based-on-compared-value for details). The issue with sed and grep is that I don't see a way of deleting the first three non-unique lines - a simple search and replace won't work, as per the last paragraph of my question (just added). – GeorgeC Feb 27 '12 at 23:27
  • Unless you edit in place all the characters from the first 4 rows with spaces (not exactly what you asked for but might work for what you need), or rewrite all the subsequent lines 4 lines up, truncate and resave the file (bad idea), your best bet is to go with Gordon's solution with `sed`. – enticedwanderer Feb 27 '12 at 23:57

3 Answers


As wim said in the comments, sed is the right tool for this. The following command should do what you want:

sed -i -e '4 s/(dB)//' -e '4 s/Best unit/Best_Unit/' -e '1,3 d' yourfile.whatever

To explain the command a little:

-i executes the command in place, that is, it writes the output back into the input file

-e execute a command

'4 s/(dB)//' on line 4, substitute '' for '(dB)'

'4 s/Best unit/Best_Unit/' same as above, with a different find and replace pair (note the pattern matches the lowercase "unit" actually in the header)

'1,3 d' from line 1 to line 3 (inclusive), delete the entire line

sed is a really powerful tool, which can do much more than just this, well worth learning.
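One portability note: cmd.exe on Windows does not treat single quotes as quoting characters, which is a common source of `unknown command` errors with sed one-liners; double quotes work in both cmd.exe and Unix shells. A runnable sketch against a throwaway sample file (GNU sed assumed; the `printf` setup just stands in for a real data file):

```shell
# Build a small sample file with the same structure as in the question
printf '%s\n' \
  'Range   150.0dB -64.9dBm' \
  'Mobile unit 1   Base    -17.19968    145.40369  999.8' \
  'Fixed unit  2   Mobile  -17.20180    145.29514  533.0' \
  'Latitude    Longitude   Rx(dB)  Best unit' \
  '-17.06694    145.23158  -050.5  2' > sample.txt

# Same commands as above, but double-quoted so cmd.exe also accepts them
sed -i -e "4 s/(dB)//" -e "4 s/Best unit/Best_Unit/" -e "1,3 d" sample.txt
```

After this runs, sample.txt starts at the cleaned header line, with the first three lines gone.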

Gordon Bailey
  • I'm not sure that will avoid reading the whole file as asked by the OP. – jcollado Feb 27 '12 at 23:33
  • That's a good point. It would be interesting to know how `sed` handles that kind of thing, although I hope it would be smart and ignore all lines other than 1-4. – Gordon Bailey Feb 27 '12 at 23:40
  • To some extent, you can't not read the entire file. How can you match something without reading it? Sed doesn't read the whole file into memory, so I think your answer is a good one. – prelic Feb 28 '12 at 00:38
  • I tried it and get an error which I can't figure out - C:\Radio Mobile\Output\Ian_Test>sed -i -e '4 s/(dB)//' it/' -e '1,3 d' trc_Longlands_10m.txt sed: -e expression #1, char 1: unknown command: `'' --- I tried it in individual components but it still gives same error. – GeorgeC Feb 28 '12 at 02:02
  • this is a total guess, but maybe try replacing `'4 s/(dB)//'` with `'4,4 s/(dB)//'` – Gordon Bailey Feb 28 '12 at 02:07
  • no luck. I also tried ` and ' - they give same result. C:\Radio Mobile\Output\Ian_Test>sed -i -e '4 s/(dB)//' trc_Longlands_10m.txt sed: -e expression #1, char 1: unknown command: `'' – GeorgeC Feb 28 '12 at 02:25
  • Did you try with replacing `4` with `4,4`? That's the only guess I have. If it's not that then I suspect it's some weirdness with whatever windows version of `sed` you are using. – Gordon Bailey Feb 28 '12 at 02:33
  • tried 4,4 as well...same issue. Why unrecognised command: `''? – GeorgeC Feb 28 '12 at 03:07
  • This is another shot in the dark, but maybe try removing all the spaces, e.g. `-e'4,4s/(dB)//'` – Gordon Bailey Feb 28 '12 at 03:11
  • Still no hurrah! I am using http://sourceforge.net/projects/gnuwin32/files/sed/4.2.1/sed-4.2.1-setup.exe/download is there a different version to use on win7 64 bit? – GeorgeC Feb 28 '12 at 03:16
  • I honestly have no experience with using these tools under windows. I am using that exact version of sed (`4.2.1`). Maybe it has something to do with whatever shell you are using (cygwin or something similar I assume?). It might be worth opening a new question for this, I doubt I can be much help here. – Gordon Bailey Feb 28 '12 at 03:20
  • stupid of me...I was trying to use sed in a dos window and not in cygwin. Thanks...sorry. Now I just need to figure out how to call this from an arcpy/model builder session. – GeorgeC Feb 28 '12 at 03:51
  • Oh, ok, I guess that would do it :). Good luck with the rest of it. – Gordon Bailey Feb 28 '12 at 03:56
  • Accepted solution as python failed when working with files over 500mb, even with all the solutions listed here. For smaller files cjrh solution was easier to integrate into my models. – GeorgeC Mar 23 '12 at 09:11

Just try it for each file. 100 MB per file is not that big, and as you can see, the code to just make an attempt is not time-consuming to write.

with open('file.txt') as f:
    lines = f.readlines()
lines[:] = lines[3:]
lines[0] = lines[0].replace('Rx(dB)', 'Rx')           # case must match the data
lines[0] = lines[0].replace('Best unit', 'Best_Unit')
with open('output.txt', 'w') as f:
    f.write(''.join(lines))    # readlines() keeps the newlines, so join with ''
Caleb Hattingh
  • First of all, with such a big file I would not read the whole file at once this way but rather use line reading in a generator loop. This way you don't have to use up 100 MB of RAM (plus the baggage PyObject and memory fragmentation add to it). – Frg Feb 28 '12 at 00:03
  • @Frg: how will you save the modifications? – Caleb Hattingh Feb 28 '12 at 00:44
  • It's not about not reading the rest of the file, but doing so on a line-by-line basis. My comment would probably make more sense if the file was more like 1 GB in size. – Frg Feb 28 '12 at 17:13
  • @cjrh - thanks for this. It works great. How can I split the dataset into 1 million line blocks at the same time? arcgis is choking on trying to buffer and process my files which are about 500mb with 10million lines. see http://stackoverflow.com/questions/9612882/split-a-large-text-xyz-database-into-x-equal-parts . BTW your python solution takes about the same time as cygwin (~3 mins) to run. – GeorgeC Mar 08 '12 at 04:36
  • My base txt databases grew to over 1gb now with 30m lines and the python option fails due to memory issues while the sed solution works with no issues. Wish I could select both your and @Gordon Bailey answer as I use both. Can you help fix the python? – GeorgeC Mar 11 '12 at 20:56
  • Can you modify the first few lines before the file grows too large? Also, I would put some effort into modifying whatever is creating the files in the first place. – Caleb Hattingh Mar 20 '12 at 08:13
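For the bigger files raised in these comments, a line-by-line variant along the lines of Frg's suggestion keeps memory use constant regardless of file size. A sketch (the sample-file setup exists only to make it self-contained; the filenames are placeholders):

```python
# Create a small sample file with the same structure as the real data
sample = (
    'Range   150.0dB -64.9dBm\n'
    'Mobile unit 1   Base    -17.19968    145.40369  999.8\n'
    'Fixed unit  2   Mobile  -17.20180    145.29514  533.0\n'
    'Latitude    Longitude   Rx(dB)  Best unit\n'
    '-17.06694    145.23158  -050.5  2\n'
)
with open('file.txt', 'w') as f:
    f.write(sample)

# Stream line by line: memory use stays constant no matter the file size
with open('file.txt') as src, open('output.txt', 'w') as dst:
    for i, line in enumerate(src):
        if i < 3:
            continue                          # drop the first three lines
        if i == 3:                            # clean the header (4th line)
            line = line.replace('Rx(dB)', 'Rx')
            line = line.replace('Best unit', 'Best_Unit')
        dst.write(line)
```

Because only one line is held at a time, this should behave the same on a 100 MB file as on a 1 GB one.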

You can use file.readlines() with an additional argument in order to read just the first few lines from the file. From the docs:

f.readlines() returns a list containing all the lines of data in the file. If given an optional parameter sizehint, it reads that many bytes from the file and enough more to complete a line, and returns the lines from that. This is often used to allow efficient reading of a large file by lines, but without having to load the entire file in memory. Only complete lines will be returned.
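To see the sizehint behaviour concretely, a tiny sketch (the exact number of lines returned depends on the interpreter's buffering, so treat the count as approximate):

```python
# Write a throwaway demo file of five short lines
with open('demo.txt', 'w') as f:
    f.write('one\ntwo\nthree\nfour\nfive\n')

# Ask for roughly 8 bytes' worth of lines: readlines() reads at least
# that much and then completes the current line, returning whole lines only
with open('demo.txt') as f:
    head = f.readlines(8)
```

Only the first couple of lines come back, each still terminated with its newline, and the rest of the file is left unread.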

Then the most robust way to manipulate generic strings is with regular expressions. In Python, this means the re module with, for example, the re.sub() function.

My suggestion, which should be adapted to suit your needs:

import re

f = open('somefile.txt')
first = f.readlines(100)   # ~100 bytes, rounded up to whole lines; assumes
                           # this chunk covers at least the first four lines
line4 = re.sub(r'\([^)]*\)', '', first[3])    # 'Rx(dB)' -> 'Rx'
line4 = re.sub(r'Best\s', 'Best_', line4)     # 'Best unit' -> 'Best_unit'
# keep any extra lines the first chunk pulled in, then read the rest
newfilestring = ''.join([line4] + first[4:] + f.readlines())
f.close()
newfile = open('someotherfile.txt', 'w')
newfile.write(newfilestring)
newfile.close()
heltonbiker
  • thanks...I am trying this solution as the pythonic way from cjrh doesn't work on my new larger files. I am getting an error (below), your adapted code is in edit to question. Any ideas? – GeorgeC Mar 11 '12 at 22:53
  • @GeorgeC where it says `[line for line in fc.readlines[4:]]` it should be `[line for line in fc.readlines()[4:]]`. (note the parentheses after readline - it is a function, must be used like readline(), and then its result indexed, like readline()[4:] – heltonbiker Mar 12 '12 at 12:03
  • Also, it might be faster, if you figure it out how, to use `f.read()` instead of `[l for l in f.readlines()]`. The problem is that you need to do `f.seek(p)` before, where `p` is the position in the file where the 4th line begins. – heltonbiker Mar 12 '12 at 12:04
  • just tried on a 1gb file -still get a memory error. Any other ideas - SED works fine on the same file but isn't as easy to use. – GeorgeC Mar 15 '12 at 03:47
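Building on the `f.seek`/`f.read` idea from the comment above, the remainder of the file can instead be streamed with `shutil.copyfileobj`, which copies in fixed-size chunks and avoids holding a gigabyte in memory. A sketch (filenames are placeholders; the sample-file setup exists only to make it self-contained):

```python
import re
import shutil

# Create a small sample input with the same structure as the real data
with open('somefile.txt', 'w') as f:
    f.write('Range   150.0dB -64.9dBm\n'
            'Mobile unit 1   Base    -17.19968    145.40369  999.8\n'
            'Fixed unit  2   Mobile  -17.20180    145.29514  533.0\n'
            'Latitude    Longitude   Rx(dB)  Best unit\n'
            '-17.06694    145.23158  -050.5  2\n')

with open('somefile.txt') as src, open('someotherfile.txt', 'w') as dst:
    for _ in range(3):
        src.readline()                          # discard the first three lines
    header = src.readline()                     # the 4th line
    header = re.sub(r'\([^)]*\)', '', header)   # 'Rx(dB)' -> 'Rx'
    header = re.sub(r'Best\s', 'Best_', header) # 'Best unit' -> 'Best_unit'
    dst.write(header)
    shutil.copyfileobj(src, dst)  # copy everything else in fixed-size chunks
```

Since `copyfileobj` never materialises the whole file, this sidesteps the MemoryError seen on the 1 GB inputs.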