
I have several large text files (30M+ lines, >1 GB) which are being processed in ArcGIS after splitting (see Remove specific lines from a large text file in python and chunk a text database into N equal blocks and retain header for background).

Even after splitting, the process takes over 3 days, so I want to delete all XY points which have an Rx value less than or equal to 0.

I haven't gotten Python to work on reading text datasets over 500 MB, so I have used Cygwin/sed commands to do the initial cleaning of the data and then Python to chunk the file. Ideally, then, I would add some code to the Python script (see below) to exclude all lines with Rx <= 0.

Latitude    Longitude   Rx  Best_Unit
-16.37617    144.68805  -012.9  7
-16.37617    144.68834  -015.1  7
-16.37617    144.68861  -017.2  7
-16.37617    144.68890  -018.1  7
-16.37617    144.68919  -025.0  7
-16.37617    144.68945  -019.5  7
-16.37617    144.68974  -020.0  7
-16.37617    144.69003  -020.4  7
-16.37617    144.69623   015.3  7
-16.37617    144.69652   015.6  7
-16.37617    144.69679   015.8  7
-16.37617    144.69708   016.0  7
-16.37617    144.70076   005.0  7
-16.37617    144.70103   002.2  7
-16.37617    144.70131  -000.2  7
-16.37617    144.70160  -001.5  7
-16.37617    144.70187  -001.0  7
-16.37617    144.70216   000.7  7
-16.37617    144.70245   002.2  7
-16.37617    144.70273   008.4  7
-16.37617    144.70300   017.1  7
-16.37617    144.70329   017.2  7

I want all rows (lines) where Rx > 0 to be written into a new text file. I also want the Best_Unit column dropped.

from itertools import islice

import arcpy, os
#fc = arcpy.GetParameter(0)
#chunk_size = arcpy.GetParameter(1) # number of records in each dataset

fc='cb_vhn007_5.txt'
Name = fc[:fc.rfind('.')]
fl = Name+'.txt'

headers_count = 1
chunk_size = 500000

with open(fl) as fin:
  headers = list(islice(fin, headers_count))

  part = 1
  while True:
    line_iter = islice(fin, chunk_size)
    try:
      first_line = line_iter.next()
    except StopIteration:
      break
    with open(Name+'_%d.txt' % part, 'w') as fout:
      for line in headers:
        fout.write(line)
      fout.write(first_line)
      for line in line_iter:
         ## add something here to check if value after third tab
         ## is >0 and if so then write the row or skip.
        fout.write(line) 

    print "Created part %d" % part
    part += 1

New code (the first line of each chunk still includes rows with negative Rx values):

from itertools import islice

import arcpy, os
#fc = arcpy.GetParameter(0)
#chunk_size = arcpy.GetParameter(1) # number of records in each dataset

fc='cb_vhn007_5.txt'
Name = fc[:fc.rfind('.')]
fl = Name+'.txt'

headers_count = 1
chunk_size = 500000

with open(fl) as fin:
  headers = list(islice(fin, headers_count))

  part = 1
  while True:
    line_iter = islice(fin, chunk_size)
    try:
      first_line = line_iter.next()
    except StopIteration:
      break
    with open(Name+'_%d.txt' % part, 'w') as fout:
      for line in headers:
        fout.write(line)
      fout.write(first_line)  # bug: first_line bypasses the Rx check below
      for line in line_iter:
        if line.split()[2][0:1] != '-':
          #print line.split()[2]
          fout.write(line)

    print "Created part %d" % part
    part += 1
GeorgeC
  • You might want to do this with several threads (on a multicore machine), and in a language that doesn't have a GIL. – Dhaivat Pandya Mar 22 '12 at 01:55
  • How? Using what? Since the Python is reading/writing line by line anyway (it takes under 5 minutes to process a 30M+ line dataset), it shouldn't be complicated to get it to check if Rx <= 0 (I just don't know how to). – GeorgeC Mar 22 '12 at 01:58
  • That's not the question; it shouldn't be done like this, because you're not taking advantage of the resources you have. http://golang.org/ is especially interesting for this kind of stuff. Will make the execution time a lot faster. – Dhaivat Pandya Mar 22 '12 at 02:07
  • I have trouble fathoming how your work on as little as a gigabyte of data takes _three days_. I didn't bother looking too deeply at your previous questions (and I've got a sneaking suspicion you're just using SO to write your entire application for you) -- but iterating through only 30M entries of plain text can't be _too_ horrible. Are you certain your application doesn't need a re-design from the ground up? – sarnold Mar 22 '12 at 02:13
  • Thanks for the link... I will have to look into that, but that's a longer-term solution, and really for such large datasets I will have to invest in learning PostGIS/Postgres rather than ArcGIS. For now, can I achieve the same in Python? Maybe using find('\t'), but how do I get to the 2nd tab? – GeorgeC Mar 22 '12 at 02:18

5 Answers

Probably just checking `line[24] != '-'` should be enough, assuming the fixed-width column layout shown in the question.

I.e. replace:

fout.write(line)

with

if line[24] != '-':
  fout.write(line)
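
With the spacing of the sample rows in the question, index 24 lands on the sign position of the Rx field. A quick check (this assumes every row is space-padded to exactly that width, so it would not survive tab-delimited input):

neg = '-16.37617    144.68805  -012.9  7'
pos = '-16.37617    144.69623   015.3  7'
print(neg[24])  # '-'
print(pos[24])  # ' ' (the sign column is blank for positive values)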
Qtax

You can use line.split() to split your line into a list containing the value of each of the 4 columns.

For example:

line='-16.37617\t144.70329\t017.2\t7'
line.split()
# ['-16.37617', '144.70329', '017.2', '7']

Then you can either coerce line.split()[2] (remember Python uses 0-based indexing) to a number and check whether it's > 0:

if float(line.split()[2]) > 0:
    fout.write(line)

Or you could just check that it doesn't have a minus sign in it:

if line.split()[2].find('-') == -1:
    fout.write(line)

If your columns might not be in the same order for each text file, you could split() your header line, find which column is Rx, and use that index instead of the hard-coded 2:

i = headers[0].split().index('Rx')
# now use line.split()[i]
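
Putting those pieces together with the chunking loop from the question, here is a sketch (untested on the full dataset) that reads the Rx column from the header, applies the check to every line including first_line (the issue raised in the comments below), and drops the Best_Unit column; keep() and strip_best_unit() are helper names I've introduced:

from itertools import islice

fc = 'cb_vhn007_5.txt'
Name = fc[:fc.rfind('.')]

headers_count = 1
chunk_size = 500000

def keep(line, rx_col):
    # True when the Rx field parses as a float greater than zero
    try:
        return float(line.split()[rx_col]) > 0
    except (IndexError, ValueError):
        return False  # malformed row: discard it

def strip_best_unit(line):
    # keep only the first three columns, dropping Best_Unit
    return '\t'.join(line.split()[:3]) + '\n'

with open(fc) as fin:
    headers = list(islice(fin, headers_count))
    rx_col = headers[0].split().index('Rx')  # column order read from the header

    part = 1
    while True:
        chunk = list(islice(fin, chunk_size))  # a 500k-line chunk fits in memory
        if not chunk:
            break
        with open(Name + '_%d.txt' % part, 'w') as fout:
            for line in headers:
                fout.write(strip_best_unit(line))
            for line in chunk:
                if keep(line, rx_col):  # every line is checked, including the first
                    fout.write(strip_best_unit(line))
        print "Created part %d" % part
        part += 1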
mathematical.coffee
  • The `index('Rx')` is an especially nice touch. – sarnold Mar 22 '12 at 02:13
  • @mathematical.coffee This looks like a good solution but it just includes all lines of the dataset. I tried the following: check = line.split()[2][0:1] and then if float(check) <= 0: but this fails with an invalid literal – GeorgeC Mar 22 '12 at 05:06
  • I don't know where the `[0:1]` came in, `float(line.split()[2]) > 0` is just fine – mathematical.coffee Mar 22 '12 at 05:14
  • sorry that was my mistake...the issue of the first row in the first chunk containing a negative Rx value record still persists. The last row also has just '-1'. It worries me as other less obvious issues maybe happening. – GeorgeC Mar 22 '12 at 05:24
  • It looks like that's because of your `first_line=line_iter.next()` - you're discarding it before then doing `for line in line_iter` which isn't starting from the beginning, but from the second line (where it's currently up to). – mathematical.coffee Mar 22 '12 at 08:21
  • Thanks. Accepted solution as in the end I could also discard values <10 so other solutions which didn't compare values wouldn't work. – GeorgeC Mar 23 '12 at 09:10
  • @mathematical.coffee you are probably correct about the for loop but I am not sure how to correct the issue. Do you know what I need to change? – GeorgeC Mar 23 '12 at 13:16

I know it's not Python, but it is probably the right tool for the job:

cat cb_vhn007_5.txt | awk '($3 > 0) {print $0}' > parsedfile
vascop
  • thanks. Tried this first (seems to work) but it puts the first row onto the same line as the header. – GeorgeC Mar 22 '12 at 02:55
  • What do you mean? I tried it before answering and everything looked ok. Checked it again too. – vascop Mar 22 '12 at 12:26
  • when I ran the command the output file had the header and first row of data in the top line. – GeorgeC Mar 23 '12 at 09:02

The lines you're looking for are something like this:

if line.split()[2][0:1] != "-":
  fout.write(line)

This splits the input, looks at the third field, checks its first character, and skips writing the line if that character is a -.
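
One caveat, also raised in the comments below: a sign test keeps rows whose Rx is exactly zero, since they carry no leading minus. A quick illustration with a made-up zero-Rx row:

line = '-16.37617    144.70216   000.0  7'  # hypothetical row with Rx == 0
print(line.split()[2][0:1] != '-')  # True, so this row would still be written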

sarnold
  • works fine on first look but then the first line of some of the chunks have rows with negative Rx values included. Any ideas? Code in question updated. – GeorgeC Mar 22 '12 at 03:11
  • It'd depend upon the data in question. Note that my answer also fails [VascoP's `==0` observation](http://stackoverflow.com/questions/9815414/remove-rows-which-have-a-specific-attribute-less-than-or-equal-to-0#comment12503924_9815492). – sarnold Mar 22 '12 at 03:15
  • yup - @mathematical.coffee's answer addresses this (in theory), but when I run it all values are included. It seems to be treating the output as text and not float... – GeorgeC Mar 22 '12 at 04:59

Here is a Python script that reads a file containing data formatted as four whitespace-separated fields per row, inspects the third field, and outputs any line whose third field is a positive floating point number.

Tested working on Python 2.7.2.

import re

in_fh = open("gis.txt", "r")
out_fh = open("outfile.txt", "w")

for row in in_fh:
    row = re.sub(r'[ \t]+', ',', row)  # collapse runs of spaces or tabs to commas
    try:
        latitude, longitude, rx, best_unit = row.split(',')
    except ValueError:  # row didn't have four fields
        print("complain - not four fields")
        continue

    try:
        float_rx = float(rx)
    except ValueError:  # rx could not be cast to float
        print("complain - third field not float")
        continue

    if float_rx > 0:
        out_fh.write(latitude + "," + longitude + "," + rx + "\n")  # Best_Unit dropped
    else:
        pass  # discard the row

in_fh.close()
out_fh.close()

Only one row is processed at a time, so memory usage should be constant regardless of the size of the input and output files.

Alternatively, have you considered using a database? sqlite3 is built in and would probably handle 1 GB of data just fine. Then you could get this result by doing SELECT * FROM data WHERE rx > 0.
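
A minimal sketch of that approach (untested; gis.db and the data table name are illustrative, and the input is the same gis.txt as above):

import sqlite3

conn = sqlite3.connect('gis.db')
conn.execute('CREATE TABLE IF NOT EXISTS data '
             '(latitude REAL, longitude REAL, rx REAL, best_unit INTEGER)')

with open('gis.txt') as f:
    f.readline()  # skip the header row
    rows = (line.split() for line in f if len(line.split()) == 4)
    conn.executemany('INSERT INTO data VALUES (?, ?, ?, ?)', rows)
conn.commit()

# rows with Rx <= 0 are simply never selected
for latitude, longitude, rx in conn.execute(
        'SELECT latitude, longitude, rx FROM data WHERE rx > 0'):
    print("%s,%s,%s" % (latitude, longitude, rx))

conn.close()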

Li-aung Yip
  • thanks. Yes, it would be best to do this in a DB (Postgres/PostGIS), but I don't have the time to learn it now. – GeorgeC Mar 22 '12 at 04:54