0

I have a large text file (about 44MB) and I'm trying to read and filter it without bankrupting my computer's memory. The text file is comma seperated, with each column representing different weather variables and then each row representing their values over time. I want to select only the rows in which the wind direction lies between 10 and 100 degrees and then to write these rows to a new file (txt file or csv would be fine). I am a complete python noob, so any help would be greatly appreciated!

A sample of the data

hd,Station Number,Year Month Day Hours Minutes in YYYY,MM,DD,HH24,MI format in Local time,Year Month Day Hours Minutes in YYYY,MM,DD,HH24,MI format in Local standard time,Year Month Day Hours Minutes in YYYY,MM,DD,HH24,MI format in Universal coordinated time,Precipitation since last (AWS) observation in mm,Quality of precipitation since last (AWS) observation value,Air Temperature in degrees Celsius,Quality of air temperature,Air temperature (1-minute maximum) in degrees Celsius,Quality of air temperature (1-minute maximum),Air temperature (1-minute minimum) in degrees Celsius,Quality of air temperature (1-minute minimum),Wet bulb temperature in degrees Celsius,Quality of Wet bulb temperature,Wet bulb temperature (1 minute maximum) in degrees Celsius,Quality of wet bulb temperature (1 minute maximum),Wet bulb temperature (1 minute minimum) in degrees Celsius,Quality of wet bulb temperature (1 minute minimum),Dew point temperature in degrees Celsius,Quality of dew point temperature,Dew point temperature (1-minute maximum) in degrees Celsius,Quality of Dew point Temperature (1-minute maximum),Dew point temperature (1 minute minimum) in degrees Celsius,Quality of Dew point Temperature (1 minute minimum),Relative humidity in percentage %,Quality of relative humidity,Relative humidity (1 minute maximum) in percentage %,Quality of relative humidity (1 minute maximum),Relative humidity (1 minute minimum) in percentage %,Quality of Relative humidity (1 minute minimum),Wind (1 minute) speed in km/h,Wind (1 minute) speed quality,Minimum wind speed (over 1 minute) in km/h,Minimum wind speed (over 1 minute) quality,Wind (1 minute) direction in degrees true,Wind (1 minute) direction quality,Standard deviation of wind (1 minute),Standard deviation of wind (1 minute) direction quality,Maximum wind gust (over 1 minute) in km/h,Maximum wind gust (over 1 minute) quality,Visibility (automatic - one minute data) in km,Quality of visibility (automatic - one minute data),Mean sea level pressure in hPa,Quality of mean sea level pressure,Station level pressure in hPa,Quality of station level pressure,QNH pressure in hPa,Quality of QNH pressure,#
hd, 40842,2000,03,20,10,50,2000,03,20,10,50,2000,03,20,00,50,      ,N, 25.7,N, 25.7,N, 25.6,N, 21.5,N, 21.5,N, 21.4,N, 19.2,N, 19.2,N, 19.0,N, 67,N, 68,N, 66,N, 13,N,  9,N,100,N,  4,N, 15,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,51,2000,03,20,10,51,2000,03,20,00,51,   0.0,N, 25.6,N, 25.8,N, 25.6,N, 21.5,N, 21.6,N, 21.5,N, 19.2,N, 19.4,N, 19.2,N, 68,N, 68,N, 66,N, 11,N,  9,N,107,N, 11,N, 13,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,52,2000,03,20,10,52,2000,03,20,00,52,   0.0,N, 25.8,N, 25.8,N, 25.6,N, 21.7,N, 21.7,N, 21.5,N, 19.5,N, 19.5,N, 19.2,N, 68,N, 69,N, 66,N, 11,N,  9,N, 83,N, 13,N, 13,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#

this is my best code, which keeps produce a NULL BYTE error message

import csv

with open("X:\hd_040842_2000_03.txt") as input, open("X:\wanted.txt", "w") as wanted, open("X:\unwanted.txt", "w") as unwanted:
   reader = csv.DictReader(input, delimiter=",")
   fieldnames = reader.fieldnames
   writer_wanted = csv.DictWriter(college, fieldnames, delimiter=",")
   writer_unwanted = csv.DictWriter(adult, fieldnames, delimiter=",")
   writer_wanted.writeheader()
   writer_unwanted.writeheader()
   for row in reader:
       if float(row["Wind (1 minute) direction in degrees true"]) > 10.0 and float(row["Wind (1 minute) direction in degrees true"]) < 110.0:
          writer_college.writerow(row)
       else:
          writer_adult.writerow(row)

So you know, I adapted this code as best I could from this post Use Python to select rows with a particular range of values in one column

Community
  • 1
  • 1
Joss Kirk
  • 1,873
  • 3
  • 11
  • 15
  • 3
    Please post a segment of your text file and code that you have tried! – Jason Mar 22 '16 at 03:36
  • I see you are also a stackoverflow noob. Those are both okay, though; noobs are welcome here! But if you want to get a good answer, you need more detail in your question. As Signal suggested, you should at least try to create some code. Also, we need to know what your text file consists of. Is there a header? Are some of the columns strings, or are they all numeric? Can you not fit 44MB into memory, so that you need to go line-by-line? These details will have huge effects on any answer. – Mike Mar 22 '16 at 03:44
  • 1
    Its pretty straight forward. Find examples for reading and writing csvs. There should be a for loop like `for row in reader:` (avoid the bad ones that do `readlines`). As you read rows, you check the condition for the row item you want and skip the bad ones. – tdelaney Mar 22 '16 at 03:45
  • Guilty of being a stackoverflow noob, I've updated my question to include the elements you suggested – Joss Kirk Mar 22 '16 at 07:51
  • Try adding `skipinitialspace=True` as a second parameter to your reader. Many of your cells have leading spaces which might confuse the conversion to `float`. – Martin Evans Mar 22 '16 at 08:00

2 Answers2

1

When working with csv files (in Python 2.x) you need to open the file using rb mode. Secondly, as your data contains a lot of leading spaces in the cells, I suggest adding the skipinitialspace=True parameter. Lastly, your posted code still had some left over changes from where you had copied it from, the following appears to work for the data you have provided:

import csv

with open(r"X:\hd_040842_2000_03.txt", "rb") as input, open(r"X:\wanted.txt", "wb") as wanted, open(r"X:\unwanted.txt", "wb") as unwanted:
   reader = csv.DictReader(input, delimiter=",", skipinitialspace=True)
   fieldnames = reader.fieldnames
   writer_wanted = csv.DictWriter(wanted, fieldnames, delimiter=",")
   writer_unwanted = csv.DictWriter(unwanted, fieldnames, delimiter=",")
   writer_wanted.writeheader()
   writer_unwanted.writeheader()

   for row in reader:
       if 10.0 <= float(row["Wind (1 minute) direction in degrees true"]) <= 110.0:
          writer_wanted.writerow(row)
       else:
          writer_unwanted.writerow(row)   

Note, if you are still getting problems, then add some diagnostics:

import csv

with open(r"X:\hd_040842_2000_03.txt", "rb") as input, open(r"X:\wanted.txt", "wb") as wanted, open(r"X:\unwanted.txt", "wb") as unwanted:
    reader = csv.DictReader(input, delimiter=",", skipinitialspace=True)
    fieldnames = reader.fieldnames
    writer_wanted = csv.DictWriter(wanted, fieldnames, delimiter=",")
    writer_unwanted = csv.DictWriter(unwanted, fieldnames, delimiter=",")
    writer_wanted.writeheader()
    writer_unwanted.writeheader()

    for line_number, row in enumerate(reader, start=1):
        try:
            if 10.0 <= float(row["Wind (1 minute) direction in degrees true"]) <= 110.0:
                writer_wanted.writerow(row)
            else:
                writer_unwanted.writerow(row)
        except:
            print "Failed to parse line", line_number
            print row

This will display the line number and row of any rows that cause a problem so you can investigate where the problem is.

Martin Evans
  • 45,791
  • 17
  • 81
  • 97
  • great! I'll check this out when I get back to my computer! Thanks for answering such a vague question. I'm a stackoverflow noob, but I've since updated my question to be more specific. Thanks Martin! – Joss Kirk Mar 22 '16 at 07:53
  • I got it to work! Thanks so much for your help! there were a few exceptions, but for the life of me I can't figure out what is wrong with those lines. there are few enough exceptions that I can just ignore those lines in the meantime, shouldn't have a noticeable effect on the analysis. – Joss Kirk Mar 23 '16 at 01:57
  • Glad it is working. I forgot to mention, the reported line number was counting from 0. I've changed it to count from 1. If you are happy with the answer, don't forget to click on the grey tick next to it to accept it as a solution. – Martin Evans Mar 23 '16 at 08:24
0

Any time you're doing more than just minimal data analysis, you'll want to use a package like numpy or pandas. These are the main packages everyone uses in python for data analysis; it quickly gets too complex to code all these things on your own. In general, numpy is easier to use but more strictly number-focused. On the other hand, pandas is much better when your data includes strings. I'll show both ways to do it.

Pandas

First you just read in the file. Then you find the rows where your min and max conditions are satisfied. You combine those conditions with &, and use that to actually select the rows you want to look at. Finally, you can write the data back to file. Of course, you could also plot it, or whatever.

import pandas as pd
data = pd.read_csv('weather_in.csv')
min_indices = (data['Wind (1 minute) direction in degrees true'] >= 10)
max_indices = (data['Wind (1 minute) direction in degrees true'] <= 100)
data = data[min_indices & max_indices]
data.to_csv('weather_out.csv')

Numpy

[The answer has been updated to show a sample of the CSV, which does contain text. In that case, pandas is the way to go instead of numpy. I'll just leave this to show what to do for cases where numpy is easier to use.]

Here's one easy way to do it, assuming your CSV contains only numerical values, and you can load 44MB into memory...

import numpy as np

wind_direction_column = 47
data = np.loadtxt('weather_in.csv', delimiter=',')
min_indices = (data[wind_direction_column] >= 10)
max_indices = (data[wind_direction_column] <= 100)
data = data[min_indices & max_indices]
np.savetxt('weather_out.csv', data, delimiter=',')

For both numpy and pandas, the greater-than/less-than lines could actually be combined into one, and the selection line placed into the (last) output line, but I think it's easier to understand when you put them each on their own lines.

Mike
  • 19,114
  • 12
  • 59
  • 91
  • Thanks Mike, the help is much appreciated. I'll try and implement this today. – Joss Kirk Mar 23 '16 at 00:48
  • A quick question, does the pd.read_csv command only work for .csv files? or can it also be used for .txt files? – Joss Kirk Mar 23 '16 at 00:53
  • You can see in [the documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) that it assumes commas, but there are lots of options. In particular, you can tell it what separator to use with the `sep` argument. And if you pass `sep=None`, it will try to automatically figure it out. – Mike Mar 23 '16 at 03:10
  • ...and it's usually very good at figuring these things out, unless you have really strange separators. – Mike Mar 23 '16 at 03:16