0

I am trying to parse through a section of a huge dataset. The portion of the dataset that I have is a 3GB gzip file. The file is structured so it has x columns and millions of rows. The columns are separated by commas or some sort of common operator so I can read the file.

What I want to do is based on 2 ranges (i.e. value a < col1 < value b, value c < col2 < value d), check the values of two columns for each row of the dataset. If both of the values are within the range of the dataset, move the entire row of data to a new file(? not sure exactly what to store it in) and then return that new subset.

What I am missing is a fundamental understanding of how to handle iteration like this. I am struggling with what to do with the set after I have used the pandas read_csv function in order to filter the dataset. I think I should be using data frames to access the data I am looking for but I am not sure.

rye_bread
  • 95
  • 1
  • 3
  • 11
  • can you show a dummy dataset demostrating your requirements with expected output?, [this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) might help – anky Feb 16 '19 at 06:39
  • "x", "b", "c", "d", "12", "30", "d" is an example row and every row will have different values within the " ". Basically I want to check every row in the set. For example if the value of the 5th col is in between 10 and 15 and the value of the 6th column in between 40 and 50, accept the entire row and move it into a new file/data set. So the final output would look the same as the original data set but only have rows that passed the 2 parameter ranges that were established. – rye_bread Feb 16 '19 at 06:41
  • Example Dataset: `"X”, “a”, “1”, “15”` `“Y”, “a”, “12”, “44”` `“Z”, “a”, “13”, “39”` `“W”, “a”, “11”, “47”` Filter: if the third column value is between 10 and 15, AND the 4th column value is between 40 and 50, then accept the entire row. Example Output: `“Y”, “a”, “12”, “44”` `“W”, “a”, “11”, “47”` – rye_bread Feb 16 '19 at 06:55
  • that doesnot help either. you should really go through the link in my comment above. and [this](https://stackoverflow.com/help/mcve) – anky Feb 16 '19 at 06:57
  • I do not understand what else would be needed to conceptually answer the question. I am not really looking for a code answer. There is no error in the code because there is no code. I am just looking for a conceptual answer on how to filter a large data set based on parameter values. If I am asking in the wrong forum please let me know. – rye_bread Feb 16 '19 at 07:05

2 Answers2

2

IIUC, Creating a minimal reproducible example:

from io import StringIO 
data = StringIO("""X, a, 1, 15
Y, a, 12, 44
Z, a, 13, 39
W, a, 11, 47""")
df=pd.read_csv(data,header=None)
df.columns=['A','B','C','D']
print(df)

  A   B   C   D
0  X   a   1  15
1  Y   a  12  44
2  Z   a  13  39
3  W   a  11  47

Now for filters using series.between() and df.loc[]:

cond1 = df['C'].between(10,15) #check if C betwen 10,15
cond2 = df['D'].between(40,50) #check if D between 40 and 50
df_new=df.loc[cond1&cond2] #pull data where both condition matches and place in new_df
print(df_new)

   A   B   C   D
1  Y   a  12  44
3  W   a  11  47

If comfortable with sql queries(just for experimentation) :) :

import pandasql as ps
ps.sqldf('select * from df where df.C between 10 and 15 and df.D between 40 and 50')

   A   B   C   D
0  Y   a  12  44
1  W   a  11  47
anky
  • 74,114
  • 11
  • 41
  • 70
  • 1
    Ah okay, so the best way to move through a large data set and filter it would to be to set up the data as a df in python and then establish conditions to filter it. I also now understand what you meant by creating a proper example. Sorry about that. – rye_bread Feb 16 '19 at 07:42
  • @KunalRai yes pandas makes it easier to manipulate data. :) also you ca use `read_csv` to read the file and store it in a dataframe post that things will be easier. No problem, those links will be a future reference for you. ;) – anky Feb 16 '19 at 07:42
1

Solution with DataFrame.query for filtering:

#anky_91 dataset
df1 = df.query('10 < C < 15 & 40 < D < 50 ')
print (df1)
   A  B   C   D
1  Y  a  12  44
3  W  a  11  47
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Can lists be used inside a query? If I wanted to use an indexed value of a list as a parameter for the column of the df could I use that? – rye_bread Feb 17 '19 at 06:43