I loaded the data from Excel (a CSV file). It is about 300,000 rows × 1 column, and I plotted it with db.plot(). It is time-series data.

I am trying to drop the values that are higher than 0.006. After that, I want to compare each value with the one next to it, step by step, and if the difference is bigger than 0.00001, I want to drop those values as well.

Then I will be left only with data whose differences are very small (almost 0, a flat slope).

I am a complete beginner in Python and I tried my best, but I don't know what is wrong with my code:

import pandas as pd

excel_df = pd.read_csv('data.csv', header=None)

excel_df.plot()

bool_idx = excel_df < 0.006

valid_data = excel_df[bool_idx]

true_data = valid_data.dropna()

# print(true_data)
# print(valid_data)

ax1 = valid_data.plot()

ax1.set_ylim(-0.005, 0.045)

ax1.plot()

print(true_data)

al2 = true_data.diff()

# print(al2)

number = 0

for true_data in ture data:

    number = number + 1

    if true_data.diff() < 0.00001:

        true_data.drop()

print(true_data)
recnac
ddbae

1 Answer

Try running this on your dataset.

#!/usr/bin/env python3
# coding: utf-8

import pandas as pd

excel_df = pd.read_csv('data.csv', header=None)

x=excel_df.plot()
# x

bool_idx = excel_df < 0.006
# bool_idx

valid_data = excel_df[bool_idx]
# valid_data

true_data = valid_data.dropna()
# true_data

ax1 = valid_data.plot()

ax1.set_ylim(-0.005, 0.045)
# ax1


al2 = true_data.diff()
# al2

number = 0

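# Column 0 holds the values because the file was read with header=None.
diffs = true_data.diff()[0]

# Walk the differences and their row labels in parallel.
# Note: this mirrors the comparison in the question's code; use
# abs(true_data_diff_val) > 0.00001 to drop the large jumps instead.
for (true_data_diff_val, rid) in zip(diffs, diffs.index):
    if true_data_diff_val < 0.00001 and rid != 0:
        # drop() returns a new DataFrame, so assign the result back.
        # The keyword form is used because newer pandas versions reject
        # a positional axis argument.
        true_data = true_data.drop(index=rid)
        print(rid)
print(true_data)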

As I understand it, you want the row ID of each value that satisfies the if condition inside the loop, so that you can drop that row from the DataFrame. The simplest method I know of uses the zip function to iterate over the values and their row IDs in parallel.

Also, drop() returns a new DataFrame instead of changing the original, so you need to assign the result back in order to see the change!
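A minimal sketch of that point, using a small made-up DataFrame rather than your data.csv: calling drop() without keeping the result leaves the original untouched, while assigning it back removes the row.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real data.
df = pd.DataFrame({0: [0.001, 0.002, 0.003]})

df.drop(index=1)           # result is discarded, df still has 3 rows
unchanged_len = len(df)

df = df.drop(index=1)      # result is assigned back, row 1 is gone
changed_len = len(df)

print(unchanged_len, changed_len)
```

Passing inplace=True to drop() is another option, but reassigning is usually easier to follow.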

I check for rid != 0 because diff() returns NaN as the first element; you can substitute any condition that suits your data.
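To illustrate the NaN behaviour, here is a small sketch with made-up values (the Series below is hypothetical, not taken from your file). It shows that diff() leaves NaN in the first position and that comparing NaN against a threshold gives False, so the first row would not pass the condition even without the extra check.

```python
import pandas as pd

# Hypothetical sample values.
s = pd.Series([0.001, 0.001, 0.005])

d = s.diff()                      # first element has no predecessor
first_is_nan = pd.isna(d.iloc[0])

# Any comparison with NaN is False, so position 0 is never selected.
mask = d < 0.00001

print(d)
print(mask)
```

Keep in mind that after the 0.006 filter and dropna(), the first remaining row may no longer have the label 0, so a label-based check such as rid != 0 may not always refer to the first row.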

icy121
  • Thank you for the answer! :) I have another question, though. When I add the condition and run the program, it takes quite a long time. So I tried to use a lambda with the pandas apply function to reduce the running time. I think that if I could express the condition (drop and diff()) as an operation on the DataFrame, the program would run much faster. Is that correct? I tried reading https://stackoverflow.com/questions/37428218/how-to-properly-apply-a-lambda-function-into-a-pandas-data-frame-column but it didn't help. – ddbae Apr 11 '19 at 02:50
  • I'm not sure whether the slowness comes from the size of the dataset or from the condition, since it finishes within 1 or 2 seconds for 1,000 entries in data.csv. Either way, I'd rather use a def than a lambda, for simpler syntax and to keep the code readable in the future. https://stackoverflow.com/questions/134626/which-is-more-preferable-to-use-in-python-lambda-functions-or-nested-functions – icy121 Apr 11 '19 at 05:50
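On the speed question, one possible direction is to avoid the row-by-row loop and use boolean masks instead. The sketch below is only an illustration under a few assumptions: the function name keep_flat_values and the sample values are made up, the data is treated as a single Series (for example excel_df[0]), and the comparison follows the goal described in the question (keep only small steps), which is the opposite of the comparison in the loop above.

```python
import pandas as pd


def keep_flat_values(values, max_value=0.006, max_step=0.00001):
    """Hypothetical helper: keep values below max_value whose absolute
    difference to the previous remaining value is at most max_step."""
    below = values[values < max_value]   # step 1: drop the high values
    step = below.diff().abs()            # step 2: size of each jump
    # NaN <= max_step is False, so the first remaining row is dropped too.
    return below[step <= max_step]


# Made-up sample standing in for excel_df[0].
sample = pd.Series([0.001, 0.001, 0.0010001, 0.004, 0.009, 0.004, 0.004])
flat = keep_flat_values(sample)
print(flat)
```

Like the loop, this computes the differences once and filters in a single pass; it does not recompute them after rows are removed. Whether it is fast enough for about 300,000 rows would need to be measured on the real file.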