0

I have a huge CSV file with 10000 rows and 500 columns. I want to extract the data from header to the row which contains device_boot. I want to eliminate all rows after device_boot.

Example:

Name,Time,status,..
start,05:06:2018 10:10:23,good,..
start,05:06:2018 10:11:23,good,..
failure,05:06:2018 11:10:25,critical,..
device_boot,05:06:2018 13:11:25,reboot,..
start,05:06:2018 13:13:23,good,..
start,05:06:2018 13:16:23,good,..

So, I need to maintain in CSV file up to device_boot line (row) using pandas. I am able to remove the specific row on that keyword, but not able to extract up to that portion using pd.drop(...).

Thanks for the suggestions.

U13-Forward
  • 69,221
  • 14
  • 89
  • 114
sundarr
  • 385
  • 2
  • 8
  • 3
    Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. – jezrael Dec 12 '18 at 11:46
  • why dont you record the index number, that would be things easier i think – iamklaus Dec 12 '18 at 11:52
  • Possible duplicate of [How can I filter lines on load in Pandas read\_csv function?](https://stackoverflow.com/questions/13651117/how-can-i-filter-lines-on-load-in-pandas-read-csv-function) – erncyp Dec 12 '18 at 12:03
  • In pandas you can use ```nrows```. Something like: ```data = pd.read_csv("filename.csv", nrows=1312)```. Just find the row count, i would have used simple bash commands like ```wc`` or a simple for loop.. – Aaron_ab Dec 12 '18 at 12:03
  • Yes,i know the row number and extract the rows upto keyword, but sometime not able to know the row number due to huge file. For getting rows based on index, I used, df = pd.read_csv(/home/ubuntu1/output.csv',low_memory=False, index_col=False, dtype='unicode8', nrows=590) – sundarr Dec 12 '18 at 13:11

2 Answers2

1

Use:

print(df.loc[:df['Name'].gt('device_boot').idxmin()+1,:])

Output will be expected output.

Update:

print(df.loc[:df.index[df['Name']=='device_boot'].tolist()[-1],:])

And that contains the 'device_boot' row, if want to remove it:

print(df.loc[:df.index[df['Name']=='device_boot'].tolist()[-1]-1,:])
U13-Forward
  • 69,221
  • 14
  • 89
  • 114
  • 1
    yes,your answer do the filtering of the specific row only, but what my expectation is "how to extract rows upto device_boot word in the firstrow". In some csv files, device_boot will appear more than one time also, I need to extract all rows in the csv file upto last device_boot appear in first column. – sundarr Dec 12 '18 at 13:07
0

I findout the index of the keyword,like

val = df.loc[df['name']=='device_boot'].index
print val

Then, use that row index and retrive only until that variable,

rowretrive_index = val1+50  // any extra rows can be added here.
print rowretrive_index

df1 = df.iloc[1:rowretrive_index]
df1.to_csv('/out.csv',',',dtype='unicode8')

Hope it will useful. Thanks, Sundar

sundarr
  • 385
  • 2
  • 8