0

I am new to Pandas and was wondering how to delete a specific row using the row id. Currently, I have a CSV file that contains data about different students. I do not have any headers in my CSV file.

data.csv:

John    21 34 87 ........ #more than 100 columns of data
Abigail 18 45 53 ........ #more than 100 columns of data
Norton  19 45 12 ........ #more than 100 columns of data

data.py:

I have a list that has a record of some names.

names = ['Jonathan', 'Abigail', 'Cassandra', 'Ezekiel']

I opened my CSV file in Python and used list comprehension in order to read all the names in the first column and store them in a list with a variable 'student_list' assigned.

Now, for all elements in the student_list, if the element is not seen in the 'names' list, I want to delete that element in my CSV file. In this example, I want to delete John and Norton since they do not appear in the names list. How can I achieve this using pandas? Or, is there a better alternative out there than compared to using pandas for this problem?

I have tried the following code below:

csv_filename = data.csv
    with open(csv_filename, 'r') as readfile:
        reader = csv.reader(readfile, delimiter=',') 
        student_list = [row[0] for row in reader]  #returns John, Abigail and Norton.

        for student in student_list:
        if student not in names:
            id = student_list.index(student) #grab the index of the student in student list who's not found in the names list.

            #using pandas
            df = pd.read_csv(csv_filename) #read data.csv file
            df.drop(df.index[id], in_place = True) #delete the row id for the student who does not exist in names list.
            df.to_csv(csv_filename, index = False, sep=',')  #close the csv file with no index
        else:
            print("Student name found in names list")

I am not able to delete the data properly. Can anybody explain?

retro_coder
  • 367
  • 1
  • 9
  • 25
  • Okay, I'l give it a shot. – retro_coder Nov 13 '19 at 06:27
  • @jezrael I got a list of all my data inputted in the csv file. – retro_coder Nov 13 '19 at 06:36
  • So what is `print (df.index.tolist()[:5])` ? there is `3` value? – jezrael Nov 13 '19 at 06:49
  • @jezrael Just asking, when using pandas, does the index start at 0 or 1? Yes, I feel there is some index issue. – retro_coder Nov 13 '19 at 06:51
  • It start from `0` like in python. – jezrael Nov 13 '19 at 06:52
  • Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. – jezrael Nov 13 '19 at 08:20
  • @jezrael I re-edited the question once more. It would be great if you can take a look at it. Thank you. – retro_coder Nov 13 '19 at 09:13
  • Yop, so expected output is remove `John` row, because not match in `names = ['Jonathan', 'Abigail', 'Norton', 'Cassandra', 'Ezekiel']` ? – jezrael Nov 13 '19 at 09:16
  • Yes, I want to remove Norton as well. I made changes again to my names list. John and Norton should be removed. – retro_coder Nov 13 '19 at 09:18

1 Answers1

0

You can just use a filter to filter out the ids you don't want.

Example:

import pandas as pd
from io import StringIO

data = """
1,John
2,Beckey
3,Timothy
"""

df = pd.read_csv(StringIO(data), sep=',', header=None, names=['id', 'name'])


unwanted_ids = [3]

new_df = df[~df.id.isin(unwanted_ids)]

You could also use a filter and get the indices to drop the columns in the original dataframe. Example:

df.drop(df[df.id.isin([3])].index, inplace=True)

Update for updated question:

df = pd.read_csv(csv_filename, sep='\t', header=None, names=['name', 'age'])
# keep only names wanted and reset index starting from 0
# drop=True makes sure to drop old index and not add it as column
df = df[df.name.isin(names)].reset_index(drop=True)
# if you really want index starting from 1 you can use this
df.index = df.index + 1
df.to_csv(csv_filename, index = False, sep=',')
mjspier
  • 6,386
  • 5
  • 33
  • 43
  • can data be replaced with my CSV file? – retro_coder Nov 13 '19 at 09:28
  • I changed the format to comma separated as this is better for copy pasting from SO. But indeed you can just read your tab separated file like: `df = pd.read_csv('drop_columns.csv', sep='\t', header=None, names=['id', 'name', 'age'])` – mjspier Nov 13 '19 at 09:44
  • Is it possible to read and make changes to the same CSV file? Why did you use 'drop_columns.csv' here? – retro_coder Nov 13 '19 at 10:18
  • My mistake, fixed it. Yes you can read csv_filename and then overwrite the same file. – mjspier Nov 13 '19 at 11:51
  • Which method should I use? In your opinion, should I use the drop or filter option? I don't want blank rows in my CSV file after deletion and the row id should be ordered without any gaps i.e 1,2,3,4 instead of 1,4,5,6,9 etc... – retro_coder Nov 14 '19 at 04:38
  • I would use the filter in this case. Drop does not add any value. I updated the solution which resets the index. For future questions I recommend to be more specific when you ask the question. Also adding an expected output example can help. The question changed quite a bit from the original question and additional questions in comments make it difficult to follow. And keep the question about one topic/problem. Please have a look here: https://stackoverflow.com/help/how-to-ask – mjspier Nov 14 '19 at 09:09