
I want to select only the rows that have fc_id == 2, and then remove any duplicate rows. The input file is test.csv; the original post showed it as an image.

I am stuck on the first step. After that, I also need an output file containing the final data: only rows with fc_id == 2 and no duplicates.

I tried this:

df = pd.read_csv(r'test.csv')
df2 = df[df["fc_id"]==2]

def condi(df2):
    df3[x] = np.where(df(df2)==2, 1, 0)
    return x
var = condi(df2)
#print(var)

with open('test.csv', 'r') as in_file, open('out_test.csv', 'w') as out_file:
    seen = set()
    if var == 1:
         for line in in_file:
            if line in seen: continue

            seen.add(line)
            out_file.write(line)

I am getting an error and when I tried to print(var) it said "'DataFrame' object is not callable".

Fanatic

2 Answers


Like this:

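import pandas as pd

df = pd.read_csv(r'test.csv')
# Keep rows with fc_id == 2, then drop exact duplicate rows (chained to avoid SettingWithCopyWarning)
df2 = df[df['fc_id'] == 2].drop_duplicates()
df2.to_csv('out_test.csv', index=False)  # write the result without the index column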
gtomer

To select the rows of a DataFrame where a column equals a given value, index it with a boolean condition: df = df[df['column_name'] == some_value]

In your case:

df = df[df["fc_id"]==2]
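As a rough illustration of why the square-bracket form works while the `df(df2)` call in the question fails, here is a small sketch on made-up data. Only the `fc_id` column name comes from the question; the `value` column and its contents are assumptions for the example.

```python
import pandas as pd

# Hypothetical sample data; only the fc_id column name comes from the question.
df = pd.DataFrame({"fc_id": [1, 2, 2, 3], "value": ["a", "b", "b", "c"]})

# Comparing a column to a scalar yields a boolean Series (the mask).
mask = df["fc_id"] == 2

# Indexing with square brackets and the mask keeps only the rows marked True.
filtered = df[mask]

# Parentheses try to call the DataFrame like a function, which raises TypeError.
try:
    df(mask)
except TypeError as exc:
    error_message = str(exc)
```

The captured message is the same "'DataFrame' object is not callable" text reported in the question, which suggests the fix there is simply to use brackets instead of parentheses.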

For removing duplicates, you can then use

result_df = df.drop_duplicates(keep='first')
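Putting both steps together with the output file the question asks for, a minimal sketch might look like the following. The CSV contents are invented stand-ins for test.csv (the real file is not shown), and in-memory buffers are used so the example runs without any files on disk.

```python
import io
import pandas as pd

# Hypothetical stand-in for test.csv; the real columns other than fc_id are unknown.
csv_text = "fc_id,value\n1,a\n2,b\n2,b\n2,c\n3,d\n"
df = pd.read_csv(io.StringIO(csv_text))

# Step 1: keep only rows where fc_id equals 2.
# Step 2: drop rows that are exact duplicates, keeping the first occurrence.
result_df = df[df["fc_id"] == 2].drop_duplicates(keep="first")

# Step 3: write the result as CSV; index=False leaves out the row index column.
out_buffer = io.StringIO()
result_df.to_csv(out_buffer, index=False)
output_text = out_buffer.getvalue()
```

To write a real file, pass a path such as 'out_test.csv' (the name used in the question) to to_csv instead of the buffer. If rows should count as duplicates based on only some columns, drop_duplicates also accepts a subset argument listing those columns.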