
I am trying to write data to a CSV file from a DataFrame (`df`), but only if the Unique ID of a given `df` row does not already exist in the CSV file. In other words, a row should only be written to the CSV if its Unique ID is NOT already there.

So far what I have is this, which writes the data into the CSV perfectly fine (I have only included the relevant parts of the code):

    # Flatten the nested column lists and de-duplicate while preserving order
    columns = [item for sublist in columns for item in sublist]
    columns = list(dict.fromkeys(columns))

    df = pandas.DataFrame(merged)
    # Drop the columns I don't need in the output
    df.drop(df.columns[[0, 1, 20, 21, 22, 23]], axis=1, inplace=True)

    # Append to the CSV without rewriting the header row
    df.to_csv("MyData.csv", mode='a', header=False, index=False)

Now, to compare the relevant column of the pandas `df` to the corresponding CSV column, I have done this:

    col_list = ["Unique ID"]
    df2 = pandas.read_csv("MyData.csv", usecols=col_list)
    #print(df2["Unique ID"])

    # Brute-force comparison: check every df ID against every CSV ID
    for csvdata in df2["Unique ID"]:
        for dfdata in df.iloc[:, 17]:
            if dfdata == csvdata:
                print("match")

This only works as far as telling me when there is a match, i.e. when an ID in the Unique ID column of the `df` already exists in the destination CSV file. What I need is a condition that sends only those rows from the `df` where the identifier in the Unique ID column (index 17) does not already exist in the CSV file.
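
To make the goal concrete, here is a rough sketch of the behaviour I'm after, building on the code above (untested; `existing_ids` and `mask` are just illustrative names):

    # Collect the IDs already written to the CSV
    existing_ids = set(df2["Unique ID"])

    # Keep a df row only when its Unique ID is absent from the CSV
    mask = [uid not in existing_ids for uid in df.iloc[:, 17]]
    df[mask].to_csv("MyData.csv", mode='a', header=False, index=False)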

Any help would be appreciated.

Best Regards,

GBM
  • Filter `df` to the rows that aren't in `df2`. Then append those filtered rows to the CSV. – Barmar Jan 21 '22 at 17:38
  • Yes, to what @Barmar says, and also just make a list out of column 17 using Pandas' `.to_list()` method and then use Python's `set()` on that list to get a collection of all the unique IDs so you can compare with that more easily for your filtering. (Using `set()` on a list with Python is an idiomatic way to make a collection of unique elements.) In other words, you don't need to iterate and do it brute force, as this can be done more efficiently using Pandas methods. Some help selecting rows is available [here](https://chrisalbon.com/python/data_wrangling/pandas_selecting_rows_on_conditions/). – Wayne Jan 21 '22 at 17:42
  • @Barmar thank you for the very prompt response. Could you please just expand a little on the filtering part; would I do that in place of the `print("match")`? Is what I have done in comparing the columns okay, or does it need a bit more work? – GBM Jan 21 '22 at 18:07
  • @Wayne thank you as well for the quick reply - please could you see my additional question in the reply above to Barmar... – GBM Jan 21 '22 at 18:08
  • See https://stackoverflow.com/questions/48647534/python-pandas-find-difference-between-two-data-frames – Barmar Jan 21 '22 at 18:11
  • What you are doing may get the job done; however, it won't scale well and it isn't taking advantage of Pandas dataframe methods. You seldom need to iterate by row in a Pandas dataframe, and if you do, you'd use `df.iterrows()` or, preferably, `df.itertuples()`. Usually you can use `.apply()` without iterating to do complex stuff per row. **None of which I think is necessary here, because there's an easy way to filter to a set of rows based on contents (or absence) among another list.** If you had provided an MRE already, I could illustrate. Otherwise I have to go round stuff up myself. – Wayne Jan 21 '22 at 18:25
  • See https://stackoverflow.com/a/43399866/8508004 where that example "will drop all rows containing elements of your list". In your case you'd make a list from the identifiers you already have. (You don't even need `set()`.) And if your stuff is in another dataframe column already, it would be like this: `df_subset = df[~df['your column'].isin(df2["Unique ID"])]`. Or at least along those lines (a consolidated sketch appears after this thread). – Wayne Jan 21 '22 at 18:33
  • I forgot to add a note about the `~` symbol I threw in there. `~` inverts the boolean values; it is similar to using `not` in a conditional expression. So with the `~` there, the rows retained are the ones that don't match. Removing the `~` would filter to only the rows with values that are already in the column of the other dataframe. – Wayne Jan 21 '22 at 18:47
  • @Wayne Thank you so much for your patience! So I have tried `df = df[~df.iloc[:, 17].isin([uniqueidincsv])]` but it still writes the duplicate data. Putting in the column name kept giving me a KeyError, so I resorted to using iloc... – GBM Jan 21 '22 at 19:03
  • When you say it 'still writes the duplicate data', do you mean the dataframe leaves in the duplicate data, or that `df.to_csv()` isn't resulting in the correct thing after? Something isn't adding up. This is where [an MRE](https://stackoverflow.com/help/minimal-reproducible-example) would help. I cannot debug what I don't have. – Wayne Jan 21 '22 at 19:11
  • @Wayne I mean `df.to_csv` is still including the data which is already in the CSV, instead of excluding it - I hasten to add I am not getting any errors. Might it be useful if I posted the code in its entirety? – GBM Jan 21 '22 at 19:26
  • @Wayne Not sure if I've done it in the correct Stack Overflow way, but I have edited my question and included the entire code for you... – GBM Jan 21 '22 at 19:32
  • But is the filtering working? You need to check the state of what you have before you do the `df.to_csv()` step. You need to determine whether you are trying to debug the filtering or the writing of the file at this point. And about your post: you probably shouldn't be posting an API key. That's in part why an MRE is recommended; part of making one is helping you sort out your own issues on toy data. But with what you sent, the df2 creation doesn't work correctly. You cannot add a single column when you already have more columns there. It gives an error and no df2 gets made. – Wayne Jan 21 '22 at 20:25
  • Plus, you pretty much never want to be using `mode='a'`. You want to create a new file. Of course the old data will be there if you append to the old data. Also, it's a bad idea to replace a file with a new file when working something out. You want to keep your original data file so you can compare, especially when developing. Save the final data to a new file name. – Wayne Jan 21 '22 at 20:29
  • `df.iloc[:, 17]` only has 'ZT2OZG' and '40HAS3' in it. So the first two should get filtered out? – Wayne Jan 21 '22 at 20:36
  • @Wayne Yes, precisely - 'ZT2OZG' and '40HAS3' should get filtered out and NOT be sent to the csv because they're already in the csv... – GBM Jan 21 '22 at 20:41
  • @Wayne Those are the unique identifiers, so if they exist in `df.iloc[:, 17]`, then we want to filter them out and not write them into the CSV – GBM Jan 21 '22 at 20:43
  • Here's a Jupyter notebook showing that what was being suggested by @Barmar and myself works: https://gist.github.com/fomightez/06113e2516e455eb78ded9028f9f2c7e I removed the identifying code; you'll need to put it back for the code to work. (It's in the first 10 or so lines.) – Wayne Jan 21 '22 at 21:17
  • @Wayne Problem solved! Thank you so much - what I have done is `df_subset = df[~df.iloc[:, 17].isin(df2["Unique ID"])]`, and to write to the CSV I have used `df_subset` instead of `df`, i.e. `df_subset.to_csv(output_name+".csv", mode='a', header=None, index=False)`. Thank you so much for your patience and help! Also, on a final note, you mentioned it isn't advisable to use `mode='a'` - what would you suggest I use instead of appending the data into the existing file? – GBM Jan 21 '22 at 22:07
  • @Barmar Thank you so much for your help as well - finally working! – GBM Jan 21 '22 at 22:07
  • Yes, if you need it appended in the end, that's fine, especially if it's too gigantic to make bringing it into Pandas just to write it back out effective. However, there are usually better ways to go about things. In general, setting up to append one row at a time isn't how you use Pandas, and it can indeed make things super slow. I'd use the whole dataframe in pandas and write it back out to replace the old file. You'll need the whole thing to use it effectively anyway. Especially watch out with appending during development, because if you aren't writing anything, you won't really know that. – Wayne Jan 21 '22 at 22:19
  • @Wayne Brilliant - thank you for the very informative explanation - I shall take those pointers into account before deploying to Live. You've been exceedingly helpful! – GBM Jan 21 '22 at 22:41
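
Pulling the thread together, here is a rough, self-contained sketch of the approach that emerged in the comments (the file name, the "Unique ID" column, and its position at index 17 follow the question; `build_dataframe()` is a hypothetical stand-in for the code that assembles `df` from the API data):

    import os
    import pandas

    csv_path = "MyData.csv"  # destination file, as in the question
    df = build_dataframe()   # hypothetical: however df is assembled above

    if os.path.exists(csv_path):
        # IDs already written to the CSV
        existing = pandas.read_csv(csv_path, usecols=["Unique ID"])
        # ~ inverts .isin(): keep only rows whose Unique ID is NOT in the CSV yet
        df = df[~df.iloc[:, 17].isin(existing["Unique ID"])]

    # Append just the new rows; header=False avoids repeating the header line
    df.to_csv(csv_path, mode='a', header=False, index=False)

As Wayne notes above, an alternative (safer during development) is to read the whole CSV into a dataframe, combine it with the new rows, de-duplicate, and write the result to a new file rather than appending in place.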

0 Answers