
I have a dataframe with words, an id, and a code for the language. I ran this dataframe through a spellchecking algorithm, which returned a dataframe containing only the words that needed a correction, and now I need to add the corrected words back to the original dataframe so that each corrected word replaces the wrong one (I'm guessing I need to match on the id somehow).

Does anyone know how to solve this?

For now I have just appended the new dataframe back to the original, and there's a number at the end for some reason...

1 Answer


Without a sample it's hard to answer, but you can try:

pd.concat([original_df, fixed_df]).drop_duplicates(['id', 'code'], keep='last')
>>> df1
   words  id code
0   helo   1    A
1  world   2    B

>>> df2
   words  id code
0  hello   1    A

>>> pd.concat([df1, df2]).drop_duplicates(['id', 'code'], keep='last')
   words  id code
1  world   2    B
0  hello   1    A
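
The transcript above can be reproduced as a short script (the column names `words`, `id`, and `code` follow the example, not the asker's real data):

```python
import pandas as pd

# Original dataframe: one row per word, with an id and a language code.
df1 = pd.DataFrame({"words": ["helo", "world"], "id": [1, 2], "code": ["A", "B"]})

# Spellchecker output: only the rows that needed a correction.
df2 = pd.DataFrame({"words": ["hello"], "id": [1], "code": ["A"]})

# Stack both frames, then keep only the last occurrence of each (id, code)
# pair, so corrected rows from df2 replace their originals from df1.
fixed = pd.concat([df1, df2]).drop_duplicates(["id", "code"], keep="last")
print(fixed)
```

Rows of df1 that needed no correction appear only once and are kept unchanged; corrected rows appear twice and `keep="last"` keeps the df2 version.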
Corralien
  • This looks like it should work but for some reason it does the same thing as when I use append. I also get this error message 'sys:1: DtypeWarning: Columns (0,1) have mixed types.Specify dtype option on import or set low_memory=False.' – CarinaTheBookworm Jul 03 '21 at 07:53
  • What is the output of `df1.info()` and `df2.info()`? – Corralien Jul 03 '21 at 08:08
  • df1 is RangeIndex: 227656 entries, 0 to 227655 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 227656 non-null object 1 name 227653 non-null object 2 language 227654 non-null object dtypes: object(3) memory usage: 5.2+ MB – CarinaTheBookworm Jul 03 '21 at 08:24
  • df2 is RangeIndex: 58369 entries, 0 to 58368 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 58369 non-null int64 1 id 58369 non-null int64 2 name 58366 non-null object 3 language 58369 non-null object dtypes: int64(2), object(2) memory usage: 1.8+ MB – CarinaTheBookworm Jul 03 '21 at 08:24
  • Can you try that: `pd.concat([df1, df2[df1.columns].astype({'id': str})])` – Corralien Jul 03 '21 at 08:36
  • Ok, this seems to solve the problem of the extra column (Unnamed: 0) at the end and there's no number anymore, but when I use this line together with drop_duplicates, the drop_duplicates does not seem to work – CarinaTheBookworm Jul 03 '21 at 08:48
  • Also as soon as I try the drop_duplicates on the new df I get the same last Unnamed column and the duplicates are still there – CarinaTheBookworm Jul 03 '21 at 08:53
  • Try to write your dataframes to csv file and reload it. `df1.to_csv("df1.csv", index=False)` and `df2[df1.columns].to_csv("df2.csv", index=False)` then `pd.concat([pd.read_csv("df1.csv"), pd.read_csv("df2.csv")]).drop_duplicates(['id', 'code'], keep='last')` – Corralien Jul 03 '21 at 12:57
  • you mean df1 = df1.to_csv("df1.csv", index=False), df2 = df2[df1.columns].to_csv("df2.csv", index=False) and then pd.concat([pd.read_csv("df1.csv") and pd.read_csv("df2.csv")]).drop_duplicates(['id', 'code'], keep='last') ? – CarinaTheBookworm Jul 03 '21 at 15:30
  • I keep getting this error NoneType' object has no attribute 'columns' – CarinaTheBookworm Jul 03 '21 at 15:34
  • I also tried solving this with for loops but it is a nightmare runtime-wise... It won't finish running in 20 mins, which is long even for those 200k-row dataframes I think – CarinaTheBookworm Jul 03 '21 at 15:35
  • If you want to share your 2 csv files, I can take a look. – Corralien Jul 03 '21 at 16:36
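
A sketch of the column/dtype alignment discussed in the comments, using toy stand-ins for the real frames (the `id`, `name`, and `language` columns come from the `info()` output above; the row values are invented). The fix drops the stray `Unnamed: 0` column and casts `id` to a common type so `drop_duplicates` can actually match rows:

```python
import pandas as pd

# Toy stand-ins: df1 was read with id as strings (object dtype); df2 was
# read with a leftover "Unnamed: 0" index column and id parsed as int64,
# which is the mismatch behind the DtypeWarning in the comments.
df1 = pd.DataFrame({"id": ["1", "2"],
                    "name": ["helo", "world"],
                    "language": ["A", "B"]})
df2 = pd.DataFrame({"Unnamed: 0": [0],
                    "id": [1],
                    "name": ["hello"],
                    "language": ["A"]})

# Keep only df1's columns (discarding "Unnamed: 0") and cast id to str so
# the duplicate check compares like with like ("1" == "1", not "1" != 1).
df2_aligned = df2[df1.columns].astype({"id": str})

fixed = pd.concat([df1, df2_aligned]).drop_duplicates(["id", "language"],
                                                      keep="last")
print(fixed)
```

Without the `astype`, the string `"1"` from df1 and the integer `1` from df2 are not duplicates, so both rows survive, which matches the behaviour the asker reported.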