0

I have world trade data(import/export), approximately 1.5 million rows and I want to cut this number. I want to divide values to import_val and export_val. I created Export_Value column and copied values where df['Trade Flow'].str.contains('Export'). Now I want to concatenate 2 rows if the Partner country and the Commodity code are the same for Reporter country. The initial data:

    Year Trade Flow    Reporter               Partner Commodity Code                                          Commodity  Trade Value (US$) Export_Value
    0   2011     Import     Algeria               Belgium         010110  Live horses/asses/mules/hinnies: pure-bred bre...             397865             
    1   2011     Import     Algeria                France         010110  Live horses/asses/mules/hinnies: pure-bred bre...             269558             
    2   2011     Import     Andorra                France         010110  Live horses/asses/mules/hinnies: pure-bred bre...               7155             
    3   2011     Export     Andorra                France         010110  Live horses/asses/mules/hinnies: pure-bred bre...                418          418
    4   2011     Export     Andorra                 Spain         010110  Live horses/asses/mules/hinnies: pure-bred bre...               7668         7668
    5   2011     Import      Angola                Brazil         010110  Live horses/asses/mules/hinnies: pure-bred bre...             175400             
    6   2011     Import      Angola               Namibia         010110  Live horses/asses/mules/hinnies: pure-bred bre...              23588             
    7   2011     Import      Angola              Portugal         010110  Live horses/asses/mules/hinnies: pure-bred bre...               2190             
    8   2011     Import  Azerbaijan               Belgium         010110  Live horses/asses/mules/hinnies: pure-bred bre...             338753             
    9   2011     Export  Azerbaijan            Kazakhstan         010110  Live horses/asses/mules/hinnies: pure-bred bre...              22050        22050
    10  2011     Import  Azerbaijan    Russian Federation         010110  Live horses/asses/mules/hinnies: pure-bred bre...               1270             
    11  2011     Export  Azerbaijan                Turkey         010110  Live horses/asses/mules/hinnies: pure-bred bre...               2899         2899

and I want(please see Andorra):

     Year Trade Flow    Reporter               Partner Commodity Code                                          Commodity  Trade Value (US$)      Export_Value
0   2011     Import     Algeria               Belgium         010110  Live horses/asses/mules/hinnies: pure-bred bre...             397865             
1   2011     Import     Algeria                France         010110  Live horses/asses/mules/hinnies: pure-bred bre...             269558             
2   2011     Import     Andorra                France         010110  Live horses/asses/mules/hinnies: pure-bred bre...               7155          418  
3   2011     Export     Andorra                 Spain         010110  Live horses/asses/mules/hinnies: pure-bred bre...               7668         7668

I deleted the 3 rd row from the initial data and export and import values at the same row for the same partner country. How can I do that? Any suggestions??

Mikoupgrade
  • 65
  • 1
  • 9
  • You could drop duplicates on those columns: `df = df.drop_duplicates(subset=['Reporter', 'Partner'])` – cs95 Jun 17 '20 at 05:58
  • @cs95 thanks for your reply. However, I do not want to just drop, I want to separate Trade Value (US$) column into Export and Import. To do this I just want to copy export values to new column and rename Trade Value (US$) to Import_values. Furthermore, I want one row for trade values if Partner country same for export and import to appropriate Reporting country. Please look carefully the initial data and Resulted data. Thanks – Mikoupgrade Jun 17 '20 at 06:02
  • Is it possible to have multiple exports from Andorra to France? Or is this already the total? – maow Jun 17 '20 at 06:53
  • @maow For the '010110' Commodity code, it is total. I forget to mention the Commodity code. So Reporter, Partner, and Commodity code have to be the same. – Mikoupgrade Jun 17 '20 at 06:55

2 Answers2

0

How about simply using a pivot? You wouldn't even need your Export_value column anymore As long as the other columns are the same, you can just add them to the index parameter to keep them. If they are different, you will split up the Trade Value for each of them.

In [34]: df1 = pd.pivot_table(df, index=['Reporter', 'Partner'], columns='Trade Flow', values=['Trade Value (US$)'])

In [35]: df1
Out[35]: 
                              Trade Value (US$)          
Trade Flow                               Export    Import
Reporter   Partner                                       
Algeria    Belgium                          NaN  397865.0
           France                           NaN  269558.0
Andorra    France                         418.0    7155.0
           Spain                         7668.0       NaN
Angola     Brazil                           NaN  175400.0
           Namibia                          NaN   23588.0
           Portugal                         NaN    2190.0
Azerbaijan Belgium                          NaN  338753.0
           Kazakhstan                   22050.0       NaN
           Russian Federation               NaN    1270.0
           Turkey                        2899.0       NaN

To get back a table simple get rid of the column multiindex and reset the pivot index.

In [36]: df1.droplevel(0, axis = 1).reset_index()
Out[36]: 
Trade Flow    Reporter             Partner   Export    Import
0              Algeria             Belgium      NaN  397865.0
1              Algeria              France      NaN  269558.0
2              Andorra              France    418.0    7155.0
3              Andorra               Spain   7668.0       NaN
4               Angola              Brazil      NaN  175400.0
5               Angola             Namibia      NaN   23588.0
6               Angola            Portugal      NaN    2190.0
7           Azerbaijan             Belgium      NaN  338753.0
8           Azerbaijan          Kazakhstan  22050.0       NaN
9           Azerbaijan  Russian Federation      NaN    1270.0
10          Azerbaijan              Turkey   2899.0       NaN
Viknesh S K
  • 101
  • 3
  • 9
maow
  • 2,712
  • 1
  • 11
  • 25
0

Below is another approach by, identifying duplicates to obtain the requirement.

  1. creating a column Export Value.
df['Export_Value'] = np.where(df['Trade Flow'] == 'Export', df['Trade Value (US$)'], np.nan)
  1. Identifying the Duplicates. Using backfill, getting export value.
df1 = df[df.duplicated(subset = ['Reporter', 'Partner'], keep = False)]

df1.loc[:, 'Export_Value'] = df1.loc[:, 'Export_Value'].bfill()

df1.drop_duplicates(subset = ['Reporter', 'Partner'], keep = 'first', inplace = True)
  1. Appending back to original dataset - df. To retain req. columns, deleting unnecessary records.
df = df.append(df1)

df.drop_duplicates(subset = ['Reporter', 'Partner'], keep = 'last', inplace = True)
Viknesh S K
  • 101
  • 3
  • 9