0

I have a data frame(in csv file) with two columns each containing lists(of variable length) in string format. I am providing the link to the google drive where I have stored the csv file for reference https://drive.google.com/file/d/1Hdu04JdGpPqG9_k6Mjx_1XNLBvogXfnN/view?usp=sharing The dataframe looks like this

              Opp1                                               Opp2
0       ['KingdomofPoland','GrandDuchyofLithuania']       ['Georgia']
1       ['NorthernYuanDynasty']                           ['Georgia']
2       ['SpanishEmpire','CaptaincyGeneralofChile']       ['ChechenRepublic']

... ... ...
3409    ['Turkey','SyrianOpposition']                     ['CatholicLeague','SpanishEmpire']
3410    ['Egypt','UnitedArabEmirates']                    ['SpanishEmpire']
3411    ['Turkey','SyrianOpposition']                     ['SpanishEmpire']
3412    ['UnitedStates','UnitedKingdom','SaudiArabia']    ['SpanishEmpire']
3413    ['Turkey']                                        ['Russia']
3414 rows × 2 columns

The columns values are strings, I figured that out when I do

Input - df['Opp1'][0][0]
Out - '['

Output is given as '['. Instead the output should be the first element of the list of first row i.e 'KingdomofPoland'.

After solving this issue, I want to create a new column by combining elements of lists from each row of Opp1 and Opp2 columns. The elements of each row in Opp1 column are the name of countries and empires that were involved in a war with the corresponding country/empire of the same row in Opp2 column. So basically a new column with row entries as

             new_col
0     ['KingdomofPoland', 'Georgia']
0     ['GrandDuchyofLithuania', 'Georgia']
1     ['NorthernYuanDynasty', 'Georgia']
2     ['SpanishEmpire', 'ChechenRepublic']
2     ['CaptaincyGeneralofChile', 'ChechenRepublic']
... ... ...
3409    ['Turkey', 'CatholicLeague']
3409    ['Turkey', 'SpanishEmpire']
3409    ['SyrianOpposition', 'CatholicLeague]
3409    ['SyrianOpposition', 'SpanishEmpire']
3410    ['Egypt','SpanishEmpire']
3410    ['UnitedArabEmirates','SpanishEmpire']
3411    ['Turkey', 'SpanishEmpire']
3411    ['SyrianOpposition', 'SpanishEmpire']
.................

This will essentially introduce new rows as we are kind of exploding the Opp1 and Opp2 columns simultaneously iterating over there rows elements.

The end goal is to get an edge list of countries that were involved in a specific war represented by the original Opp1(opposition 1) and Opp2(opposition2) columns. Each entity(country) from Opp1 row list should be attached to each entity(country) of Opp2 row list. The final dataset will be used on Gephi as edge lists.

I am a beginner in data analysis with python. till now I have been cleaning my dataset manually which has consumed upteen precious hours. Can anyone help me with this.

Note - There are multiple similar entries in each row of Opp1 and Opp2 columns as same countries fought wars many times in different years.

I am attaching the pic for df_types of my dataframe as requested.enter image description here

Aalekh Roy
  • 29
  • 6
  • 1
    Does this answer your question? [How to explode a list inside a Dataframe cell into separate rows](https://stackoverflow.com/questions/32468402/how-to-explode-a-list-inside-a-dataframe-cell-into-separate-rows) – Nick_Z0 Jul 20 '22 at 14:37
  • @Nick_Z0 using .explode on any of the two columns Opp1 and Opp2 do nothing. The row entries of each columns are not lists but a string formatted as a list as I have explained in the question. When I enquire about the first element of first row of column Opp1, it doesn't return the element name instead it return '[' as the output. so I belive that's why explode() is not working. – Aalekh Roy Jul 20 '22 at 14:45

1 Answers1

1

IIUC, do some data clean up by remove an intra-string single quote. And, then use library yaml to convert your string to actual list in each pandas dataframe cell with applymap. Lastly, apply explode to your dataframe twice once for each column you want to expand.

import yaml
import pandas as pd

df = pd.read_csv('Downloads/nodes_list.csv', index_col=[0])

df['Opp1'] = df['Opp1'].str.replace("[\'\"]s",'s', regex=True)
df['Opp2'] = df['Opp2'].str.replace("[\'\"]s",'s', regex=True)

df = df.applymap(yaml.safe_load)

df_new = df.explode('Opp1').explode('Opp2').apply(list, axis=1)

df_new

Output:

0                       [KingdomofPoland, Georgia]
0                 [GrandDuchyofLithuania, Georgia]
1                   [NorthernYuanDynasty, Georgia]
2                 [SpanishEmpire, ChechenRepublic]
2       [CaptaincyGeneralofChile, ChechenRepublic]
                           ...                    
3411             [SyrianOpposition, SpanishEmpire]
3412                 [UnitedStates, SpanishEmpire]
3412                [UnitedKingdom, SpanishEmpire]
3412                  [SaudiArabia, SpanishEmpire]
3413                              [Turkey, Russia]
Length: 31170, dtype: object
Scott Boston
  • 147,308
  • 15
  • 139
  • 187