I am attempting to construct dataframes using large amount of data stored in txt files. I did not construct the data, however, so I am having to work with the frustrating formatting contained within. I couldn't get my code to work within the large data (and almost crashed my computer doing so), so set up a smaller dataframe like so:
'Value' ID_1 ID_2
0 11122222 ABC42123 33333
1 21219299 YOF21233 88821
2 00022011 ERE00091 23124
3 75643311;21233332 ROB21288 99421
4 12412421 POW94277 12231;33221
5 54221721 IRS21231;YOU28137 13123
My frustration lies in the use of semicolons in the data. The data is meant to represent IDs, but multiple IDs have been assigned to multiple variables. I want to repeat these rows so that I can search through the data for individual IDs and have a datatable that looks like so:
'Value' ID_1 ID_2
0 11122222 ABC42123 33333
1 21219299 YOF21233 88821
2 00022011 ERE00091 23124
3 75643311 ROB21288 99421
4 21233332 ROB21288 99421
5 12412421 POW94277 12231
6 12412421 POW94277 33221
7 54221721 YOU28137 13123
8 54221721 IRS21231 13123
Reindexing is not a problem, so long as the different IDs stay linked to each other and to their correct values.
Unfortunately, all my attempts to split the data have, so far, ended in abject failure. I have managed to set up a function that repeats data containing a semicolon, parse that through my function for each column, but then fail to split the data afterwards.
def delete_dup(df,column):
for a in column:
location = df.loc[df.duplicated(subset= column, keep=False)]
for x in location:
semicolon = df.loc[df[column].str.contains(';', regex=True)]
duplicate = semicolon.duplicated(subset= column, keep='first')
tiny_df = semicolon.loc[duplicate]
split_up = tiny_df[column].str.split(';')
return pd.concat([df, split_up])
'Value' ID_1 ID_2 0
11122222 ABC42123 33333 NaN
21219299 YOF21233 88821 NaN
00022011 ERE00091 23124 NaN
75643311;21233332 ROB21288 99421 NaN
12412421 POW94277 12231;33221 NaN
54221721 IRS21231;YOU28137 13123 NaN
75643311;21233332 ROB21288 99421 NaN
54221721 IRS21231;YOU28137 13123 NaN
12412421 POW94277 12231;33221 NaN
NaN NaN NaN [75643311, 21233332]
I feel like this is the closest I've come and it's still nowhere near what I want. Any "If" statements I try to do on dataframes are met with the "ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." error, which is so frustrating to read. Any ideas on how to make pandas do what I want?