I am attempting to construct dataframes from a large amount of data stored in txt files. I did not construct the data, however, so I am having to work with the frustrating formatting it contains. I couldn't get my code to work on the full data (and almost crashed my computer trying), so I set up a smaller dataframe like so:

    'Value'             ID_1                ID_2
0   11122222            ABC42123            33333
1   21219299            YOF21233            88821
2   00022011            ERE00091            23124
3   75643311;21233332   ROB21288            99421
4   12412421            POW94277            12231;33221
5   54221721            IRS21231;YOU28137   13123

My frustration lies in the use of semicolons in the data. Each field is meant to hold a single ID, but in several cells multiple IDs have been packed together, separated by semicolons. I want to repeat these rows so that I can search the data for individual IDs, giving a dataframe that looks like so:

    'Value'             ID_1                ID_2
0   11122222            ABC42123            33333
1   21219299            YOF21233            88821
2   00022011            ERE00091            23124
3   75643311            ROB21288            99421
4   21233332            ROB21288            99421
5   12412421            POW94277            12231
6   12412421            POW94277            33221
7   54221721            YOU28137            13123
8   54221721            IRS21231            13123

Reindexing is not a problem, so long as the different IDs stay linked to each other and to their correct values.
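
For reference, on pandas 0.25 or newer (released after this question was asked) this exact reshaping can be written with `str.split` plus `DataFrame.explode`; a minimal sketch on the sample frame, assuming every column holds strings:

import pandas as pd

df = pd.DataFrame({
    "'Value'": ['11122222', '21219299', '00022011', '75643311;21233332', '12412421', '54221721'],
    'ID_1': ['ABC42123', 'YOF21233', 'ERE00091', 'ROB21288', 'POW94277', 'IRS21231;YOU28137'],
    'ID_2': ['33333', '88821', '23124', '99421', '12231;33221', '13123'],
})

# Split every cell on ';' and explode each column in turn; a row is
# repeated once per ID it contains.
for col in df.columns:
    df[col] = df[col].str.split(';')
    df = df.explode(col)

df = df.reset_index(drop=True)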

Unfortunately, all my attempts to split the data have so far ended in abject failure. I have managed to set up a function that repeats the rows containing a semicolon and to run each column through it, but I then fail to split the data afterwards.

def delete_dup(df, column):
    for a in column:
        location = df.loc[df.duplicated(subset=column, keep=False)]
        for x in location:
            semicolon = df.loc[df[column].str.contains(';', regex=True)]
            duplicate = semicolon.duplicated(subset=column, keep='first')
            tiny_df = semicolon.loc[duplicate]

            split_up = tiny_df[column].str.split(';')

            return pd.concat([df, split_up])



  'Value'              ID_1              ID_2           0
  11122222             ABC42123          33333          NaN
  21219299             YOF21233          88821          NaN
  00022011             ERE00091          23124          NaN
  75643311;21233332    ROB21288          99421          NaN
  12412421             POW94277          12231;33221    NaN
  54221721             IRS21231;YOU28137 13123          NaN
  75643311;21233332    ROB21288          99421          NaN
  54221721             IRS21231;YOU28137 13123          NaN
  12412421             POW94277          12231;33221    NaN
  NaN                  NaN               NaN            [75643311, 21233332]

I feel like this is the closest I've come, and it's still nowhere near what I want. Any `if` statements I try to use on dataframes are met with "ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().", which is so frustrating to read. Any ideas on how to make pandas do what I want?
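
As an aside, that ValueError shows up whenever a boolean Series (or DataFrame) lands where Python expects a single True/False, such as an `if` condition. Reducing it with `.any()` or `.all()` is the usual fix; a minimal illustration:

mask = df['ID_1'].str.contains(';')
if mask.any():  # collapse the boolean Series to a single truth value
    print('at least one ID_1 cell holds multiple IDs')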

  • Look at https://stackoverflow.com/questions/17116814/pandas-how-do-i-split-text-in-a-column-into-multiple-rows#17116976 - this tells you how to split. If the data sets are so long, that you can't load them into memory, this will not help much. In that case, reading the file line by line and writing a fixed file might do the trick, because you don't need the full file in memory. You can also read the file a chunks of data to avoid memory problems. – 576i Oct 03 '18 at 16:15
  • Just to be clear, shouldn't the row with `IRS21231` come before the row with `YOU28137`? In other words, the last two rows should be swapped. – rahlf23 Oct 03 '18 at 17:19
  • There are some great answers here: https://stackoverflow.com/a/40449726/8146556 – rahlf23 Oct 03 '18 at 17:34
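
Following the first comment's suggestion, the raw file can also be repaired in a single streaming pass that never holds the whole table in memory; a sketch assuming a tab-delimited layout and hypothetical file names:

import csv
from itertools import product

# 'data.txt', 'fixed.txt' and the tab delimiter are assumptions; adjust
# them to the real file layout.
with open('data.txt', newline='') as src, open('fixed.txt', 'w', newline='') as dst:
    reader = csv.reader(src, delimiter='\t')
    writer = csv.writer(dst, delimiter='\t')
    for row in reader:
        # Write one output row per combination of ';'-separated values, so a
        # row with several multi-ID cells expands to their cartesian product.
        for combo in product(*(cell.split(';') for cell in row)):
            writer.writerow(combo)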

2 Answers


Perhaps not the most elegant way, but it solves the problem:

Step 1

Data we have:

df
    'Value'             ID_1                ID_2
0   11122222            ABC42123            33333
1   21219299            YOF21233            88821
2   00022011            ERE00091            23124
3   75643311;21233332   ROB21288            99421
4   12412421            POW94277            12231;33221
5   54221721            IRS21231;YOU28137   13123

Step 2

Let's split misbehaving columns

df["'Value'_Dupe"] = df["'Value'"].apply(lambda x: x.split(";")[1] if len(x.split(";"))>1 else np.NaN)
df["'Value'"] = df["'Value'"].apply(lambda x: x.split(";")[0])
df["ID_1_Dupe"] = df["ID_1"].apply(lambda x: x.split(";")[1] if len(x.split(";"))>1 else np.NaN)
df["ID_1"] = df["ID_1"].apply(lambda x: x.split(";")[0])
df["ID_2_Dupe"] = df["ID_2"].apply(lambda x: x.split(";")[1] if len(x.split(";"))>1 else np.NaN)
df["ID_2"] = df["ID_2"].apply(lambda x: x.split(";")[0])
df

    'Value'     ID_1        ID_2    'Value'_Dupe    ID_1_Dupe   ID_2_Dupe
0   11122222    ABC42123    33333   NaN             NaN          NaN
1   21219299    YOF21233    88821   NaN             NaN          NaN
2   00022011    ERE00091    23124   NaN             NaN          NaN
3   75643311    ROB21288    99421   21233332        NaN          NaN
4   12412421    POW94277    12231   NaN             NaN          33221
5   54221721    IRS21231    13123   NaN             YOU28137     NaN
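
The six `apply` lines above can also be collapsed into a loop over the affected columns; a compact sketch that should be equivalent (both versions keep only the first two ';'-separated pieces of a cell):

for col in ["'Value'", "ID_1", "ID_2"]:
    parts = df[col].str.split(";")
    df[col + "_Dupe"] = parts.str[1]  # second piece, NaN when there is none
    df[col] = parts.str[0]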

Step 3

Let's combine the dupes with the original data into a single dataframe:

df2 = df[pd.notna(df["'Value'_Dupe"])][["'Value'_Dupe","ID_1","ID_2"]]
df2.columns = ["'Value'","ID_1","ID_2"]

df3 = df[pd.notna(df["ID_1_Dupe"])][["'Value'","ID_1_Dupe","ID_2"]]
df3.columns = ["'Value'","ID_1","ID_2"]

df4 = df[pd.notna(df["ID_2_Dupe"])][["'Value'","ID_1","ID_2_Dupe"]]
df4.columns = ["'Value'","ID_1","ID_2"]

df5 = df[["'Value'","ID_1","ID_2"]]

df_result = pd.concat([df5,df2,df3,df4])
df_result



    'Value'     ID_1        ID_2
0   11122222    ABC42123    33333
1   21219299    YOF21233    88821
2   00022011    ERE00091    23124
3   75643311    ROB21288    99421
4   12412421    POW94277    12231
5   54221721    IRS21231    13123
3   21233332    ROB21288    99421
5   54221721    YOU28137    13123
4   12412421    POW94277    33221
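
If the original row order matters, the repeated index labels that `concat` leaves behind can be used to pull each duplicate next to its source row; a small follow-up sketch using a stable sort:

df_result = df_result.sort_index(kind='mergesort').reset_index(drop=True)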

Please let me know if this solves your problem.

Sergey Bushmanov
  • I can't seem to progress beyond step 2 here as it's throwing up an AttributeError at line 2. 'float' object has no attribute 'split'. I've tried running .astype on the data and it confirms they are all "Python Objects". How would I convert them to strings? – Sebastian Applewhite Oct 04 '18 at 09:31
  • This code should run for the data provided in the example without an error. Should you happen to have different data, without semicolon in one of the columns, `df = df.astype(str)` should be helpful. – Sergey Bushmanov Oct 04 '18 at 12:38

There are two parts to the solution. The first is to identify which rows have a semicolon, and the second is to create the additional rows and concatenate them. The first part is done in `contains_sc`, and the second is done by iterating over the rows and running `create_additional_rows` whenever a row with a semicolon is detected.

Hope this helps.

In[6]: import pandas as pd

In[7]: df = pd.DataFrame(
  [['1', '2;3', '4', '5'],
  ['A', 'B', 'C', 'D;E'],
  ['T', 'U', 'V;W', 'X']],
  index=['Val', 'ID1', 'ID2']
).T

In[8]: df

Out[8]: 
   Val  ID1  ID2
0    1    A    T
1  2;3    B    U
2    4    C  V;W
3    5  D;E    X

In[9]: contains_sc = df.apply(lambda x: x.str.contains(';'))
In[10]: contains_sc
Out[10]: 
     Val    ID1    ID2
0  False  False  False
1   True  False  False
2  False  False   True
3  False   True  False

In[11]: 
def create_additional_rows(data_row, csc_row, split_char=';'):
    """Given a duplicated row return additional de-duplicated rows."""
    if len(csc_row[csc_row].dropna()) > 1:
        raise ValueError('Expect only a single column with a semicolon')
    col_with_sc = csc_row[csc_row].dropna().index[0]
    retval = []
    for item in data_row.loc[col_with_sc].split(split_char):
        copied = data_row.copy()
        copied.loc[col_with_sc] = item
        retval.append(copied)
    return retval

In[12]:
new_rows = []
for (idx, data_row), (_, csc_row) in zip(df.iterrows(), contains_sc.iterrows()):
    if True not in csc_row.values:
        new_rows.append(data_row)
        continue
    new_rows.extend(create_additional_rows(data_row, csc_row))

final = pd.concat(new_rows, axis='columns').T.reset_index(drop=True)

In[13]: final
Out[13]: 
  Val ID1 ID2
0   1   A   T
1   2   B   U
2   3   B   U
3   4   C   V
4   4   C   W
5   5   D   X
6   5   E   X
Karthik V
  • Thank you for this answer, this works perfectly on my test data, however my larger dataset contains multiple columns with semicolons. What would you recommend as the quickest way around that? pd.concat? – Sebastian Applewhite Oct 04 '18 at 09:56
  • If I cut columns of my data to test it with only single columns of semicolons, I am now receiving an "IndexError: index 0 is out of bounds for axis 0 with size 0", pointing to ----> 5 col_with_sc = csc_row[csc_row].dropna().index[0] and ----> 7 new_rows.extend(create_additional_rows(data_row, csc_row)) – Sebastian Applewhite Oct 04 '18 at 10:20
  • I believe the mistake came from my error in transcribing it into functions. Thank you, this works. – Sebastian Applewhite Oct 04 '18 at 10:31
  • @SebastianApplewhite If you can have multiple columns in the same row with semicolons, then I am guessing you'd want the outer product of the repetition. i.e. ColX --> 'A;B', ColY --> 'C;D', then you'd want (A, C), (A, D), (B, C), and (B,D)? If so, take a look at itertools.product in the stdlib and adjust the iteration in create_additional_rows accordingly. – Karthik V Oct 04 '18 at 15:00
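
Building on the last comment, a hypothetical variant of `create_additional_rows` that handles several semicolon-bearing columns in one row by taking their cartesian product (a sketch, not part of the original answer):

from itertools import product

def create_additional_rows_multi(data_row, csc_row, split_char=';'):
    """Expand a row with any number of semicolon cells into one row
    per combination of the split values (sketch)."""
    cols_with_sc = list(csc_row[csc_row].index)  # columns flagged True
    pieces = [data_row.loc[col].split(split_char) for col in cols_with_sc]
    retval = []
    for combo in product(*pieces):
        copied = data_row.copy()
        for col, item in zip(cols_with_sc, combo):
            copied.loc[col] = item
        retval.append(copied)
    return retval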