1

sample image to check the data

SAMPLE DATA

Actual_Data     Final_Unique
8,8,8,8,8,8,        8,8
6,7,7,7,6,7,        6,7
2,12,3,4,12,12,     7,7
14,14,14,14,14,14,  7,6
1,1,12,2,2,4,       2,12
6,8,8,8,8,12,       12,3
6,6,6,6,3,14,       3,4
1,14,14,1,1,2,      4,12
1,1,1,1,1,14,       12,12

i am trying this

newdf = a.pivot(index='Actual_Data', columns='Final_Unique')

i have 2 column actual data and final_unique column in actual i have 44000 rows and final_unique has 266 rows i want to get the 266 column and actual data to stay same as it and count how many time actual data appears in the header

error getting

duplicate data exist can't reshape

output expected

ACTUAL DATA          8,8  6,7  7,7   7,6  2,12

8,8,8,8,8,8,          3   0    0      0    0

6,7,7,7,6,7,          0    2    1     1    0

2,12,3,4,12,12,       0    0    0     0    1
Quicklearner.gk
  • 133
  • 1
  • 1
  • 8
  • If you give the input data as text so that I can copy and paste it, I could try to answer. You should read [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Serge Ballesta Feb 17 '20 at 09:18
  • ??? i didn't got you – Quicklearner.gk Feb 17 '20 at 10:02
  • You have only given an **image** of your input. Please put it directly as text in the question itself. I am much too lazy to write it by hand... – Serge Ballesta Feb 17 '20 at 10:13

2 Answers2

1

pivot_table allows an aggregate function, so if you only add on column containing 1 to you dataframe, you could get the expected result:

newdf = a.assign(val=1).pivot_table(values='val', index='Actual_Data',
                                    columns='Final_Unique', aggfunc=sum, fill_value=0
                                    ).reset_index().rename_axis(None, axis=1)

With the sample data it gives:

          Actual_Data  12,12  12,3  2,12  3,4  4,12  6,7  7,6  7,7  8,8
0       1,1,1,1,1,14,      1     0     0    0     0    0    0    0    0
1       1,1,12,2,2,4,      0     0     1    0     0    0    0    0    0
2      1,14,14,1,1,2,      0     0     0    0     1    0    0    0    0
3  14,14,14,14,14,14,      0     0     0    0     0    0    1    0    0
4     2,12,3,4,12,12,      0     0     0    0     0    0    0    1    0
5       6,6,6,6,3,14,      0     0     0    1     0    0    0    0    0
6        6,7,7,7,6,7,      0     0     0    0     0    1    0    0    0
7       6,8,8,8,8,12,      0     1     0    0     0    0    0    0    0
8        8,8,8,8,8,8,      0     0     0    0     0    0    0    0    1

From you comments, the two columns of the original a dataframe are not directly related but are in fact two independant Series, and you just want to know how many times a Final_Unique value occurs in a Actual_Data one.

This code should give it:

newdf = a[['Actual_Data']]
for col in set(a['Final_Unique'].dropna()):
    newdf[col] = newdf.Actual_Data.str.findall('(^|,)'+col+'(?=,)').apply(len)

The result is:

          Actual_Data  8,8  12,3  12,12  7,7  2,12  6,7  4,12  3,4  7,6
0        8,8,8,8,8,8,    3     0      0    0     0    0     0    0    0
1        6,7,7,7,6,7,    0     0      0    1     0    2     0    0    1
2     2,12,3,4,12,12,    0     1      1    0     2    0     1    1    0
3  14,14,14,14,14,14,    0     0      0    0     0    0     0    0    0
4       1,1,12,2,2,4,    0     0      0    0     0    0     0    0    0
5       6,8,8,8,8,12,    2     0      0    0     0    0     0    0    0
6       6,6,6,6,3,14,    0     0      0    0     0    0     0    0    0
7      1,14,14,1,1,2,    0     0      0    0     0    0     0    0    0
8       1,1,1,1,1,14,    0     0      0    0     0    0     0    0    0

The rationale behind the above code is:

  • compute the list of columns from the Final_Unique columns: the set will automatically remove duplicates (but will break order) and dropna() will remove empty values
  • create a new dataframe from the Actual_Data column.
  • add columns to that dataframe from the list computed at previous step and for each column
    • use str.findall and an appropriate regex to find non overlapping sequence containing the column name after a comma or at beginning of string and followed by a comma - that last comma not being included in the matched string to allow consecutive sequences to be correctly counted, hence the (?=...)
    • apply the len function to tranform that list in a simple count
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252
  • you didnt understand the question , i just want to get the final unique column to be header and then find 8,8 6,7 7,7 7,6 2,12(as they are in column now then in actual_data column i want to check how many time 8,8 6,7 7,7 7,6 2,12 appears traversing each row) – Quicklearner.gk Feb 17 '20 at 10:38
  • @Quicklearner: this is the reason why it is important to give some data with the associated output. As I could not imagine how the *expected* output was related to the sample data, I have just tried what I had understood... – Serge Ballesta Feb 17 '20 at 10:44
  • did you understand now ? – Quicklearner.gk Feb 17 '20 at 10:45
  • @Quicklearner: I have not understood more, and I am afraid I will not unless you show the **exact** output expected for the sample data... I would be glad to help you, but you should also help me to help you :-) – Serge Ballesta Feb 17 '20 at 10:48
  • 1,1,1,1,1,14, in this there no 12,12 still its mark 1 i have to find 12,12 appeared in this row or not , same for other check how many times 12,3 appear , 2,12 appeared in all the actual data column – Quicklearner.gk Feb 17 '20 at 10:56
  • Ok, I think I begin to understand. But how do you count only 1 `8,8` in `8,8,8,8,8,8,`? I would expect 3... – Serge Ballesta Feb 17 '20 at 11:05
  • i changed it i got it by mistake its 3 only – Quicklearner.gk Feb 17 '20 at 11:10
  • Serge Ballesta : is there anything you got – Quicklearner.gk Feb 17 '20 at 11:45
  • why i am not getting in actual order it were in final_unique column ? also can you explain what you did? – Quicklearner.gk Feb 17 '20 at 13:48
  • 14,14,14,14,14,14, 0 5 0 3 @Serge Ballesta its giving 5 for 1,4 which is wrong here , it should give only 3 for 14,14 – Quicklearner.gk Feb 17 '20 at 13:52
  • @Quicklearner: you have said that the `Final_Unique` column had only 266 values. But all columns in a dataframe have the exact same number of rows. So I assumed that the column could contain duplicates or NaN values and used a set to remove duplicates (but it also breaks order) and `dropna()` to remove NaN. If you had described you data in a non ambiguous way, I should not have to guess... – Serge Ballesta Feb 17 '20 at 13:55
  • For the line 14.14.14.14.14.14, could you share a data sample exhibiting the problem? I need a [mcve] to reproduce the problems. The SO rules goal is not to bore askers but to make sure that answers are relevant and addresses all the problems. – Serge Ballesta Feb 17 '20 at 14:01
  • I am really very sorry for this I don’t know stackoverflow well to write a gud question I apologise for this – Quicklearner.gk Feb 17 '20 at 14:13
  • @Quicklearner: That's the reason why I tell you it and give you those 2 links. You should also read [ask] if you have not before. That way your next question will be nicer... BTW we all used to be beginners :-) – Serge Ballesta Feb 17 '20 at 14:25
  • i have updated the output i am getting from the code & thanks alot for your help here – Serge Ballesta – Quicklearner.gk Feb 17 '20 at 14:38
  • @Quicklearner: I have reverted your edit because it did not make sense in the middle of my answer. But it was quite relevant and I think I have seen where the problem comes from. I should post an edit soon... – Serge Ballesta Feb 17 '20 at 14:54
  • @Quicklearner: I could reproduce and fix. Please see my edit. For the column order, if you are sure that no duplicates exist, just remove the `set`: `for col in a['Final_Unique'].dropna():` – Serge Ballesta Feb 17 '20 at 15:13
  • Thanks alot for the help also can you please explain what you did here, its working fine, i am new here how can get your help regularly :) – Serge Ballesta – Quicklearner.gk Feb 17 '20 at 15:31
  • I am getting an error while performing k- means clustering ok data set of can’t convert string to float or int – Quicklearner.gk Feb 19 '20 at 15:01
0

I would do the following:

import re
l1 = list(a['Actual_Data'])
l2 = list(a['Final_Unique'])
d = {i: [len(re.findall(i,j)) for j in l1] for i in l2}
newdf = pd.DataFrame(d,index=l1)

if you have an issue with data types:

import re
l1 = list(a['Actual_Data'])
l2 = list(a['Final_Unique'])
d = {i: [len(re.findall(str(i),str(j))) for j in l1] for i in l2}
newdf = pd.DataFrame(d,index=l1)

What I've done here is forced every value of Actual_Data and Final_Unqiue into a string.

Jim Eisenberg
  • 1,490
  • 1
  • 9
  • 17