2

I have dataframe

    used_at  common users                     pair of websites
0      2014          1364                   avito.ru and e1.ru
1      2014          1716                 avito.ru and drom.ru
2      2014          1602                 avito.ru and auto.ru
3      2014           299           avito.ru and avtomarket.ru
4      2014           579                   avito.ru and am.ru
5      2014           602             avito.ru and irr.ru/cars
6      2014           424       avito.ru and cars.mail.ru/sale
7      2014           634                    e1.ru and drom.ru
8      2014           475                    e1.ru and auto.ru
9      2014           139              e1.ru and avtomarket.ru
10     2014           224                      e1.ru and am.ru
11     2014           235                e1.ru and irr.ru/cars
12     2014           154          e1.ru and cars.mail.ru/sale
13     2014           874                  drom.ru and auto.ru
14     2014           247            drom.ru and avtomarket.ru
15     2014           394                    drom.ru and am.ru
....

When I write graph_by_common_users = common_users.pivot(index='pair of websites', columns='used_at', values='common users') I get

used_at                                2014    2015
pair of websites                                   
am.ru and cars.mail.ru/sale           166.0     NaN
am.ru and irr.ru/cars                 223.0     NaN
auto.ru and am.ru                     408.0   224.0
auto.ru and avtomarket.ru             243.0   162.0
auto.ru and cars.mail.ru/sale         330.0   195.0
auto.ru and drom.ru                     NaN   799.0
auto.ru and irr.ru/cars               409.0   288.0
avito.ru and am.ru                    579.0   262.0
....

And I have NaN because some of sequence are different. For example I have for 2014 I have am.ru and cars.mail.ru/sale but to 2015 I have cars.mail.ru/sale and am.ru. How can I change that?

Add my code

import pandas as pd
import itertools
import matplotlib.pyplot as plt

df = pd.read_csv("avito_trend.csv", parse_dates=[2])


def f(df):
    dfs = []
    for x in [list(x) for x in itertools.combinations(df['address'].unique(), 2)]:

        c1 = df.loc[df['address'].isin([x[0]]), 'ID']
        c2 = df.loc[df['address'].isin([x[1]]), 'ID']
        c = pd.Series(list(set(c1).intersection(set(c2))))
        dfs.append(pd.DataFrame({'common users':len(c), 'pair of websites':' and '.join(x)}, index=[0]))
    return pd.concat(dfs)

common_users = df.groupby([df['used_at'].dt.year]).apply(f).reset_index(drop=True, level=1).reset_index()
print common_users

graph_by_common_users = common_users.pivot(index='pair of websites', columns='used_at', values='common users')
print graph_by_common_users

Picture with graph

NineWasps
  • 2,081
  • 8
  • 28
  • 45
  • Please clarify what do you want to have after pivoting instead of NaNs? – MaxU - stand with Ukraine Mar 28 '16 at 12:28
  • I have `am.ru and cars.mail.ru/sale ` to `2014` and `cars.mail.ru/sale and am.ru ` for `2015`. When I try print graph `rects = ax.patches labels = [int(round(graph_by_common_users.loc[i, y])) for y in graph_by_common_users.columns.tolist() for i in graph_by_common_users.index] for rect, label in zip(rects, labels): height = rect.get_height() ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom') ` I have an error, because I have `NaN`. So I want to reverse some string, for example `am.ru and cars.mail.ru/sale` to `2014` and `2015` – NineWasps Mar 28 '16 at 12:32
  • Was my explanation clear? – NineWasps Mar 28 '16 at 12:39

2 Answers2

2

Maybe before pivoting, try splitting on " and " and then sorting so every column is in the same order:

df['pair of websites'] = df['pair of websites'].str.split(' and ')
df['pair of websites'] = df['pair of websites'].apply(lambda x: frozenset(sorted(x)))

Seems like that should work as long as theres the same amount of whitespace in the " and " part for each entry. If not, you may have to use str.strip() as well.

Greg Friedman
  • 341
  • 4
  • 11
  • It doesn't change anything – NineWasps Mar 27 '16 at 21:02
  • @ldevyataykina It should have if you did that before pivoting. If not, I'd be curious to see the output so that I can see what went wrong. – Greg Friedman Mar 27 '16 at 21:20
  • The first string return `TypeError: unhashable type: 'list'` – NineWasps Mar 27 '16 at 21:23
  • @ldevyataykina My solution is edited to address that issue. Please let me know if it works now. – Greg Friedman Mar 27 '16 at 22:14
  • the first string works only in loop. but second return `AttributeError: 'list' object has no attribute 'apply'`. my code is `common_users = df.groupby([df['used_at'].dt.year]).apply(f).reset_index(drop=True, level=1).reset_index() for pairs in common_users['pair of websites']: pair = pairs.split(' and ') print pair.apply(lambda x: frozenset(sorted(x))) ` – NineWasps Mar 27 '16 at 22:23
  • @ldevyataykina If you're iterating, then you don't use the apply method. The way you're doing it is a bit strange, but if you want to do what you're doing, then change `print pair.apply(lambda x: frozenset(sorted(x))` to `print frozenset(sorted(pair))` – Greg Friedman Mar 27 '16 at 22:49
  • I can't do this without iterating. your decision doesn't work. `df['pair of websites'] = df['pair of websites'].str.split(' and ')` it returns error. Maybe file can help and my code [file](https://yadi.sk/d/A_9pykgVqJ6Ky) – NineWasps Mar 27 '16 at 23:11
  • `import pandas as pd import itertools df = pd.read_csv("avito_trend.csv", parse_dates=[2]) def f(df): dfs = [] for x in [list(x) for x in itertools.combinations(df['address'].unique(), 2)]: c1 = df.loc[df['address'].isin([x[0]]), 'ID'] c2 = df.loc[df['address'].isin([x[1]]), 'ID'] c = pd.Series(list(set(c1).intersection(set(c2)))) dfs.append(pd.DataFrame({'common users':len(c), 'pair ofwebsites':' and '.join(x)}, index=[0])) return pd.concat(dfs) common_users =df.groupby([df['used_at'].dt.year]).apply(f).reset_index(drop=True, level=1).reset_index()` – NineWasps Mar 27 '16 at 23:14
  • @Idevyataykina just change `c = pd.Series(list(set(c1).intersection(set(c2))))` to `c = pd.Series(sorted(list(set(c1).intersection(set(c2))))) ` – Greg Friedman Mar 28 '16 at 11:50
  • It doesn't change anything( – NineWasps Mar 28 '16 at 12:21
1

After testing I add inverted combinations c_invert, because some values were missing after pivot. Now there are all combination and pivot works very well:

df = pd.read_csv("avito_trend.csv", 
                      parse_dates=[2])


def f(df):
    dfs = []
    for x in [list(x) for x in itertools.combinations(df['address'].unique(), 2)]:

        c1 = df.loc[df['address'].isin([x[0]]), 'ID']
        c2 = df.loc[df['address'].isin([x[1]]), 'ID']
        c = pd.Series(list(set(c1).intersection(set(c2))))
        #add inverted intersection c2 vs c1
        c_invert = pd.Series(list(set(c2).intersection(set(c1))))
        dfs.append(pd.DataFrame({'common users':len(c), 'pair of websites':' and '.join(x)}, index=[0]))
        #swap values in x
        x[1],x[0] = x[0],x[1]
        dfs.append(pd.DataFrame({'common users':len(c_invert), 'pair of websites':' and '.join(x)}, index=[0]))
    return pd.concat(dfs)

common_users = df.groupby([df['used_at'].dt.year]).apply(f).reset_index(drop=True, level=1).reset_index()
print common_users.pivot(index='pair of websites', columns='used_at', values='common users')
used_at                              2014  2015
pair of websites                               
am.ru and auto.ru                     408   224
am.ru and avito.ru                    579   262
am.ru and avtomarket.ru               133    72
am.ru and cars.mail.ru/sale           166    73
am.ru and drom.ru                     394   187
am.ru and e1.ru                       224    99
am.ru and irr.ru/cars                 223   102
auto.ru and am.ru                     408   224
auto.ru and avito.ru                 1602  1473
auto.ru and avtomarket.ru             243   162
auto.ru and cars.mail.ru/sale         330   195
auto.ru and drom.ru                   874   799
auto.ru and e1.ru                     475   451
auto.ru and irr.ru/cars               409   288
avito.ru and am.ru                    579   262
avito.ru and auto.ru                 1602  1473
avito.ru and avtomarket.ru            299   205
avito.ru and cars.mail.ru/sale        424   256
avito.ru and drom.ru                 1716  1491
avito.ru and e1.ru                   1364  1153
avito.ru and irr.ru/cars              602   403
avtomarket.ru and am.ru               133    72
avtomarket.ru and auto.ru             243   162
avtomarket.ru and avito.ru            299   205
avtomarket.ru and cars.mail.ru/sale   105    48
avtomarket.ru and drom.ru             247   175
avtomarket.ru and e1.ru               139   105
avtomarket.ru and irr.ru/cars         139    73
cars.mail.ru/sale and am.ru           166    73
cars.mail.ru/sale and auto.ru         330   195
cars.mail.ru/sale and avito.ru        424   256
cars.mail.ru/sale and avtomarket.ru   105    48
cars.mail.ru/sale and drom.ru         292   189
cars.mail.ru/sale and e1.ru           154   105
cars.mail.ru/sale and irr.ru/cars     197    94
drom.ru and am.ru                     394   187
drom.ru and auto.ru                   874   799
drom.ru and avito.ru                 1716  1491
drom.ru and avtomarket.ru             247   175
drom.ru and cars.mail.ru/sale         292   189
drom.ru and e1.ru                     634   539
drom.ru and irr.ru/cars               423   277
e1.ru and am.ru                       224    99
e1.ru and auto.ru                     475   451
e1.ru and avito.ru                   1364  1153
e1.ru and avtomarket.ru               139   105
e1.ru and cars.mail.ru/sale           154   105
e1.ru and drom.ru                     634   539
e1.ru and irr.ru/cars                 235   148
irr.ru/cars and am.ru                 223   102
irr.ru/cars and auto.ru               409   288
irr.ru/cars and avito.ru              602   403
irr.ru/cars and avtomarket.ru         139    73
irr.ru/cars and cars.mail.ru/sale     197    94
irr.ru/cars and drom.ru               423   277
irr.ru/cars and e1.ru                 235   148

If you need graph:

graph_by_common_users = common_users.pivot(index='pair of websites', columns='used_at', values='common users')
#sort by column 2014
graph_by_common_users = graph_by_common_users.sort_values(2014, ascending=False)



ax = graph_by_common_users.plot(kind='barh', width=0.5, figsize=(10,20))
[label.set_rotation(25) for label in ax.get_xticklabels()]


rects = ax.patches 
labels = [int(round(graph_by_common_users.loc[i, y])) for y in graph_by_common_users.columns.tolist() for i in graph_by_common_users.index] 
for rect, label in zip(rects, labels): 
    height = rect.get_height() 
    ax.text(rect.get_width() + 3, rect.get_y() + rect.get_height(), label, fontsize=8) 
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Glad can help you. :) Nice day. – jezrael Mar 28 '16 at 12:48
  • I want to print graph with mean but it doesn't work `rects = ax.patches labels = [int(round(graph_by_common_users.loc[i, y])) for y in graph_by_common_users.columns.tolist() for i in graph_by_common_users.index] for rect, label in zip(rects, labels): height = rect.get_height() ax.text(rect.get_width() + 3, rect.get_y() + rect.get_height(), label, fontsize=8) ` What am I doing wrong? – NineWasps Mar 28 '16 at 13:40
  • How do you count `mean`? – jezrael Mar 28 '16 at 13:49
  • I want to print mean of column `b`. like this [link](https://silvrback.s3.amazonaws.com/uploads/ae516a51-06ae-400d-a93d-5c2ba3d3590f/delays_large.png) – NineWasps Mar 28 '16 at 13:55
  • So I think you can add new question, because I dont know how you want to count `mean`? What is desired output of this input? Because `mean` is one scalar value from one column (one Serie). Please explain it better. – jezrael Mar 28 '16 at 14:02
  • I don't want to count mean. I want to print `b` to every column. For example opposite column `am.ru and auto.ru ` print `408` – NineWasps Mar 28 '16 at 14:11
  • Ok, I try http://stackoverflow.com/q/36111556/2901002 and there is only one problem - many values, so graph is no nice. `ax = graph_by_common_users.plot(kind='bar', width=0.5) [label.set_rotation(25) for label in ax.get_xticklabels()] rects = ax.patches labels = [int(round(graph_by_common_users.loc[i, y])) for y in graph_by_common_users.columns.tolist() for i in graph_by_common_users.index] for rect, label in zip(rects, labels): height = rect.get_height() ax.text(rect.get_width() + 3, rect.get_y() + rect.get_height(), label, fontsize=8) ` – jezrael Mar 28 '16 at 14:18
  • and I forget add before code above: `graph_by_common_users = common_users.pivot(index='pair of websites', columns='used_at', values='common users')` – jezrael Mar 28 '16 at 14:19
  • It starts from `[` `[label.set_rotation(25)` ? Can you add code to you answer? – NineWasps Mar 28 '16 at 14:26
  • and can you make descending ? – NineWasps Mar 28 '16 at 14:43
  • Is it real to sort that by `2014` and `2015` at the same time? And do it from max to min? – NineWasps Mar 28 '16 at 14:56
  • It is impossible sort both, you can sort by `2014` as in picture or by `2015`. – jezrael Mar 28 '16 at 14:57
  • Understand. But how can do it from max to min? – NineWasps Mar 28 '16 at 14:58
  • is starts from min and ends with max) top small values ​​and large at the bottom. but should on the contrary – NineWasps Mar 28 '16 at 15:04
  • Yes, understand. Remove `ascending=False` or change it `ascending=True` – jezrael Mar 28 '16 at 15:08
  • When I print `common users` I get all dublicate strings. `used_at common users pair of websites 0 2014 1364 avito.ru and e1.ru 1 2014 1364 e1.ru and avito.ru 2 2014 1716 avito.ru and drom.ru 3 2014 1716 drom.ru and avito.ru 4 2014 1602 avito.ru and auto.ru 5 2014 1602 auto.ru and avito.ru 6 2014 299 avito.ru and avtomarket.ru` How can I fix that? – NineWasps Mar 29 '16 at 10:52
  • You can use `print common_users.groupby(common_users.index / 2).first()` or `print common_users.groupby(common_users.index / 2).last()` – jezrael Mar 29 '16 at 11:01
  • And how can I organize , to displayed that pair like the first I print pair `avito.ru and e1.ru` and her result to `2014`, and next `2015`. Like `0 auto.ru -> am.ru 2014 103 1 auto.ru -> am.ru 2015 135 2 avito.ru -> am.ru 2014 133 3 avito.ru -> am.ru 2015 31 4 avtomarket.ru -> am.ru 2014 14 5 avtomarket.ru -> am.ru 2015 6` – NineWasps Mar 29 '16 at 11:09
  • Maybe `print common_users.groupby(common_users.index / 2).last().sort_values('pair of websites')` – jezrael Mar 29 '16 at 11:18
  • `19 2014 408 am.ru and auto.ru 45 2015 224 am.ru and auto.ru 34 2015 262 am.ru and avito.ru 4 2014 579 am.ru and avito.ru 22 2014 133 am.ru and avtomarket.ru` How to do the first is `2014` and second `2015`? It mixed – NineWasps Mar 29 '16 at 11:27
  • Why this graph print not all values? Some values to `pairs` not printed – NineWasps Mar 30 '16 at 13:06
  • What values dont print? – jezrael Mar 30 '16 at 13:08
  • I add this picture with graph to my question – NineWasps Mar 30 '16 at 13:11
  • for some pairs don't print value to `2014` or `2015` year – NineWasps Mar 30 '16 at 13:12
  • Maybe you can change parameter figsize, but for me it works very well. – jezrael Mar 30 '16 at 13:23
  • All pairs at the graph are duplicated. How can print in graph unique pairs? – NineWasps Mar 30 '16 at 13:42
  • Ok, I found solution. Can you create new question about sorting and plotting with your account `ldevyataykina` ? Because if I add my solution to your new question from user `user116873`, nobody know why I use my code, because my code is connected with this question. And why `print common_users.groupby(common_users.index / 2).first()` dont work - I dont know. – jezrael Mar 30 '16 at 14:33
  • I can't. creating issues limit is exceeded( – NineWasps Mar 30 '16 at 14:53
  • Phuuuu, what limit? limit to your account `ldevyataykina` ? – jezrael Mar 30 '16 at 14:54
  • Hmmm, maybe you can then change this question - http://stackoverflow.com/questions/36112425/selecting-columns-by-pandas – jezrael Mar 30 '16 at 14:56
  • Okey. Just a minute, please – NineWasps Mar 30 '16 at 14:59
  • Phhhu, but maybe user with his answer will be angry, maybe you can explain, why you change question (`creating issues limit is exceeded`) and dont forget add your graph to question - http://i.stack.imgur.com/lPoME.png – jezrael Mar 30 '16 at 15:00
  • No problem. And if you want, I can edit your new question - maybe then will be better and dont get downvote. – jezrael Mar 30 '16 at 15:09
  • http://stackoverflow.com/questions/36112425/sort-strings-by-pandas-and-print-that-to-graph I edit that. But I think it's not good – NineWasps Mar 30 '16 at 15:10