I have two dataframes that store the same data but in different formats. The first is in a long format:

import pandas as pd

d1 = {'Chr': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
      'region': [4367323, 4367323, 12070292, 12070292, 3877874, 3877874, 7350493, 7350493, 34537179, 34537179,
                 82280224, 82280224],
      'db_SNP': ["rs1490413", "rs1490413", "rs730123", "rs730123", "rs1563172", "rs1563172", "rs4669155", "rs4669155",
                 "rs1009480", "rs1009480", "rs13087528", "rs13087528"],
      'alleles': ["A", "G", "A", "G", "T", "C", "A", "G", "T", "C", "C", "T"],
      'count': [20, 833, 20, 976, 259, 307, 0, 1290, 578, 18, 731, 21],
      'coverage': [855, 855, 1002, 1002, 569, 569, 1294, 1294, 599, 599, 755, 755],
      'frequency': [2.339181287, 97.4269005, 1.996007984, 97.40518962, 45.51845343, 53.9543058, 0, 99.69088099,
                    96.49415693, 3.005008347, 96.82119205, 2.781456954]
      }

df1 = pd.DataFrame(data=d1)

print(df1)

    Chr    region      db_SNP alleles  count  coverage  frequency
0     1   4367323   rs1490413       A     20       855   2.339181
1     1   4367323   rs1490413       G    833       855  97.426901
2     1  12070292    rs730123       A     20      1002   1.996008
3     1  12070292    rs730123       G    976      1002  97.405190
4     2   3877874   rs1563172       T    259       569  45.518453
5     2   3877874   rs1563172       C    307       569  53.954306
6     2   7350493   rs4669155       A      0      1294   0.000000
7     2   7350493   rs4669155       G   1290      1294  99.690881
8     3  34537179   rs1009480       T    578       599  96.494157
9     3  34537179   rs1009480       C     18       599   3.005008
10    3  82280224  rs13087528       C    731       755  96.821192
11    3  82280224  rs13087528       T     21       755   2.781457

and the second is in a wide format:

d2 = {'Chr': [1, 1, 2, 2, 3, 3],
      'region': [4367323, 12070292, 3877874, 7350493, 34537179, 82280224],
      'db_SNP': ["rs1490413", "rs730123", "rs1563172", "rs4669155", "rs1009480", "rs13087528"],
      'alleles.1': ["A", "A", "T", "A", "T", "C"],
      'alleles.2': ["G", "G", "C", "G", "C", "T"],
      'count.1': [20, 20, 259, 0, 578, 731],
      'count.2': [833, 976, 307, 1290, 18, 21],
      'coverage': [855, 1002, 569, 1294, 599, 755],
      'frequency.1': [2.339181287, 1.996007984, 45.51845343, 0, 96.49415693, 96.82119205],
      'frequency.2': [97.4269005, 97.40518962,  53.9543058, 99.69088099, 3.005008347, 2.781456954]
      }

df2 = pd.DataFrame(data=d2)

print(df2)
   Chr    region      db_SNP alleles.1 alleles.2  count.1  count.2  coverage  frequency.1  frequency.2
0    1   4367323   rs1490413         A         G       20      833       855     2.339181    97.426901
1    1  12070292    rs730123         A         G       20      976      1002     1.996008    97.405190
2    2   3877874   rs1563172         T         C      259      307       569    45.518453    53.954306
3    2   7350493   rs4669155         A         G        0     1290      1294     0.000000    99.690881
4    3  34537179   rs1009480         T         C      578       18       599    96.494157     3.005008
5    3  82280224  rs13087528         C         T      731       21       755    96.821192     2.781457

I need to convert between these two formats, preferably with pandas.
Can someone help, please?
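
For reference, here is a minimal sketch of one way the two conversions might be done. The helper names `long_to_wide` and `wide_to_long`, the temporary column `n`, and the `KEYS`/`STUBS` lists are illustrative choices, not part of the original data. It assumes that `Chr`, `region`, `db_SNP` and `coverage` together identify one SNP, that alleles are numbered in the order they appear within each SNP, and that sorting the result by `Chr` and `region` is acceptable. The sample uses only the first two SNPs from the data above.

```python
import pandas as pd

# Columns that identify one SNP (assumption); the rest are per-allele values.
KEYS = ["Chr", "region", "db_SNP", "coverage"]
STUBS = ["alleles", "count", "frequency"]


def long_to_wide(df_long):
    """One row per allele -> one row per SNP with '.1', '.2' suffixed columns."""
    df = df_long.copy()
    # Number the alleles within each SNP in order of appearance: 1, 2, ...
    df["n"] = df.groupby(KEYS).cumcount() + 1
    # Move the allele number from the row index into the columns.
    wide = df.set_index(KEYS + ["n"])[STUBS].unstack("n")
    # Flatten the two-level column names, e.g. ("count", 1) -> "count.1".
    wide.columns = [f"{stub}.{n}" for stub, n in wide.columns]
    return wide.reset_index()


def wide_to_long(df_wide):
    """One row per SNP -> one row per allele."""
    long = pd.wide_to_long(df_wide, stubnames=STUBS, i=KEYS, j="n", sep=".")
    # Restore a flat frame ordered by SNP, then by allele number.
    long = long.reset_index().sort_values(["Chr", "region", "n"])
    return long.drop(columns="n").reset_index(drop=True)


# First two SNPs from the long-format data in the question.
sample_long = pd.DataFrame({
    "Chr": [1, 1, 1, 1],
    "region": [4367323, 4367323, 12070292, 12070292],
    "db_SNP": ["rs1490413", "rs1490413", "rs730123", "rs730123"],
    "alleles": ["A", "G", "A", "G"],
    "count": [20, 833, 20, 976],
    "coverage": [855, 855, 1002, 1002],
    "frequency": [2.339181287, 97.4269005, 1.996007984, 97.40518962],
})

sample_wide = long_to_wide(sample_long)
print(sample_wide)
print(wide_to_long(sample_wide)[sample_long.columns])
```

The column order of the results differs from `df1` and `df2` (for example, `coverage` ends up next to the other key columns), so reindex the columns, as in the last line, if the exact layout matters.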

Milos