I have two dataframes that store the same data but in different formats. The first is in a long format:
import pandas as pd
d1 = {'Chr': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
      'region': [4367323, 4367323, 12070292, 12070292, 3877874, 3877874, 7350493, 7350493, 34537179, 34537179,
                 82280224, 82280224],
      'db_SNP': ["rs1490413", "rs1490413", "rs730123", "rs730123", "rs1563172", "rs1563172", "rs4669155", "rs4669155",
                 "rs1009480", "rs1009480", "rs13087528", "rs13087528"],
      'alleles': ["A", "G", "A", "G", "T", "C", "A", "G", "T", "C", "C", "T"],
      'count': [20, 833, 20, 976, 259, 307, 0, 1290, 578, 18, 731, 21],
      'coverage': [855, 855, 1002, 1002, 569, 569, 1294, 1294, 599, 599, 755, 755],
      'frequency': [2.339181287, 97.4269005, 1.996007984, 97.40518962, 45.51845343, 53.9543058, 0, 99.69088099,
                    96.49415693, 3.005008347, 96.82119205, 2.781456954]
      }
df1 = pd.DataFrame(data=d1)
print(df1)
    Chr    region      db_SNP  alleles  count  coverage  frequency
0     1   4367323   rs1490413        A     20       855   2.339181
1     1   4367323   rs1490413        G    833       855  97.426901
2     1  12070292    rs730123        A     20      1002   1.996008
3     1  12070292    rs730123        G    976      1002  97.405190
4     2   3877874   rs1563172        T    259       569  45.518453
5     2   3877874   rs1563172        C    307       569  53.954306
6     2   7350493   rs4669155        A      0      1294   0.000000
7     2   7350493   rs4669155        G   1290      1294  99.690881
8     3  34537179   rs1009480        T    578       599  96.494157
9     3  34537179   rs1009480        C     18       599   3.005008
10    3  82280224  rs13087528        C    731       755  96.821192
11    3  82280224  rs13087528        T     21       755   2.781457
and the second is in a wide format:
d2 = {'Chr': [1, 1, 2, 2, 3, 3],
      'region': [4367323, 12070292, 3877874, 7350493, 34537179, 82280224],
      'db_SNP': ["rs1490413", "rs730123", "rs1563172", "rs4669155", "rs1009480", "rs13087528"],
      'alleles.1': ["A", "A", "T", "A", "T", "C"],
      'alleles.2': ["G", "G", "C", "G", "C", "T"],
      'count.1': [20, 20, 259, 0, 578, 731],
      'count.2': [833, 976, 307, 1290, 18, 21],
      'coverage': [855, 1002, 569, 1294, 599, 755],
      'frequency.1': [2.339181287, 1.996007984, 45.51845343, 0, 96.49415693, 96.82119205],
      'frequency.2': [97.4269005, 97.40518962, 53.9543058, 99.69088099, 3.005008347, 2.781456954]
      }
df2 = pd.DataFrame(data=d2)
print(df2)
   Chr    region      db_SNP  alleles.1  alleles.2  count.1  count.2  coverage  frequency.1  frequency.2
0    1   4367323   rs1490413          A          G       20      833       855     2.339181    97.426901
1    1  12070292    rs730123          A          G       20      976      1002     1.996008    97.405190
2    2   3877874   rs1563172          T          C      259      307       569    45.518453    53.954306
3    2   7350493   rs4669155          A          G        0     1290      1294     0.000000    99.690881
4    3  34537179   rs1009480          T          C      578       18       599    96.494157     3.005008
5    3  82280224  rs13087528          C          T      731       21       755    96.821192     2.781457
I need to convert between these two formats (long to wide and wide to long), preferably using pandas.
Can someone help, please?
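For reference, here is a minimal sketch of one way the two conversions might be done. It is not necessarily the best approach. It assumes that `Chr`, `region`, `db_SNP` and `coverage` together identify one SNP, and that the `.1`/`.2` suffix is just the order in which the alleles appear within each SNP. The names `long_df`, `wide_df`, `back`, `keys`, `values` and the helper column `n` are made up for illustration, and only three of the SNPs are used to keep it short.

```python
import pandas as pd

# Subset of the long-format data (three SNPs, two alleles each).
long_df = pd.DataFrame({
    'Chr': [1, 1, 1, 1, 2, 2],
    'region': [4367323, 4367323, 12070292, 12070292, 3877874, 3877874],
    'db_SNP': ['rs1490413', 'rs1490413', 'rs730123', 'rs730123',
               'rs1563172', 'rs1563172'],
    'alleles': ['A', 'G', 'A', 'G', 'T', 'C'],
    'count': [20, 833, 20, 976, 259, 307],
    'coverage': [855, 855, 1002, 1002, 569, 569],
    'frequency': [2.339181287, 97.4269005, 1.996007984, 97.40518962,
                  45.51845343, 53.9543058],
})

keys = ['Chr', 'region', 'db_SNP', 'coverage']   # assumed: constant per SNP
values = ['alleles', 'count', 'frequency']       # assumed: vary per allele

# Long -> wide: number the alleles within each SNP (1, 2, ...),
# move that number into the columns, then flatten to 'name.n'.
numbered = long_df.assign(n=long_df.groupby(keys).cumcount() + 1)
wide_df = numbered.set_index(keys + ['n'])[values].unstack('n')
wide_df.columns = [f'{name}.{n}' for name, n in wide_df.columns]
wide_df = wide_df.reset_index()

# Wide -> long: wide_to_long splits 'name.n' columns back into
# one 'name' column per stub plus the suffix column 'n'.
back = (
    pd.wide_to_long(wide_df, stubnames=values, i=keys, j='n', sep='.')
    .reset_index()
    .sort_values(['Chr', 'region', 'n'])
    .reset_index(drop=True)
    [['Chr', 'region', 'db_SNP', 'alleles', 'count', 'coverage', 'frequency']]
)

print(wide_df)
print(back)
```

Note that in this sketch `coverage` ends up before the `alleles.*` columns rather than between `count.2` and `frequency.1`, so the wide columns would still need reordering to match `df2` exactly. Also, `unstack` sorts the rows by the key columns, which happens to match the original order here but may not in general.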