I have written code that extracts the highlighted text from PDFs into a list in Python. The list I get from this code is:
list = ['Holding / Market Nominal Value % of Net Value Investment £’000, Assets UNITED KINGDOM: 81.05,% (79.18,%) (continued) Support Services: 2.40,% (1.82,%) 1, 263, 826, DWF 1, 340, 0.84, 275 , 698, Equiniti 491, 0.31, 112, 248, Inchcape 947, 0.59, 1, 573, 663, Speedy Hire 1, 054, 0.66, 3, 832, 2.40, Tobacco: 4.04,% (4.90,%) 56, 365, British American Tobacco 1, 541, 0.96, 318, 088, Imperial Brands 4, 906, 3.08, 6, 447, 4.04, Travel & Leisure: 0.99,% (0.47,%) 470, 000, Mitchells & Butlers 1, 332, 0.84, 92, 594, National Express 245, 0.15, 1, 577, 0.99, Futures: (0.03,%) ((0.04,%)) 48, FTSE 100, Index Future Expiry September 2021, (47,) (0.03,) Portfolio of investments* 154, 700, 97.02, Net other assets 4, 745, 2.98, Net assets 159, 445, 100.00, ']
I am attaching an image of the output list from the PDF highlight, just to give you an idea.
As soon as I create a DataFrame from this list, I lose a lot of the values. This is how my DataFrame is created:
import pandas as pd

for i in range(len(list)):
    # Split the extracted string on every comma; each piece becomes one row
    info = list[i].split(',')
    df = pd.DataFrame(info)

print(df.head(10))
print(df.shape)
This gives me the following output:
0
0 Holding / Market Nominal Value % of Net Value ...
1 Assets UNITED KINGDOM: 81.05
2 % (79.18
3 %) (continued) Support Services: 2.40
4 % (1.82
5 %) 1
6 263
7 826
8 DWF 1
9 340
(74, 1)
This is incorrect because data is lost. How do I create a DataFrame that looks exactly the same as the image provided above? Please help me out, as I have not been able to find a solution and have tried everything I could think of to make it work.
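To narrow the problem down, here is a smaller reproduction. It assumes that the shortened `sample` string below, copied from the middle of the list above, is representative of the full text; the names `sample` and `parts` are only for this illustration. It suggests that the extracted text also uses commas inside the numbers themselves (for example `1, 263, 826`), so splitting on `','` breaks a single holding into several pieces, and each piece would become its own DataFrame row.

```python
# Shortened sample copied from the extracted string above
# (assumption: it is representative of the full text).
sample = "Support Services: 2.40,% (1.82,%) 1, 263, 826, DWF 1, 340, 0.84, 275 , 698, Equiniti 491, 0.31"

# Same approach as in my code: split on every comma.
parts = sample.split(',')

# Print each piece with its position; each one would be a separate row.
for position, piece in enumerate(parts):
    print(position, repr(piece))

# Number of pieces produced from what are only two holdings (DWF and Equiniti).
print(len(parts))
```

The holding `1, 263, 826` for DWF ends up spread over three pieces, and the company name is attached to the first part of the next number (`' DWF 1'`), which appears to be the same pattern as in the 74-row output above.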