I have written code that extracts the highlighted text from PDFs into a list in Python. The list I get from this code is:

list = ['Holding / Market Nominal Value % of Net Value Investment £’000, Assets UNITED KINGDOM: 81.05,% (79.18,%) (continued) Support Services: 2.40,% (1.82,%) 1, 263, 826, DWF 1, 340, 0.84, 275 , 698, Equiniti 491, 0.31, 112, 248, Inchcape 947, 0.59, 1, 573, 663, Speedy Hire 1, 054, 0.66, 3, 832, 2.40, Tobacco: 4.04,% (4.90,%) 56, 365, British American Tobacco 1, 541, 0.96, 318, 088, Imperial Brands 4, 906, 3.08, 6, 447, 4.04, Travel & Leisure: 0.99,% (0.47,%) 470, 000, Mitchells & Butlers 1, 332, 0.84, 92, 594, National Express 245, 0.15, 1, 577, 0.99, Futures: (0.03,%) ((0.04,%)) 48, FTSE 100, Index Future Expiry September 2021, (47,) (0.03,) Portfolio of investments* 154, 700, 97.02, Net other assets 4, 745, 2.98, Net assets 159, 445, 100.00, ']

I am attaching an image of the highlighted section of the PDF, just to give you an idea of the layout.

As soon as I create a DataFrame from this list, I lose a lot of values. This is how my DataFrame is created:

    import pandas as pd

    # Split each extracted string on commas and load the pieces into a one-column DataFrame.
    for text in list:
        info = text.split(',')
        df = pd.DataFrame(info)
        print(df.head(10))
        print(df.shape)

which gives me output like this:

                                                   0
0  Holding / Market Nominal Value % of Net Value ...
1                       Assets UNITED KINGDOM: 81.05
2                                           % (79.18
3              %) (continued) Support Services: 2.40
4                                            % (1.82
5                                               %) 1
6                                                263
7                                                826
8                                              DWF 1
9                                                340
(74, 1)

which is incorrect because data is lost. How do I create a DataFrame that looks exactly like the table in the image above? Please help; I have tried everything I can think of and have not found a solution.
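Part of the loss seems to come from the fact that `split(',')` also cuts at the commas used as thousands separators, so `1, 263, 826` ends up as three separate cells. Below is a minimal sketch of one possible workaround, not a definitive solution: it matches whole rows with a regular expression instead of splitting on every comma. It assumes each row has the layout holding, name, market value, percentage, and it runs only on a short excerpt of the list above. The names `GROUPED`, `ROW`, and `to_int` are hypothetical helpers introduced for illustration; section headers, subtotals, and the futures rows are not handled.

```python
import re

# Short excerpt of the extracted text (the "Support Services" rows only).
text = ("1, 263, 826, DWF 1, 340, 0.84, 275 , 698, Equiniti 491, 0.31, "
        "112, 248, Inchcape 947, 0.59, 1, 573, 663, Speedy Hire 1, 054, 0.66, ")

# A number whose thousands groups are separated by ", " (e.g. "1, 263, 826").
GROUPED = r"\d{1,3}(?:\s*,\s*\d{3})*"

# Assumed row layout: holding, name, market value, percentage of net assets.
ROW = re.compile(
    rf"(?P<holding>{GROUPED})\s*,\s*"
    rf"(?P<name>[A-Za-z&][A-Za-z& ]*?)\s+"
    rf"(?P<value>{GROUPED})\s*,\s*"
    rf"(?P<pct>\d+\.\d+)"
)

def to_int(grouped):
    """Turn a grouped number such as '1, 263, 826' into the integer 1263826."""
    return int(re.sub(r"[\s,]", "", grouped))

# One tuple per matched row: (holding, name, value, pct).
rows = [
    (to_int(m["holding"]), m["name"], to_int(m["value"]), float(m["pct"]))
    for m in ROW.finditer(text)
]

for row in rows:
    print(row)
```

The resulting `rows` list could then be passed to `pd.DataFrame(rows, columns=["Holding", "Investment", "Market Value", "% of Net Assets"])`; the column names here are guesses based on the header text in the list.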

technophile_3
  • There are [better](https://stackoverflow.com/questions/10300786/pdf-table-extraction/21434499) ways to extract a table from a pdf. Your approach will waste too much time recreating a table from an unstructured list. – RJ Adriaansen Jan 06 '22 at 07:30
  • @RJAdriaansen can tabula extract ONLY highlighted tables from PDF? I haven't encountered anything of that sort – technophile_3 Jan 06 '22 at 07:45
