I'm extracting information from PDF-file into a string. When coming across text that are structures in the pdf as tables the extracted text is then divided into the way the reader comes across the line and not cell by cell in the table row.
After reading and searching for hours I would like to get some tips on how should i approach this problem to get the string structured in the way shown bellow?
Current string:
Difenylmetandiisocyanat 9016-87-9 Acute Tox. 4; H332 >= 10 - < 20
Skin Irrit. 2; H315
Eye Irrit. 2; H319
Resp. Sens. 1; H334
Skin Sens. 1; H317
Carc. 2; H351
STOT SE 3; H335
STOT RE 2; H373
4,4'-metylendifenyldiisocyanat 101-68-8 Acute Tox. 4; H332 >= 10 - < 20
202-966-0 Skin Irrit. 2; H315
Eye Irrit. 2; H319
Resp. Sens. 1; H334
Skin Sens. 1; H317
Carc. 2; H351
STOT SE 3; H335
STOT RE 2; H373
Desired structure:
Difenylmetandiisocyanat
9016-87-9
Acute Tox. 4; H332
Skin Irrit. 2; H315
Eye Irrit. 2; H319
Resp. Sens. 1; H334
Skin Sens. 1; H317
Carc. 2; H351
STOT SE 3; H335
STOT RE 2; H373
>= 10 - < 20
4,4'-metylendifenyldiisocyanat
101-68-8
202-966-0
Acute Tox. 4; H332
Skin Irrit. 2; H315
Eye Irrit. 2; H319
Resp. Sens. 1; H334
Skin Sens. 1; H317
Carc. 2; H351
STOT SE 3; H335
STOT RE 2; H373
>= 10 - < 20