Suppose I have the following text:
doc = '\ufeff6.\tA B C\n\n6.1\tStuff1\n\n6.2\tStuff2\n\n(a)\tSubstuff1\n\n(b)\tSubstudff2\nblabla\n\n6.3\tbla bla\n\nbla bla\n\n6.4\thola hola\n\n\n\n7\n\x0c7.\tX Y Z\n\n7.1\tbla bla bla.\n\n7.2\tbla bla bla.\n\n7.3\tbla bla 1\n\n7.4\tbla bla bla \n\nand more bla bla bla\n\n7.5\tstuff\n\n8.\tMNO\n\n8.1\tbla bla \n\n(a)\tbla bla;\n\n(b)\tbla bla,\n\n8\n\x0cExtra\n\n(c)\tExtra1\n\n8.2\tExtra2\n'
which, when printed, looks like:
6. A B C
6.1 Stuff1
6.2 Stuff2
(a) Substuff1
(b) Substudff2
blabla
6.3 bla bla
bla bla
6.4 hola hola
7
7. X Y Z
7.1 bla bla bla.
7.2 bla bla bla.
7.3 bla bla 1
7.4 bla bla bla
and more bla bla bla
7.5 stuff
8. MNO
8.1 bla bla
(a) bla bla;
(b) bla bla,
8
Extra
(c) Extra1
8.2 Extra2
The idea is to pick out section 7. However, since there are many newlines, tabs, and special characters such as \x0c and \ufeff, I decided to clean those up as follows:
import re

d1 = doc.replace("\n", "")
d2 = d1.replace("\n\n", "")  # no effect: d1 no longer contains any "\n"
d3 = d2.replace("\t", "")
d4 = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', d3)
d5 = d4.replace("\ufeff", "")
So now d5 looks like:
'6.A B C6.1Stuff16.2Stuff2(a)Substuff1(b)Substudff2blabla6.3bla blabla bla6.4hola hola77.X Y Z7.1bla bla bla.7.2bla bla bla.7.3bla bla 17.4bla bla bla and more bla bla bla7.5stuff8.MNO8.1bla bla (a)bla bla;(b)bla bla,8Extra(c)Extra18.2Extra2'
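The chain of replacements above can probably be folded into a single substitution. This is only a minimal sketch, assuming the goal is to remove exactly the same characters as the d1 to d5 steps (newline, tab, the listed control and \x7f-\xff ranges, and the BOM). The names `clean` and `sample` are made up for this illustration, and `sample` is a shortened stand-in for the real text:

```python
import re

# Shortened stand-in for the original text (same kinds of special characters).
sample = '\ufeff6.\tA B C\n\n6.1\tStuff1\n\n7\n\x0c7.\tX Y Z\n'

def clean(text):
    # One character class covering everything the five steps remove:
    # \x00-\x0c includes tab (\x09), newline (\x0a) and form feed (\x0c);
    # \x0d (carriage return) is left alone, as in the original chain.
    return re.sub(r'[\x00-\x0c\x0e-\x1f\x7f-\xff\ufeff]', '', text)

print(clean(sample))  # 6.A B C6.1Stuff177.X Y Z
```

Note how the page number 7 ends up fused to the heading as "77.X Y Z" once the line breaks are gone.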
Now I want to pick out the X Y Z part, so that I get:
'7.X Y Z7.1bla bla bla.7.2bla bla bla.7.3bla bla 17.4bla bla bla and more bla bla bla7.5stuff'
So I tried doing the following:
pattern = r"^(\d+\.)*X Y Z"
m = re.search(pattern, d5, re.MULTILINE)
if m:
    print(m.group())
which doesn't print anything. What am I doing wrong in my pattern? Also, is there a more compact way to get from d1 to d5? I am sure my way is not the cleverest.
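A likely reason for the empty result: `^` matches only at the start of the string or right after a newline, and the flattened string has no newlines left, so the pattern can only match at position 0, where the text begins with "6.A B C". The sketch below uses a shortened stand-in for d5 (the name `flat` is hypothetical) to show this. It also shows that dropping the anchor does find the title, but swallows the page number that was fused in front of it:

```python
import re

# Shortened stand-in for the flattened string d5.
flat = '6.A B C6.1Stuff16.4hola hola77.X Y Z7.1bla bla bla.7.5stuff8.MNO8.1bla bla'

# With ^ the only candidate position is index 0, and "6.A B C" is not "X Y Z".
anchored = re.search(r"^(\d+\.)*X Y Z", flat, re.MULTILINE)
print(anchored)  # None

# Without the anchor the title is found, but \d+ also grabs the page number 7.
unanchored = re.search(r"(\d+\.)X Y Z", flat)
print(unanchored.group())  # 77.X Y Z
```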
Note: the position of X Y Z is not always 7 and can vary, so I cannot do re.findall(r'^7\..*', doc, re.MULTILINE).
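Since the section number varies, one possible approach is to search by title on the text before it is flattened, while line starts still exist. The following is a sketch under assumptions, not a definitive solution: it assumes top-level headings always look like "digits, dot, tab, title" at the start of a line, and that the section ends at the next such heading. The function name `extract_section` and the shortened `sample` are made up for this illustration:

```python
import re

# Shortened stand-in for the original text.
sample = ('\ufeff6.\tA B C\n\n6.1\tStuff1\n\n7\n\x0c7.\tX Y Z\n\n'
          '7.1\tbla bla bla.\n\n7.4\tbla bla bla \n\nand more bla bla bla\n\n'
          '7.5\tstuff\n\n8.\tMNO\n\n8.1\tbla bla \n')

def extract_section(text, title):
    # Remove only the BOM and form feeds first; a form feed in front of
    # "7.\t" would otherwise stop ^ from matching at the heading.
    text = re.sub(r'[\ufeff\x0c]', '', text)
    # Heading line, then lazily everything up to the next top-level
    # heading ("digits.<tab>") or the end of the text.
    pattern = r'^\d+\.\t' + re.escape(title) + r'$.*?(?=^\d+\.\t|\Z)'
    m = re.search(pattern, text, re.MULTILINE | re.DOTALL)
    if m is None:
        return None
    # Flatten the matched section by dropping newlines and tabs.
    return re.sub(r'[\n\t]', '', m.group())

print(extract_section(sample, 'X Y Z'))
# 7.X Y Z7.1bla bla bla.7.4bla bla bla and more bla bla bla7.5stuff
```

A section that runs across a page break would still contain the bare page-number line, so that case would need an extra rule.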