
Suppose I have the following text:

doc = '\ufeff6.\tA B C\n\n6.1\tStuff1\n\n6.2\tStuff2\n\n(a)\tSubstuff1\n\n(b)\tSubstudff2\nblabla\n\n6.3\tbla bla\n\nbla bla\n\n6.4\thola hola\n\n\n\n7\n\x0c7.\tX Y Z\n\n7.1\tbla bla bla.\n\n7.2\tbla bla bla.\n\n7.3\tbla bla 1\n\n7.4\tbla bla bla \n\nand more bla bla bla\n\n7.5\tstuff\n\n8.\tMNO\n\n8.1\tbla bla \n\n(a)\tbla bla;\n\n(b)\tbla bla,\n\n8\n\x0cExtra\n\n(c)\tExtra1\n\n8.2\tExtra2\n'

which looks like

6.  A B C

6.1 Stuff1

6.2 Stuff2

(a) Substuff1

(b) Substudff2
blabla

6.3 bla bla

bla bla

6.4 hola hola



7
7.  X Y Z

7.1 bla bla bla.

7.2 bla bla bla.

7.3 bla bla 1

7.4 bla bla bla 

and more bla bla bla

7.5 stuff

8.  MNO

8.1 bla bla 

(a) bla bla;

(b) bla bla,

8
Extra

(c) Extra1

8.2 Extra2

The idea is to pick up section 7. However, since there are many newlines, tabs and special characters such as \x0c and \ufeff, I decided to clean those up as follows:

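import re

d1 = doc.replace("\n", "")
d2 = d1.replace("\n\n", "")  # no-op: d1 no longer contains any "\n"
d3 = d2.replace("\t", "")
# strips control characters and \x7f-\xff; \ufeff is outside this range
d4 = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', d3)
d5 = d4.replace("\ufeff", "")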
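
On the question of compressing the d1 to d5 chain: a minimal sketch, assuming the goal is simply to delete the same characters in one pass. The names `JUNK`, `clean` and `sample` are made up for illustration, and `sample` is a shortened piece of `doc`.

```python
import re

# Shortened piece of the question's `doc` (illustrative only).
sample = '\ufeff6.\tA B C\n\n6.1\tStuff1\n\n7\n\x0c7.\tX Y Z\n'

# One character class: newline, tab, the control/\x7f-\xff ranges used
# in the question, and the byte-order mark \ufeff.
JUNK = re.compile(r'[\n\t\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff\ufeff]')

def clean(text):
    """Delete every character in JUNK in a single pass."""
    return JUNK.sub('', text)

print(clean(sample))  # 6.A B C6.1Stuff177.X Y Z
```

Note that the \x7f-\xff range also deletes accented Latin-1 letters, exactly as the original d4 step does.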

So now d5 looks like:

'6.A B C6.1Stuff16.2Stuff2(a)Substuff1(b)Substudff2blabla6.3bla blabla bla6.4hola hola77.X Y Z7.1bla bla bla.7.2bla bla bla.7.3bla bla 17.4bla bla bla and more bla bla bla7.5stuff8.MNO8.1bla bla (a)bla bla;(b)bla bla,8Extra(c)Extra18.2Extra2'

Now I want to pick the X Y Z part so that I get:

'7.X Y Z7.1bla bla bla.7.2bla bla bla.7.3bla bla 17.4bla bla bla and more bla bla bla7.5stuff'

So I tried doing the following:

pattern = r"^(\d+\.)*X Y Z"

m = re.search(pattern, d5, re.MULTILINE)
if m:
    print(m.group())

which doesn't produce any output. What am I doing wrong in my pattern? Also, is there a better way to get from d1 to d5? I am sure my way is not clever.
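
A small reproduction of the failure, as a sketch on a shortened stand-in for d5 (the string `s` is invented for illustration). Once the newlines are gone, `^` can only match at the very start of the string even with re.MULTILINE, and the string starts with "6.A", not with the X Y Z heading.

```python
import re

# Shortened stand-in for the cleaned string d5 (no newlines left).
s = '6.A B C6.4hola hola77.X Y Z7.1bla bla bla.8.MNO'

# Anchored as in the question: ^ matches only at position 0 here.
anchored = re.search(r'^(\d+\.)*X Y Z', s, re.MULTILINE)
print(anchored)  # None

# Without the anchor a match exists, but the page number 7 has fused
# with the section number, so the match starts at "77." instead of "7.".
unanchored = re.search(r'(\d+\.)*X Y Z', s)
print(unanchored.group())  # 77.X Y Z
```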

Note: the position of X Y Z is not always 7 and can vary, so I cannot do re.findall(r'^7\..*', doc, re.MULTILINE).

Wiliam
  • Try your regexes on regex101.com; it should help a lot in fine-tuning them – pygri Nov 17 '21 at 15:28
  • This part `(\d+\.)*` repeats digits and a dot only. You should use `^(\d+\.).*?X Y Z` to match until the first occurrence of X Y Z (you can also omit the capture group). See https://regex101.com/r/rQ04zL/1 – The fourth bird Nov 17 '21 at 15:28
  • @Thefourthbird Oh, but then I was unclear: I want to pick the X Y Z part, namely `'7.X Y Z7.1bla bla bla.7.2bla bla bla.7.3bla bla 17.4bla bla bla and more bla bla bla7.5stuff'` – Wiliam Nov 17 '21 at 15:31
  • @Wiliam You might also use the pattern from your previous question, and remove all the characters afterwards from the match using `^(\d+\.)\s*X Y Z(?:\s*\n\1\b.*)*` https://regex101.com/r/YtJe3K/1 – The fourth bird Nov 17 '21 at 15:34
  • @Thefourthbird I have tried that before, as you can see from your link, it doesn't pick up the 7.5 as there is a new line on 7.4. This is why I cleaned. Would be very kind if you see my question thoroughly – Wiliam Nov 17 '21 at 15:37
  • this is not a duplicate - we don't want to select from beginning up until specific point. We are trying to pick the middle part where have section X Y Z. Would be great if you vote to reopen! – Wiliam Nov 17 '21 at 15:39
  • @Wiliam Then you could first do the replacements, but not for the newlines, or else you will lose the structure, like https://ideone.com/9bh36I – The fourth bird Nov 17 '21 at 15:42
  • @Thefourthbird in my example I must pick from 7. XYZ all the way until the end of 7.5. As you can see in the original text 7.4 has a new line `'7.4\tbla bla bla \n\nand more bla bla bla\n\n7.5\tstuff'` that should not be there, this is why I am cleaning them. Also in the link above the 7.5 is not picked again. – Wiliam Nov 17 '21 at 15:47
  • @Wiliam I see, then it could be like this https://ideone.com/8ytYfP – The fourth bird Nov 17 '21 at 15:53
  • @Thefourthbird This indeed works, and I must say it is complicated code :) My worry, however, is how well it generalises, since some docs might not have the newline issue seen in this example. I wonder if it is possible to do this without keeping the lines? In the end I'll use the output for NLP and even the numbers will be deleted. So I wonder if one can do this for `d5` above (namely after all the cleaning) – Wiliam Nov 17 '21 at 16:03
  • @Wiliam The thing is that the current value of d5 from the question does not have newlines anymore. So the string contains values like `hola77.X` and `bla 17.4bla ` which makes it hard to get a start and endpoint for the match. See the issue here https://regex101.com/r/QDdByQ/1 – The fourth bird Nov 17 '21 at 16:16
  • I see - maybe after all I should look into other methods for extractions. – Wiliam Nov 17 '21 at 18:31
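
One way to act on the point made in the comments (keep the newlines until the section has been located, then flatten) might look like the sketch below. It is not the code behind the ideone links, whose contents are not shown here. The helper name `extract_section` and the shortened `sample` are made up for illustration, and it assumes that every top-level heading has the form "N." followed by whitespace at the start of a line, and that no body line inside the section has that form.

```python
import re

def extract_section(text, title):
    """Return the flattened section whose heading is 'N. <title>', or None."""
    # Drop only the form feed and BOM; newlines stay so that headings
    # can still be recognised at the start of a line.
    text = text.replace('\x0c', '').replace('\ufeff', '')
    lines = text.split('\n')
    # Assumed shape of any top-level heading, e.g. "7.\tX Y Z" or "8.\tMNO".
    heading = re.compile(r'^\d+\.\s')
    start_re = re.compile(r'^\d+\.\s*' + re.escape(title) + r'\s*$')
    start = next((i for i, ln in enumerate(lines) if start_re.match(ln)), None)
    if start is None:
        return None
    # The section ends at the next top-level heading, or at end of text.
    end = next((i for i in range(start + 1, len(lines))
                if heading.match(lines[i])), len(lines))
    # Only now flatten: join the lines and remove the tabs.
    return ''.join(lines[start:end]).replace('\t', '')

# Shortened version of the question's `doc` (illustrative only).
sample = ('\ufeff6.\tA B C\n\n6.4\thola hola\n\n\n\n7\n\x0c7.\tX Y Z\n\n'
          '7.1\tbla bla bla.\n\n7.4\tbla bla bla \n\nand more bla bla bla\n\n'
          '7.5\tstuff\n\n8.\tMNO\n\n8.1\tbla bla \n')

print(extract_section(sample, 'X Y Z'))
# 7.X Y Z7.1bla bla bla.7.4bla bla bla and more bla bla bla7.5stuff
```

The section number is never hard-coded, so the heading can sit at 7 or anywhere else. A stray page-number line that falls inside the section itself would still be kept and would need separate handling.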
