Regular expression python 4 integers followed by a space

Question

I am trying to split a file each time a line in the file contains exactly 4 integers followed by a space. I think I am almost there (after looking at other questions and examples), but I need a last push. Could anyone help me out?

The script splits on all lines that start with 4 integers. It needs to split only when it's just 4 integers and not more than 4.

import re
file = open('test.txt', 'r')

Try 1

for x in file.read().split(re.match(r"[0-9]{4}\s", file.readline())):
       print (x)

Try 2

for x in file.read().split(re.match(r"[0-9][0-9][0-9][0-9]\s", file.readline())):
       print (x)

Try 3

for x in re.split(r"[0-9]{4}\s", file.read()):
    print (x)

Sample input

1020                                                                                                                                                                                                                                                            
200123242151111231                                 bla             bla                                       bla
200123331231231441                                 bla             bla                                       bla
1030
200123242151111231                                 bla             bla                                       bla
200123331231231441                                 bla             bla                                       bla

Wished for output is the above content split in:

200123242151111231                                 bla             bla                                       bla
200123331231231441                                 bla             bla                                       bla

and

200123242151111231                                 bla             bla                                       bla
200123331231231441                                 bla             bla                                       bla

What is in the test file? What results do your current solutions give you? What is wrong with those results? – mrzasa, Apr 06 '18 at 07:51
The file it should eventually work on is 268 MB, but I have a test file of 36 MB – Zuenie, Apr 06 '18 at 07:55

Alex Hall · Accepted Answer · 2018-04-06T08:34:58.730

3

re.match(r"[0-9]{4}\s", file.readline())

This reads one line of the file and matches the regex against it. .split(...) then uses the result of that as a static delimiter to split the entire file. This has no relation to what you want to achieve.

(It actually doesn't even do that, because the entire file has already been read by the time file.readline() is called, but that's not the point.)

Perhaps you were thinking of doing something like .split(re.compile(...))? In any case, that doesn't work either: str.split doesn't deal with regexes.
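To make the difference concrete, here is a small sketch. The sample string and variable names are made up for illustration. It shows that str.split treats its argument as literal text, that re.match returns a match object rather than a pattern, and that re.split is the call that actually applies the regex:

```python
import re

# Hypothetical miniature of the file contents.
text = "1020 \nabc\n1030 \ndef\n"

# str.split looks for the literal characters "[0-9]{4}\s", which never occur,
# so the text comes back as a single piece.
literal_parts = text.split(r"[0-9]{4}\s")

# re.match gives back a match object (or None), not a reusable pattern.
# Taking its matched text yields one fixed delimiter: "1020 ".
match = re.match(r"[0-9]{4}\s", text)
static_parts = text.split(match.group(0))

# re.split applies the pattern everywhere in the text.
regex_parts = re.split(r"[0-9]{4}\s", text)

print(literal_parts)
print(static_parts)
print(regex_parts)
```

Only the last call splits on both "1020 " and "1030 ". The static delimiter version only ever splits on whatever the first line happened to contain.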

Try re.split(r"\b[0-9]{4}\s+", file.read()) to split the file into pieces separated by 4-digit numbers. The \b means 'word boundary' and prevents it from splitting on 4 digits that are just the ends of longer numbers. Note that if your file starts with a 4-digit number, the first piece will be empty.
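As a rough check, the pattern can be tried on a shortened version of the sample input from the question. The column spacing is reduced here and the variable names are only illustrative:

```python
import re

# Shortened stand-in for the sample input in the question.
block = (
    "200123242151111231   bla   bla\n"
    "200123331231231441   bla   bla\n"
)
sample = "1020 \n" + block + "1030\n" + block

# Without \b, the last 4 digits of each long number also act as a delimiter.
without_boundary = re.split(r"[0-9]{4}\s+", sample)

# With \b, only standalone 4-digit numbers are delimiters.
with_boundary = re.split(r"\b[0-9]{4}\s+", sample)

# The text starts with a 4-digit number, so the first piece is empty; drop it.
pieces = [piece for piece in with_boundary if piece]

print(len(without_boundary), len(with_boundary))
for piece in pieces:
    print(repr(piece))
```

Because \s+ also matches the newline after the 4-digit number, each remaining piece begins directly with the first data line.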

edited Apr 06 '18 at 08:34

answered Apr 06 '18 at 08:02


Ok that makes a lot of sense. Still it gives back a lot of lines that have a large integer at the start. It does not seem to take the space as a factor, the \s. My code looks like this now: for x in re.split(r"[0-9]{4}\s", file.read()): print (x) for x in file.read().split(re.match(r"[0-9]{4}\s", file.read())): print (x) – Zuenie Apr 06 '18 at 08:10
Your code is what I had in mind, but I realise now that I answered a bit hastily because I don't think I quite understood what you're trying to do, hence I asked for sample input and output. – Alex Hall Apr 06 '18 at 08:11

score 0 · Answer 2 · answered Apr 06 '18 at 07:57

You read the file with readline, which reads it line by line, splitting the file on newlines.

If the file is not very big, you can read it at once, e.g.

with open(file_path, 'r') as file:
    content = file.read()

(see this answer)

and then apply the regexp.
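Put together, that could look roughly like the sketch below. The function name and the temporary demo file are made up for illustration, and the pattern is the one suggested in the accepted answer. It assumes the whole file fits comfortably in memory:

```python
import os
import re
import tempfile


def split_file_on_four_digit_lines(file_path):
    """Read the whole file, then split it on standalone 4-digit numbers."""
    with open(file_path, "r") as file:
        content = file.read()
    # Drop empty pieces, e.g. the one before a leading 4-digit number.
    return [piece for piece in re.split(r"\b[0-9]{4}\s+", content) if piece]


# Small throwaway file so the sketch can be run as-is.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1020 \n20012324 bla\n1030\n20012333 bla\n")
    demo_path = tmp.name

pieces = split_file_on_four_digit_lines(demo_path)
os.remove(demo_path)
print(pieces)
```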
