
I'm trying to extract some information from a file.

File:

ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34  CONTENT ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34  CONTENT

CONTENT
CONTENT
CONTENTCONTENTCONTENT CONTENTCONTENT
CONTENTCONTENTCONTENTCONTENTCONTENT
CONTENTCONTENTCONTENTCONTENTCONTENT

I want to run multiple patterns against this file, but after I extract the first piece of information, every later read of the file comes back empty.

import re
import pdb

w = open("extractfile.txt","r")

print w.read()
print re.findall(r'CONTENT', w.read())
print re.findall(r'\w{3} \d{2}-\d{2}-\d{2}', w.read())

Output:

ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34  CONTENT ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34  CONTENT

CONTENT
CONTENT
CONTENTCONTENTCONTENT CONTENTCONTENT
CONTENTCONTENTCONTENTCONTENTCONTENT
CONTENTCONTENTCONTENTCONTENTCONTENT



[]
[]

If I change the order of the print statements, only the first one produces output; the rest come back empty. Another idea I had was to combine multiple patterns into one expression using groups, but I don't know whether that would work.

Shinomoto Asakura

1 Answer

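The file object keeps track of its position: the first w.read() consumes the whole file, so each later w.read() returns an empty string. Read the file once into a variable and run every pattern against that string:

>>> import re
>>> with open('extractfile.txt', 'r') as txt:
...     file = txt.read()

>>> content = re.findall(r'CONTENT', file)
>>> content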
['CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT']
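As a rough illustration of why the question's second and third prints were empty, the sketch below reads the same file handle twice and then rewinds it with seek(0). It assumes Python 3; the temporary file and its two-line sample contents are stand-ins for extractfile.txt, not data from the question.

```python
import os
import tempfile

# Write a small stand-in for extractfile.txt (hypothetical sample data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("ASD 23-02-34 CONTENT\nCONTENT\n")
    path = tmp.name

with open(path, "r") as f:
    first = f.read()    # reads everything; the position is now at end of file
    second = f.read()   # nothing left to read, so this is an empty string
    f.seek(0)           # rewind to the start of the file
    third = f.read()    # the full contents again

os.remove(path)

print(repr(second))
print(first == third)
```

Calling seek(0) before each read would also work, but storing the text once in a variable avoids re-reading the file for every pattern.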

>>> pattern = re.findall(r'(?P<asd>[\w]+ )(?P<dgt>[\d-]+)', file)
>>> pattern
[('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34')]

The whitespace after [\w]+ can also be excluded from the <asd> group by moving it outside the group, but this may be slightly slower since the regex engine ends up doing more steps.

re.findall(r'(?P<asd>[\w]+) (?P<dgt>[\d-]+)', file)
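On the question's idea of handling multiple patterns in one expression with groups: one possible approach, sketched here under assumptions, is a single alternation with named groups, sorted by Match.lastgroup. The group names `stamp` and `content` and the sample text are illustrative only, and the code assumes Python 3.

```python
import re

# Hypothetical sample text in the same shape as the question's file.
text = "ASD 23-02-34 ASD 23-02-34  CONTENT\nCONTENTCONTENT CONTENT\n"

# One alternation with two named groups; the names are illustrative.
combined = re.compile(r"(?P<stamp>\w{3} \d{2}-\d{2}-\d{2})|(?P<content>CONTENT)")

results = {"stamp": [], "content": []}
for m in combined.finditer(text):
    # m.lastgroup is the name of the alternative that matched
    results[m.lastgroup].append(m.group())

print(results)
```

This scans the text once instead of once per pattern. Whether it is clearer than two separate re.findall calls is a matter of taste; for only two patterns, separate calls are probably easier to read.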
deadvoid