I have a file similar to this:

RANDOMTEXTSAMPLE*
$SAMPLERANDOMTEXT
RANDOMSAMPLE*TEXT

I'm trying to extract and put into a list all instances of "sample" that have * at the end.

I tried with something like this:

import re

with open('file1.txt') as myfile:
    content = myfile.read()

text = re.search(r'[0-9A-Z]{7}\*', content)
with open("file2.txt", "w") as myfile2:
    myfile2.write(text)

However, this only gets me the first result it finds.

Any recommendations on how I can get all the instances of sample that end with * into a list, without adding the * to the list, would be appreciated.

Thanks

EDIT: small corrections

motionsickness

2 Answers

3

You can try this:

import re

samples = []

with open('file1.txt') as myfile:
    for line in myfile:  # iterate over the file one line at a time
        if re.search(r'[0-9A-Z]{6}\*', line):
            samples.append(line)  # keeps the whole matching line, asterisk included

# print('SAMPLES: ', samples)

with open("file2.txt", "w") as myfile2:
    for s in samples:
        myfile2.write(s)
Nurjan
  • Thanks. I have seen some examples like this, but sample is a group of 7 alphanumeric characters. That's why I went to regex. Also, I need them without the *. There are other instances that are similar to sample but don't have a * at the end, and I don't need those. – motionsickness Jun 23 '17 at 06:05
  • @motionsickness Ahh, ok. I will edit the answer. I thought you were looking for the word `SAMPLE*` only ))). – Nurjan Jun 23 '17 at 06:06
  • Beautiful. Two questions. Does line 3 serve any purpose? Do you know if there's any way that the samples get added without the *? I can just remove them from the file later with a replace, but I was wondering if it was possible to add them without it. – motionsickness Jun 23 '17 at 06:38
  • @motionsickness No, line 3, which contained the variable `text = ''`, is not needed. Before putting lines into the `samples` list you can use `line = line.replace('*', '')`. However, the whole line, including `sample` and the other words, will remain; only the asterisk will be removed. – Nurjan Jun 23 '17 at 06:45
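The difference described in the last comment can be sketched roughly as follows: `str.replace` drops the asterisk but keeps the rest of the line, while a capturing group returns only the matched token. The 7-character pattern comes from the question; the sample lines and the names `stripped_lines` and `tokens` are made up for illustration.

```python
import re

# Hypothetical input lines, invented for this sketch (not from the original file).
lines = ["xxABC1234*yy\n", "$ABC1234 no star\n", "ZZ98765*\n"]

# The parentheses capture the 7 characters; the trailing \* is matched but not captured.
pattern = re.compile(r'([0-9A-Z]{7})\*')

stripped_lines = []  # whole lines with the asterisk removed
tokens = []          # only the 7-character tokens that preceded an asterisk

for line in lines:
    if pattern.search(line):
        stripped_lines.append(line.replace('*', ''))
        tokens.extend(pattern.findall(line))

print(stripped_lines)
print(tokens)
```

With a single capturing group in the pattern, `re.findall` returns the captured text rather than the full match, which is why the asterisk never reaches `tokens`.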
0

From your question it is not clear whether you want to match a dollar sign or an asterisk at the end. In either case, you can solve the problem using a capturing group, the construct that back-references refer to. If you don't know what they are, you can read more about back-references here.

import re

samples = []
# Capture the letters immediately before an asterisk; group(1) excludes the '*'.
pattern = re.compile(r'([a-zA-Z]+)\*')

with open("file1.txt", "r") as myfile:
    for line in myfile:
        for matched_object in pattern.finditer(line):
            samples.append(matched_object.group(1))

This gives you a list of samples. You can see a regex demo here.

Note: Since it is not clear what you are trying to match, you may need to modify the captured group in my regular expression to match your concrete input. Anyway, this code snippet should give you an overall idea of how this problem can be solved.
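As one possible adaptation, assuming the target is the 7 uppercase-alphanumeric characters immediately before each asterisk (as in the question's own pattern), the whole file content can be scanned in one pass with `re.findall`, which returns every match instead of only the first one that `re.search` gives. The helper name `extract_samples` and its `width` parameter are hypothetical.

```python
import re

def extract_samples(content, width=7):
    """Return every run of `width` characters from [0-9A-Z] that is directly
    followed by '*'. The asterisk is matched but left out of the result."""
    pattern = re.compile(r'([0-9A-Z]{%d})\*' % width)
    return pattern.findall(content)

# The three example lines from the question, joined as file content.
content = "RANDOMTEXTSAMPLE*\n$SAMPLERANDOMTEXT\nRANDOMSAMPLE*TEXT\n"
samples = extract_samples(content)
print(samples)  # ['TSAMPLE', 'MSAMPLE']
```

To save the result one item per line, something like `myfile2.write("\n".join(samples))` should work, since `write` expects a string rather than a list or a match object.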

Aleksandar Makragić
  • Thanks! I did mess up my question a bit, which made it confusing to read. I'll make sure to read about back-references! – motionsickness Jun 23 '17 at 06:41