1

I have a list of strings in a file. I am trying to extract a substring from each string and print it. The strings look like this:

Box1 is lifted\nInform the manufacturer
Box2 is lifted\nInform the manufacturer
Box3, Box4 is lifted\nInform the manufacturer
Box5, Box6 is lifted\nInform the manufacturer
Box7 is lifted\nInform the manufacturer

From each line I have to extract the string before \n and print it. I used the following Python regex to do that: term = r'.*-\s([\w\s]+)\\n'. This regex works for the 1st, 2nd and last lines, but not for the 3rd and 4th, since those contain a comma. How should I modify my regex to handle that?

Expected results -

Box1 is lifted
Box2 is lifted
Box3 Box4 is lifted
Box5 Box6 is lifted
Box7 is lifted

Results obtained currently -

Box1 is lifted
Box2 is lifted
Box2 is lifted
Box2 is lifted
Box7 is lifted
Gargi
  • Possible duplicate of [How can I split and parse a string in Python?](https://stackoverflow.com/questions/5749195/how-can-i-split-and-parse-a-string-in-python) – max Nov 29 '17 at 19:28
  • Do the strings contain newline characters, or do they contain a literal "\" followed by "n"? Your regex seems to suggest the latter, but a lot of the answers you've got are assuming the former. – ekhumoro Nov 29 '17 at 19:40
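The distinction raised in the last comment can be checked directly. The following is a minimal sketch; `raw_line` is a hypothetical sample that mimics what reading the file would return if the file holds a backslash followed by the letter n rather than an actual line break:

```python
# Hypothetical sample: one line as read from a file that contains
# a backslash followed by the letter "n" (two characters).
raw_line = r"Box1 is lifted\nInform the manufacturer"

has_real_newline = "\n" in raw_line        # looks for an actual newline character
has_literal_sequence = r"\n" in raw_line   # looks for backslash + "n"

print(has_real_newline)      # False
print(has_literal_sequence)  # True
```

If the second check is the one that succeeds on your data, solutions that split on '\n' need r'\n' (or '\\n') instead.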

7 Answers

2

If this is a consistent format, you could just split on the newline:

''.join(YOURSTRING.split('\n')[0].split(','))

Edited because I missed the part about removing the comma.

Aaron Lael
2

Regex is overkill for basic string operations like this. Use the built-in string methods, such as partition and replace:

for line in lines:
    # Keep only the text before the first newline.
    first, sep, last = line.partition('\n')
    # Drop the commas.
    newline = first.replace(',', '')
    print(newline)

Edit: if \n is a literal two-character sequence (a backslash followed by n) in a line read from a file, use r'\n' instead of '\n'.
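As a rough sketch of the literal-sequence case, the partition approach might look like the following. The helper name `first_part` and the sample list are made up for illustration; the separator defaults to the raw string r"\n" on the assumption that the file contains a backslash followed by n:

```python
def first_part(line, sep=r"\n"):
    """Return the text before `sep`, with commas removed."""
    head, _, _ = line.partition(sep)
    return head.replace(",", "")

# Hypothetical sample lines; raw strings keep the backslash literal.
lines = [
    r"Box1 is lifted\nInform the manufacturer",
    r"Box3, Box4 is lifted\nInform the manufacturer",
]

results = [first_part(line) for line in lines]
for result in results:
    print(result)
# Box1 is lifted
# Box3 Box4 is lifted
```

Passing sep="\n" instead covers the case where the string really contains a newline character.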

Pulsar
  • The OP is reading the strings from a file. By definition, a line cannot contain a newline, so your code cannot possibly work. – ekhumoro Nov 29 '17 at 19:34
2

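The comma isn't part of either the \w or the \s character set. term = r'.*-\s([\w\s,]+)\\n' should do what you want. Note that the captured group will still contain the comma, so remove it afterwards if you don't want it in the output.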
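A small sketch of how the widened character class could be used. Two assumptions: the leading `.*-\s` from the question is left out because the sample lines shown contain no hyphen, and the helper name `extract` is invented for this example:

```python
import re

# Assumption: the `.*-\s` prefix from the question is omitted, since the
# sample lines in the question contain no hyphen.
term = re.compile(r'([\w\s,]+)\\n')

def extract(line):
    """Return the text before the literal backslash-n, without commas."""
    match = term.match(line)
    if match is None:
        return None  # the line does not contain the literal sequence
    return match.group(1).replace(',', '')

sample = r"Box3, Box4 is lifted\nInform the manufacturer"
print(extract(sample))  # Box3 Box4 is lifted
```

The regex only locates the text; the comma is removed in a separate replace step because a single capture group cannot skip characters in the middle of its match.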

Mateo
1

Why not something as simple as term = r"[*]*(is lifted)"? Or don't use regex at all if it isn't required. EDIT: I think this might be better: term = r"(Box[0-9])?(, Box[0-9])*\s(is lifted)"

theBrainyGeek
1

What about something like this?

ok = '''Box1 is lifted\\nInform the manufacturer
Box2 is lifted\\nInform the manufacturer
Box3, Box4 is lifted\\nInform the manufacturer
Box5, Box6 is lifted\\nInform the manufacturer
Box7 is lifted\\nInform the manufacturer
'''
# Split on the trailing sentence, normalise whitespace, then drop the
# literal "\n" and the commas. Empty leftover pieces are skipped.
strings = [' '.join(x.split()).replace('\\n', '').replace(',', '')
           for x in ok.split('Inform the manufacturer') if x.strip()]
for x in strings:
    print(x)

output:

Box1 is lifted
Box2 is lifted
Box3 Box4 is lifted
Box5 Box6 is lifted
Box7 is lifted
0

Let me know if the below works for you.

text = "Box3, Box4 is lifted\nInform the manufacturer"
# Remove the first comma, then keep everything before the newline.
text = text.replace(",", "", 1)
print(text)
print(text[0:text.index("\n")])
text = "Box1 is lifted\nInform the manufacturer"
print(text[0:text.index("\n")])
kemparaj565
0

You can use a regex with a lookahead to match everything before the literal \n:

One-line solution:

import re
pattern=r'\w.+(?=\\n)'

print([re.search(pattern,line).group() for line in open('file','r')])

output:

['Box1 is lifted', 'Box2 is lifted', 'Box3, Box4 is lifted', 'Box5, Box6 is lifted', 'Box7 is lifted']

Detailed solution:

import re
pattern=r'\w.+(?=\\n)'
with open('newt','r') as f:
    for line in f:
        print(re.search(pattern,line).group())

output:

Box1 is lifted
Box2 is lifted
Box3, Box4 is lifted
Box5, Box6 is lifted
Box7 is lifted
Aaditya Ura