I have built this small script to grab the pkeys (numbers) from a string (the string could be as long as a page) and then display the unique pkeys row-wise. For example, if the string is "Failure: Cannot retrieve market data: [Historical correlation for instruments 48021088, 1029755
is older than 2M, Historical correlation for instruments 48021088, 1029755,5445454 is older than 2M, Error while loading Structured Product market data"

the output should be: 48021088 1029755 5445454

But right now my output looks like [4802108 1029755 2 544545].

Note: it should not take 2 as a pkey from the string "2M", i.e. 2 months.

Also, whenever I copy a long string into this code, I have to put \ at the end of each line to make it run. Is there something I can do so that, as soon as I paste a string copied from Outlook or any other source into this code, it is formatted automatically and the \ is inserted by itself?

import re
import numpy as np
import pandas as pd


regex = r'\d+'
match = re.findall(regex, 'Failure: Cannot retrieve market data: [Historical correlation for instruments 48021088, 1029755 \
is older than 2M, Historical correlation for instruments 48021088, 1029755 is older than 2M, Error while loading Structured Product market data \
Failure: Cannot retrieve market data: [Historical correlation for instruments 52598110, 35602558 is older than 2M, Historical correlation for instruments \
52598110, 35602558 is older than 2M, Historical correlation for instruments 52598110, 35602558 is older than 2M, Historical correlation for instruments 52598110, \
35602558 is older than 2M, Error while loading Structured Product market data \
Failure: Cannot retrieve market data: [Historical correlation for instruments 48021088, 1029755 is older than 2M, Historical correlation for instruments 48021088, 1029755 \
is older than 2M, Error while loading Structured Product market data \
Failure: Cannot retrieve market data: [Historical correlation for instruments 612292, 52598110 is older than 2M, Historical correlation for instruments 612292, 52598110 is \
older than 2M, Historical correlation for instruments 612292, 52598110 is older than 2M, Historical correlation for instruments 612292, 52598110 is older than 2M, \
Error while loading Structured Product market data \
Failure: Cannot retrieve market data: [Historical correlation for instruments 489459, 104322960 is older than 2M, Historical correlation for instruments 489459, \
104322960 is older than 2M, Historical correlation for instruments 489459, 104322960 is older than 2M, Historical correlation for instruments 489459, \
104322960 is older than 2M, Error while loading Structured Product market data')
res = list(map(int,match))
x = res
# print(str(x))
unique_numbers = list(set(x))
print(np.transpose(unique_numbers))
Naina

2 Answers

Strings delimited by ' ' or " " must fit on a single line. To span multiple lines, either end each line with \ or delimit the string with ''' ''' or """ """. A triple-quoted string keeps the line breaks as newline characters, so pasted text needs no editing.
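A minimal sketch of the triple-quoted form, using a shortened version of the message from the question. The variable names `text` and `numbers` are only illustrative.

```python
import re

# Triple quotes let a pasted, multi-line message stay exactly as copied,
# with no trailing backslashes. `text` is an illustrative name.
text = """Failure: Cannot retrieve market data: [Historical correlation for instruments 48021088, 1029755
is older than 2M, Error while loading Structured Product market data"""

# The line break from the paste is kept inside the string as a newline.
print("\n" in text)

# The original \d+ pattern still works across the newline
# (and still picks up the unwanted 2 from "2M").
numbers = re.findall(r"\d+", text)
print(numbers)
```

Since the regex treats a newline like any other non-digit character, the extra line breaks do not affect which numbers are found.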

Regarding your regex, I see that you use \d+ to indicate at least one digit. You can change that to at least n digits by using \d{n,}. With any n of 2 or more, the lone 2 in "2M" no longer matches.
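As a rough illustration of the \d{n,} idea, the sketch below assumes that every pkey has at least five digits. That threshold is a guess based on the sample message, not something stated in the question.

```python
import re

# Shortened sample in the style of the message from the question.
text = ("instruments 48021088, 1029755 is older than 2M, "
        "instruments 612292, 52598110 is older than 2M")

# \d{5,} matches runs of at least five digits, so the single "2" in "2M"
# is skipped. The minimum length of 5 is an assumption about the pkeys.
pkeys = re.findall(r"\d{5,}", text)
print(pkeys)
```

This relies only on length, so a period such as "12345M" would still be picked up. It is a reasonable fit only if short numbers are never valid pkeys.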

Alexandre Marcq

This should do the trick:

import re

regex = r'(?<=\s)(\d+)(?=\s|,)'  # positive lookbehind and positive lookahead

# note the triple quotes, which remove the need for a trailing \ on each line
match = re.findall(regex, """Failure: Cannot retrieve market data: [Historical correlation for instruments 48021088, 1029755
is older than 2M, Historical correlation for instruments 48021088, 1029755 is older than 2M, Error while loading Structured Product market data
Failure: Cannot retrieve market data: [Historical correlation for instruments 52598110, 35602558 is older than 2M, Historical correlation for instruments
52598110, 35602558 is older than 2M, Historical correlation for instruments 52598110, 35602558 is older than 2M, Historical correlation for instruments 52598110,
35602558 is older than 2M, Error while loading Structured Product market data
Failure: Cannot retrieve market data: [Historical correlation for instruments 48021088, 1029755 is older than 2M, Historical correlation for instruments 48021088, 1029755
is older than 2M, Error while loading Structured Product market data
Failure: Cannot retrieve market data: [Historical correlation for instruments 612292, 52598110 is older than 2M, Historical correlation for instruments 612292, 52598110 is
older than 2M, Historical correlation for instruments 612292, 52598110 is older than 2M, Historical correlation for instruments 612292, 52598110 is older than 2M,
Error while loading Structured Product market data
Failure: Cannot retrieve market data: [Historical correlation for instruments 489459, 104322960 is older than 2M, Historical correlation for instruments 489459,
104322960 is older than 2M, Historical correlation for instruments 489459, 104322960 is older than 2M, Historical correlation for instruments 489459,
104322960 is older than 2M, Error while loading Structured Product market data""")

unique_numbers = list(set(match))
print(*unique_numbers)  # print as required

Explanation

From the Python Regex docs

(?<=...) Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion.

(?=...) Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion.

Together these make up a lookaround assertion. The pattern has the form <look behind>(\d+)<look ahead>. The () marks the capturing group, which is what the findall call returns. \d+ is as before: one or more digits.
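One thing to watch: the lookbehind requires whitespace before each number, so a number that directly follows a comma, such as 5445454 in "1029755,5445454" from the question's example, would not be matched. Also, a set does not keep the order in which the numbers first appear. A possible alternative, offered only as a sketch, uses word boundaries (\b) and dict.fromkeys instead. The names `matches` and `unique_pkeys` are illustrative.

```python
import re

# The example string from the question, including "1029755,5445454".
text = ("Failure: Cannot retrieve market data: [Historical correlation for "
        "instruments 48021088, 1029755\n"
        "is older than 2M, Historical correlation for instruments "
        "48021088, 1029755,5445454 is older than 2M, "
        "Error while loading Structured Product market data")

# \b\d+\b matches a run of digits that is not glued to a letter, digit or
# underscore, so the 2 in "2M" is skipped while both numbers in
# "1029755,5445454" are found.
matches = re.findall(r"\b\d+\b", text)

# dict.fromkeys drops duplicates while keeping first-seen order
# (dicts preserve insertion order in Python 3.7+).
unique_pkeys = list(dict.fromkeys(matches))
print(*unique_pkeys)
```

The word-boundary pattern would also accept a standalone number that is not a pkey, so it may need tightening if the messages contain other bare numbers.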

ChrisOram