python split long output string based on pattern (no delimiter)

Question

I am new to Python and tried to read a PDF map using PyPDF2. I am getting this as output (sample output of PDF map as text). I want to extract from this output only the values that match a certain pattern, for example [RIY-DIRAHH-015524.49121946.651068]. The length of the pattern varies between 30 and 34 characters: it is fixed for the [RIY-DIRAHH-0155] part, and only the LAT/LONG part [24.49121946.651068] changes (complete output of PDF map).

Please help me extract/split the specific values that match the pattern. If there is any other way to read a PDF map, kindly advise. Thanks in advance.

import re
x = 'result of PDF map'  # text extracted from the PDF
result = re.search(r'\w{3}-\w{6}-\d*.\d*.\d*', x)
#output
['', '', '', '', '']

Sample Image of map.

Could you please add sample input (and expected output) as text? – Roy2012 Jun 17 '20 at 07:09
Does this answer your question? [How to extract text from a PDF file?](https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file) – Preston Hager Jun 17 '20 at 07:14
Please have a look: [Link](https://pastebin.com/xMg2wNEK). I want to extract only this; the text before and after it needs to be removed. – Imtiaz Ali Jun 17 '20 at 07:23

Roy2012 · Accepted Answer · 2020-06-17T09:17:54.517

Here's a regex that would work for you:

re.findall(r"RIY-[A-Z]{6}-\d{6}\.\d{8}\.\d{5,7}", text)

The result is:

['RIY-OUHOMH-100224.53476846.650127',
 'RIY-OUHOMH-100324.53282546.65039',
 'RIY-OUHOMH-100424.53224446.651758',
 'RIY-OUHOHH-100724.52902946.653571',
 'RIY-OUHOHH-100624.53007146.651934',
 'RIY-OUHOHH-100524.53178646.65279',
 'RIY-OUHOMH-100124.53597246.649456',
 'RIY-DIRAHH-015124.49540746.641877',
 'RIY-DIRAHH-015224.49410546.644253',
 'RIY-DIRAHH-015324.49267846.646789',
 'RIY-DIRAHH-015424.49144946.649107',
 'RIY-DIRAHH-015524.49121946.651068',
 'RIY-DIRAHH-015624.49343446.652505',
 'RIY-DIRAHH-015724.49563146.653924',
 ...
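
As a minimal sketch of how the pattern behaves, the snippet below runs it against a short made-up string. `text` here is a hypothetical stand-in for the PyPDF2 output, not the real map data. Two details matter: the dots are escaped (`\.`), because an unescaped `.` matches any character, and `re.findall` returns every match as a list, whereas `re.search` returns only the first match.

```python
import re

# Hypothetical stand-in for the text returned by PyPDF2; the real output is much longer.
text = (
    "Map legend 2020 RIY-DIRAHH-015524.49121946.651068"
    "RIY-OUHOMH-100324.53282546.65039 scale 1:5000"
)

# Literal "RIY-", six capital letters, "-", then digits.digits.digits with escaped dots.
pattern = re.compile(r"RIY-[A-Z]{6}-\d{6}\.\d{6,8}\.\d{5,7}")

matches = pattern.findall(text)
print(matches)
# ['RIY-DIRAHH-015524.49121946.651068', 'RIY-OUHOMH-100324.53282546.65039']
```

The surrounding words and numbers are ignored because they do not fit the `RIY-...` prefix, so no delimiter between the items is needed.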

edit

To separate this into several columns, the entire code would be:

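import re
import pandas as pd

# text holds the string extracted from the PDF
out = re.findall(r"RIY-[A-Z]{6}-\d{6}\.\d{6,8}\.\d{5,7}", text)
df = pd.DataFrame(out, columns=["RIY"])

# Fixed-width slices: ID, latitude, longitude
df["col1"] = df.RIY.str[0:15]
df["col2"] = df.RIY.str[15:24]
df["col3"] = df.RIY.str[24:]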

df would then look like:

                                 RIY             col1       col2       col3
0  RIY-OUHOMH-100224.53476846.650127  RIY-OUHOMH-1002  24.534768  46.650127
1   RIY-OUHOMH-100324.53282546.65039  RIY-OUHOMH-1003  24.532825   46.65039
2  RIY-OUHOMH-100424.53224446.651758  RIY-OUHOMH-1004  24.532244  46.651758
3  RIY-OUHOHH-100724.52902946.653571  RIY-OUHOHH-1007  24.529029  46.653571
4  RIY-OUHOHH-100624.53007146.651934  RIY-OUHOHH-1006  24.530071  46.651934
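
The fixed-width slices only line up when the latitude always has six decimals. As a hedged alternative, here is a sketch that splits with capture groups through `Series.str.extract` (note that `extract` lives on the `.str` accessor of a column, not on the DataFrame itself). It assumes the ID always ends in four digits and that both latitude and longitude have a two-digit integer part. The third row is a hypothetical item with a shorter latitude, added only to show the variable-length case.

```python
import pandas as pd

out = [
    "RIY-OUHOMH-100224.53476846.650127",
    "RIY-OUHOMH-100324.53282546.65039",
    "RIY-DIRAHH-015724.4956346.653924",  # hypothetical row with a shorter latitude
]
df = pd.DataFrame(out, columns=["RIY"])

# Named groups become column names. The lazy \d+? stops as soon as the rest
# of the string can be read as "two digits, dot, digits" (the longitude).
split_pattern = (
    r"^(?P<col1>RIY-[A-Z]{6}-\d{4})"
    r"(?P<col2>\d{2}\.\d+?)"
    r"(?P<col3>\d{2}\.\d+)$"
)
parts = df["RIY"].str.extract(split_pattern)
df = df.join(parts)
print(df)
```

Because each item contains exactly two dots, there is only one position where the longitude group can start, so the split does not depend on the number of decimals.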

edited Jun 17 '20 at 09:17

answered Jun 17 '20 at 07:27

Roy2012


Thanks a lot. Is it possible to separate 'RIY-OUHOMH-100124.53597246.649456' into 'RIY-OUHOMH-1001', '24.535972' and '46.649456'? – Imtiaz Ali Jun 17 '20 at 07:32
what do you mean by separating this? Break it into a tuple of (1) RIY, (2) OUHOMH, (3) 100124, etc? – Roy2012 Jun 17 '20 at 07:34
By separating I mean splitting each element of the output into 3 columns, e.g. from RIY-DIRAHH-015724.49563146.653924 to Col1 (RIY-DIRAHH-0157), Col2 (24.495631), Col3 (46.653924) – Imtiaz Ali Jun 17 '20 at 07:40
sure. If the text of a given item is 't', use: col1 = t[0:15] col2 = t[15:24] col3 = t[24:] – Roy2012 Jun 17 '20 at 07:59
I am trying to achieve this: from RIY-DIRAHH-015724.49563146.653924 to Col1 (RIY-DIRAHH-0157), Col2 (24.495631), Col3 (46.653924) `import re` `out = re.findall(r"RIY-[A-Z]{6}-\d{6}\.\d{6,8}\.\d{5,7}", text)` `import pandas as pd` `df=pd.DataFrame(out)` `df.set_axis(['Raw_Data'], axis=1, inplace=False)` `df.extract('("RIY-[A-Z]{6}") - ("\d{6}\.\d{6,8}\.\d{5,7}"+)', expand=True)` Error: AttributeError: 'DataFrame' object has no attribute 'extract' – Imtiaz Ali Jun 17 '20 at 08:27
Thanks again. This is happening for some rows in this data set. I think a fixed-length split only works reliably when the lengths are fixed. For the first part, the [0:15] length will not change, but for the rest the length varies. I want to try finding the second dot (.) and going back 2 characters (to avoid cutting off the 46), then splitting there. I spent some time searching but did not find anything. ` ".".join(aa.split(".", 2)[:2]) ` [Link](https://pastebin.com/4QeJiFW9) – Imtiaz Ali Jun 17 '20 at 16:51
Do you mind opening a new question about splitting the RIY column? SO doesn't encourage this kind of discussion that isn't part of the original question. Please tag me in a comment, and I promise to have a look ASAP. okay? – Roy2012 Jun 17 '20 at 16:54
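
The idea raised in the comments (find the second dot and step back two characters) can be sketched with plain string methods. `split_code` is a hypothetical helper name, and the sketch assumes that both latitude and longitude have a two-digit integer part and that each item contains exactly two dots. The second call uses a made-up item with a shorter latitude.

```python
def split_code(item: str) -> tuple:
    """Split 'RIY-XXXXXX-NNNN<lat><lon>' into (id, latitude, longitude)."""
    first_dot = item.index(".")
    second_dot = item.index(".", first_dot + 1)
    lat_start = first_dot - 2   # assumes a two-digit latitude integer part
    lon_start = second_dot - 2  # assumes a two-digit longitude integer part
    return item[:lat_start], item[lat_start:lon_start], item[lon_start:]


print(split_code("RIY-DIRAHH-015724.49563146.653924"))
# ('RIY-DIRAHH-0157', '24.495631', '46.653924')

# Hypothetical item with only five latitude decimals.
print(split_code("RIY-OUHOMH-100324.5328246.65039"))
# ('RIY-OUHOMH-1003', '24.53282', '46.65039')
```

With a list of matches, something like `[split_code(t) for t in out]` would give rows that can be passed to `pd.DataFrame` with three column names.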
