I am new to Python and tried to read a PDF map using PyPDF2. The output I get is the map's content as plain text (Sample output of pdf map as text). From this output I want to extract only the values that match a certain pattern, for example [RIY-DIRAHH-015524.49121946.651068]. The length of a match varies [30-34]: the prefix [RIY-DIRAHH-0155] has a fixed length, and only the LAT/LONG part [24.49121946.651068] changes in length. Complete output of PDF MAP.

Please help me extract/split the specific values that match the pattern. If there is a better way to read a PDF map, kindly advise. Thanks in advance.

import re
x = 'result of PDF map'
result = re.search('\w{3}-\w{6}-\d*.\d*.\d*',x)
#output
['', '', '', '', '']

Sample Image of map.

Imtiaz Ali

1 Answer

Here's a regex that would work for you:

re.findall(r"RIY-[A-Z]{6}-\d{6}\.\d{8}\.\d{5,7}", text)

The result is:

['RIY-OUHOMH-100224.53476846.650127',
 'RIY-OUHOMH-100324.53282546.65039',
 'RIY-OUHOMH-100424.53224446.651758',
 'RIY-OUHOHH-100724.52902946.653571',
 'RIY-OUHOHH-100624.53007146.651934',
 'RIY-OUHOHH-100524.53178646.65279',
 'RIY-OUHOMH-100124.53597246.649456',
 'RIY-DIRAHH-015124.49540746.641877',
 'RIY-DIRAHH-015224.49410546.644253',
 'RIY-DIRAHH-015324.49267846.646789',
 'RIY-DIRAHH-015424.49144946.649107',
 'RIY-DIRAHH-015524.49121946.651068',
 'RIY-DIRAHH-015624.49343446.652505',
 'RIY-DIRAHH-015724.49563146.653924',
 ...
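
As a minimal, self-contained sketch of the call above: `sample_text` is a made-up stand-in for the text extracted from the PDF (it is not the real map output), and the pattern is the one from the answer. The first `\d{6}` covers the 4-digit number after the second hyphen plus the two integer digits of the latitude, and each `\.` matches a literal dot.

```python
import re

# Pattern from the answer: prefix, six capital letters, then the digits.
PATTERN = re.compile(r"RIY-[A-Z]{6}-\d{6}\.\d{8}\.\d{5,7}")

# Hypothetical sample text; the real input is the text read from the PDF.
sample_text = (
    "header RIY-DIRAHH-015524.49121946.651068 noise 12.5 "
    "RIY-OUHOMH-100324.53282546.65039 RIY-TEST-12 footer"
)

# findall returns every non-overlapping match as a list of strings,
# whereas re.search returns only the first match (as a match object).
matches = PATTERN.findall(sample_text)
print(matches)
```

Text that only resembles the pattern, such as `RIY-TEST-12` in the sample, is skipped because it does not have six capital letters followed by the expected digits.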

Edit:

To separate this into several columns, the entire code would be:

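import re
import pandas as pd

# text is the string extracted from the PDF
out = re.findall(r"RIY-[A-Z]{6}-\d{6}\.\d{6,8}\.\d{5,7}", text)
df = pd.DataFrame(out, columns=["RIY"])

# fixed-width slices: code, latitude, longitude
df["col1"] = df.RIY.str[0:15]
df["col2"] = df.RIY.str[15:24]
df["col3"] = df.RIY.str[24:]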

df would then look like:

                                 RIY             col1       col2       col3
0  RIY-OUHOMH-100224.53476846.650127  RIY-OUHOMH-1002  24.534768  46.650127
1   RIY-OUHOMH-100324.53282546.65039  RIY-OUHOMH-1003  24.532825   46.65039
2  RIY-OUHOMH-100424.53224446.651758  RIY-OUHOMH-1004  24.532244  46.651758
3  RIY-OUHOHH-100724.52902946.653571  RIY-OUHOHH-1007  24.529029  46.653571
4  RIY-OUHOHH-100624.53007146.651934  RIY-OUHOHH-1006  24.530071  46.651934
Roy2012
  • Many thanks. Is it possible to separate 'RIY-OUHOMH-100124.53597246.649456' into 'RIY-OUHOMH-1001', '24.535972' and '46.649456'? – Imtiaz Ali Jun 17 '20 at 07:32
  • What do you mean by separating this? Break it into a tuple of (1) RIY, (2) OUHOMH, (3) 100124, etc.? – Roy2012 Jun 17 '20 at 07:34
  • By separating I mean that each element in the output should be split into 3 columns: from RIY-DIRAHH-015724.49563146.653924 to Col1 (RIY-DIRAHH-0157), Col2 (24.495631), Col3 (46.653924). – Imtiaz Ali Jun 17 '20 at 07:40
  • Sure. If the text of a given item is 't', use: col1 = t[0:15] col2 = t[15:24] col3 = t[24:] – Roy2012 Jun 17 '20 at 07:59
  • This is what I am trying in order to get from RIY-DIRAHH-015724.49563146.653924 to Col1 (RIY-DIRAHH-0157), Col2 (24.495631), Col3 (46.653924): `import re` `out = re.findall(r"RIY-[A-Z]{6}-\d{6}\.\d{6,8}\.\d{5,7}", text)` `import pandas as pd` `df=pd.DataFrame(out)` `df.set_axis(['Raw_Data'], axis=1, inplace=False)` `df.extract('("RIY-[A-Z]{6}") - ("\d{6}\.\d{6,8}\.\d{5,7}"+)', expand=True)` Error: AttributeError: 'DataFrame' object has no attribute 'extract' – Imtiaz Ali Jun 17 '20 at 08:27
  • Thanks again. The split goes wrong for some rows of this data set. I think a fixed-length split only works when every part has a fixed length. The first part, [0:15], never changes in length, but the length of the rest varies. I want to try finding the second dot (.), moving back 2 characters (to avoid cutting off the 46), and splitting there. I spent some time searching but did not find a way. ` ".".join(aa.split(".", 2)[:2]) ` [Link](https://pastebin.com/4QeJiFW9) – Imtiaz Ali Jun 17 '20 at 16:51
  • Do you mind opening a new question about splitting the RIY column? SO doesn't encourage this kind of discussion that isn't part of the original question. Please tag me in a comment, and I promise to have a look ASAP. okay? – Roy2012 Jun 17 '20 at 16:54
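
The variable-length split raised in the comments above could be sketched with capture groups instead of fixed slices. This is only one possible approach, not the answerer's method. It assumes the code part always ends in four digits and that both latitude and longitude have a two-digit integer part (24 and 46 in the samples); `SPLIT` and `split_item` are hypothetical names introduced here.

```python
import re

# Group 1: code (prefix plus 4 digits). Group 2: latitude. Group 3: longitude.
# Assumption: latitude and longitude each start with exactly two integer digits.
SPLIT = re.compile(r"(RIY-[A-Z]{6}-\d{4})(\d{2}\.\d+)(\d{2}\.\d+)")

def split_item(item):
    """Return (code, latitude, longitude) as strings, or None if no match."""
    m = SPLIT.fullmatch(item)
    if m is None:
        return None
    return m.groups()

print(split_item("RIY-DIRAHH-015724.49563146.653924"))
print(split_item("RIY-OUHOMH-100324.53282546.65039"))
```

The greedy `\d+` in the latitude group backtracks until two digits and a dot remain for the longitude, which amounts to "split two characters before the second dot". In pandas, the same pattern can be passed to `Series.str.extract` (a string method on a column, not on the DataFrame itself, which is why `df.extract` raised the AttributeError).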