I'm trying to prefill a web form from PDF data, and I'm stuck on how regex match objects and dictionaries fit together.

"abbreviated" code:

import PyPDF2, regex, urllib.parse, webbrowser
#using regex instead of re as I was nesting lookarounds, but might not need to do this anymore.

### Define the field ids with sensible names

entering_email = 'field45865550'
uid_number = 'field45865570'
fname = 'field45865574-first'
lname = 'field45865574-last'
add_1 = 'field45865578-address'
city = 'field45865578-city'
state = 'field45865578-state'
zip = 'field45865578-zip'

Skipping over the PyPDF2 part, as this seems fine. Please forgive my naming conventions.

### Open text file, search for field contents, and define them
with open (pdffile+'-text.txt', 'r') as text_file:
    text = text_file.read() 
    entering_email_value = regex.search(r'(?<=Email:\|)(.*?)(?=\|)(?=.*\|Manager Information:)', text) or ["---"]
    uid_number_value = regex.search(r'(?<=UID Number:\|)(.*?)(?=\|)', text) or ["---"]
    fname_value = regex.search(r'(?<=First Name:\|)(.*?)(?=\|)', text) or ["---"]
    lname_value = regex.search(r'(?<=Last Name:\|)(.*?)(?=\|)', text) or ["---"]
    add_1_value = regex.search(r'(?<=Last Name:\|.*)(?<=Address:\|)(.*?)(?=\|)(?=.*Employee Information:)', text) or ["---"]
    city_value = regex.search(r'(?<=Last Name:\|.*)(?<=City & State:\|)(.*?)(?=,)(?=.*Employee Information:)', text) or ["---"]
    state_value = regex.search(r'(?<=Last Name:\|.*)(?<=, )(.*?)(?= )(?=.*Employee Information:)', text) or ["---"]
    zip_value = regex.search(r'(?<=Last Name:\|.*)(?<=[A-Z][A-Z] )(.*?)(?=\|)(?=.*Employee Information:)', text) or ["---"]

getVars = {entering_email: entering_email_value.group(),
            uid_number: uid_number_value.group(),
            fname: fname_value.group(),
            lname: lname_value.group(),
            add_1: add_1_value.group(),
            city: city_value.group(),
            state: state_value.group(),
            zip: zip_value.group()
            }

webbrowser.open(url + urllib.parse.urlencode(getVars), new=0, autoraise=True)

The regex syntax might look weird, but it works fine. I'm replacing "\n" with "|" because I didn't know about the DOTALL flag. My issue is that the "or" fallbacks appended to the regex.search calls seem to be ignored. The source files will regularly be missing info, so I want to fill in "" or "---" by default wherever there's no match. I'm currently looking into list comprehensions to do this.

Basically, my question is am I doing this "right-ish"? I'm sure it's hacky.

My other question, which I'm trying to run down now: is a list comprehension the right way to replace None with ""? Any help with the structure and syntax would be appreciated. It seems like I should be able to roll it into the dictionary declaration.
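
One way the default could be rolled into the dictionary declaration is sketched below. Note that `regex.search(...) or ["---"]` is not actually ignored: when there is no match it evaluates to the list `["---"]`, and a list has no `.group()` method, which is where the error comes from. The sketch uses a dict comprehension rather than a list comprehension, with the standard-library `re` module standing in for `regex` (these particular patterns only use fixed-width lookbehinds, so either module should accept them). The helper name `group_or_default`, the `patterns` mapping, and the sample text are illustrative assumptions, not part of the original script.

```python
import re

def group_or_default(match, default=""):
    # A Match object is always truthy and a failed search returns None,
    # so this hands back the matched text or the default string.
    return match.group() if match else default

# Hypothetical sample text in the "|"-delimited form described above;
# it deliberately has no "Last Name:" entry.
text = "UID Number:|12345|First Name:|Ada|"

# Map each form field id to the pattern that extracts its value.
patterns = {
    'field45865570': r'(?<=UID Number:\|)(.*?)(?=\|)',
    'field45865574-first': r'(?<=First Name:\|)(.*?)(?=\|)',
    'field45865574-last': r'(?<=Last Name:\|)(.*?)(?=\|)',
}

# Build the query parameters in one pass; missing fields become "".
getVars = {field: group_or_default(re.search(pattern, text))
           for field, pattern in patterns.items()}

print(getVars)
```

Keeping the patterns in a mapping keyed by field id means a new field only needs one new entry, and the fallback logic lives in a single place.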

Matt
  • Please explain your question as clearly as possible. It also helps to ask specific questions. – troymyname00 Feb 23 '20 at 04:31
  • Why are you escaping | ? Your regex expressions look strange to me. I'm not sure they're doing exactly what you want them to do. – Todd Feb 23 '20 at 04:35
  • For instance, taking one at random `(?=\|)` is saying that you want text preceding this to match if followed by this.. and what this will match is a '|' symbol. – Todd Feb 23 '20 at 04:40
  • Sorry I wasn't clear - I think the OR statements are getting ignored: when there is no match, it throws the exception "'list' object has no attribute 'group'". I'd like to be able to just assign "" or " " instead. – Matt Feb 23 '20 at 18:38
  • I'm escaping | because I replaced \n with | before I knew about using the DOTALL flag - so it's sort of a delimiter now. There's a huge learning curve happening over here. – Matt Feb 23 '20 at 18:41
  • One more item to note - the PyPDF2 extractText gets the job done, but the text comes out in the wrong order (consistently, at least). The source data is always in the same weird order when converted, which might explain why I'm bounding matches with markers that don't make much sense contextually, e.g. between "Last Name:" and "Employee Information:". – Matt Feb 23 '20 at 18:52

1 Answer

Of course this has been answered before, in "return string with first match Regex".

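It looks like it was just a matter of embedding the '' default in my statements by appending "|$" to each pattern. When nothing else matches, the search falls through to the empty string at the end of the text, so .group() returns '' instead of failing.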
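
A minimal sketch of the "|$" approach, assuming sample text in the "|"-delimited form from the question. It uses the standard-library `re` module so that it runs without extra installs. The helper name `first_match` and the sample text are hypothetical.

```python
import re

# Hypothetical sample text; it has a first name but no UID number.
text = "First Name:|Ada|Last Name:|Lovelace|"

def first_match(pattern, text):
    # "|$" adds a last-resort alternative that matches the empty string
    # at the end of the text, so search() never returns None here and
    # .group() is always safe to call.
    return re.search(pattern + r'|$', text).group()

fname_value = first_match(r'(?<=First Name:\|)(.*?)(?=\|)', text)
uid_number_value = first_match(r'(?<=UID Number:\|)(.*?)(?=\|)', text)

print(repr(fname_value))       # 'Ada'
print(repr(uid_number_value))  # ''
```

One trade-off of this approach is that a field that is missing and a field that is present but blank both come back as '', which is fine for prefilling a form but hides the difference if it ever matters.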

Matt