-2

Let's say I have successfully retrieved this text and assigned it to a variable named textToModify:

textToModify = "
abcde abcde
Title: Director, lorem company
                    Phone: 123.647.4555                 
Mobile: 123.123.1234                    E-mail: try1@umich.edu                  Assistant: my name                  Assistant Phone: 667.889.9910

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Linkedin: www.linkedin.com/in/lorem-ipsum/
Twitter: www.twitter.com/ipsum
"

Now I want to extract the title, name, phone number, LinkedIn, Twitter and other important info from this text. Is there a library that does this, or do you have any idea how to approach it? Assume that the formatting of the text is random, but the word "Title" always appears right before the title itself, the word "Phone" right before the phone number, and so on.

My initial thoughts:

The nltk library won't work, because it basically tags individual words with identifiers. The problem is that this text is not split into words but is a single string of characters: if you access textToModify[20], for example, it just returns one character.

My other thought: what if I open the links, take a screenshot of each one, and then use an image-to-text library in Python (if one exists) and go from there?

Thank you!

Jessica Rodriguez
    This sounds like an [X-Y problem](http://xyproblem.info/). Instead of asking for help with your solution to the problem, edit your question and ask about the actual problem. What are you trying to do? – undetected Selenium Nov 15 '18 at 07:49

2 Answers

2

If you have the text in a variable, you can use Python's re module to match the fields with regular expressions.

This SO post addresses phone numbers

This webpage shows you a step-by-step for detecting emails

For names and addresses, unless they are preceded by Name: or Address: or you can apply some logic to finding it, you may have a harder time than you previously thought. This SO post gives an example for trying to match addresses

Hope this helps. I thought about writing a full answer, but the regex resources on SO and the rest of the web are fairly abundant.
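As a rough illustration of the regex approach, here is a minimal sketch. The sample string is a shortened copy of the text in the question, and both patterns are assumptions tuned to that sample (dot-separated phone numbers, a deliberately simple email shape); they are not general-purpose validators.

```python
import re

# Shortened sample modeled on the text in the question
text = """Title: Director, lorem company
                    Phone: 123.647.4555
Mobile: 123.123.1234                    E-mail: try1@umich.edu"""

# Assumption: phone numbers use the 123.456.7890 style seen in the sample
phone_pattern = re.compile(r"\b\d{3}\.\d{3}\.\d{4}\b")

# Assumption: a simple email shape, not a full RFC 5322 validator
email_pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

phones = phone_pattern.findall(text)
emails = email_pattern.findall(text)

print(phones)
print(emails)
```

Note that this finds values by their shape and ignores the labels, so it cannot tell "Phone" from "Mobile" on its own.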

robotHamster
0

A program like this would do what you want:

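finds = {}
# Split on whitespace so each word becomes one token
tokens = textToModify.split()
for i, token in enumerate(tokens):
    # The value is assumed to be the token right after its label
    if token == 'Title:' and i + 1 < len(tokens):
        finds['title'] = tokens[i + 1]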

However, you would need an if branch for every field you want to extract, and you would need to take the next two (or more) tokens for values that span several words, such as names.
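One way to avoid a separate branch per field and to handle multi-word values is to search for all known labels at once and take the text up to the next label or the end of the line. The sketch below is only one possible approach; the LABELS list and the extract_fields helper are hypothetical names, and it assumes every label is followed by a colon and that values never contain another label.

```python
import re

# Hypothetical list of labels, taken from the sample text in the question
LABELS = ["Title", "Phone", "Mobile", "E-mail",
          "Assistant Phone", "Assistant", "Linkedin", "Twitter"]

def extract_fields(text, labels=LABELS):
    """Map each label found in text to the value that follows it."""
    # Longest labels first, so "Assistant Phone" is tried before "Assistant"
    alternation = "|".join(
        re.escape(label) for label in sorted(labels, key=len, reverse=True)
    )
    label_re = re.compile(rf"\b({alternation}):")
    matches = list(label_re.finditer(text))
    fields = {}
    for i, m in enumerate(matches):
        # The value runs to the next label, or to the end of the text
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # Keep only the rest of the current line, without padding
        value = text[m.end():end].split("\n", 1)[0].strip()
        # Keep the first occurrence if a label appears more than once
        fields.setdefault(m.group(1), value)
    return fields

sample = (
    "Title: Director, lorem company\n"
    "                    Phone: 123.647.4555                 \n"
    "Mobile: 123.123.1234                    E-mail: try1@umich.edu"
    "                  Assistant: my name"
    "                  Assistant Phone: 667.889.9910\n"
    "\n"
    "Linkedin: www.linkedin.com/in/lorem-ipsum/\n"
    "Twitter: www.twitter.com/ipsum\n"
)

fields = extract_fields(sample)
for label, value in fields.items():
    print(f"{label}: {value}")
```

A label that also occurs inside ordinary prose (for example a sentence containing "Phone:") would produce a false match, so the label list may need tightening for real pages.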

hhaefliger