Let's say I have successfully retrieved this text and assigned it to a variable named textToModify:
textToModify = """
abcde abcde
Title: Director, lorem company
Phone: 123.647.4555
Mobile: 123.123.1234 E-mail: try1@umich.edu Assistant: my name Assistant Phone: 667.889.9910
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Linkedin: www.linkedin.com/in/lorem-ipsum/
Twitter: www.twitter.com/ipsum
"""
Now I want to extract the title, name, phone number, LinkedIn, Twitter and other important info from this text. Is there a library that does this, or do you have any ideas on how to approach it? Assume that the formatting of the text is unpredictable, but the word "Title" always appears right before the title itself, the word "Phone" always appears right before the phone number, and so on.
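To make the "label is always next to its value" assumption concrete, here is a rough sketch of what I imagine such an extraction could look like with the standard re module. The label list, the shortened sample string and the function name extract_fields are my own assumptions for illustration, not something I know to be the right approach. It also does not capture the name, since that line has no label.

```python
import re

# Hypothetical, shortened sample in the same shape as textToModify.
sample = """
abcde abcde
Title: Director, lorem company
Phone: 123.647.4555
Mobile: 123.123.1234 E-mail: try1@umich.edu Assistant: my name Assistant Phone: 667.889.9910
Lorem Ipsum is simply dummy text.
Linkedin: www.linkedin.com/in/lorem-ipsum/
Twitter: www.twitter.com/ipsum
"""

# Assumed set of labels. Longer labels come first so that
# "Assistant Phone" is tried before "Assistant".
LABELS = ["Assistant Phone", "Assistant", "Title", "Phone",
          "Mobile", "E-mail", "Linkedin", "Twitter"]


def extract_fields(text, labels=LABELS):
    """Return a dict mapping each label found in text to the value after it."""
    alternation = "|".join(re.escape(label) for label in labels)
    # A label and a colon, then a value that runs lazily up to
    # the next "label:" or the end of the line, whichever comes first.
    pattern = re.compile(
        rf"\b({alternation}):\s*(.*?)\s*(?=(?:{alternation}):|$)",
        re.MULTILINE,
    )
    return {label: value for label, value in pattern.findall(text)}


fields = extract_fields(sample)
```

With this sample, fields["Assistant Phone"] and fields["Phone"] come out as separate entries, which is the case I was most worried about.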
My initial thoughts: I don't think the nltk library will work, because it basically tags words with identifiers. The problem is that this text is not split into words; it is a single string of characters. For example, textToModify[20] just returns one character.
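To illustrate what I mean by characters versus words, here is a small sketch using a made-up one-line sample. Indexing the string gives a single character, while the built-in str.split() gives a list of whitespace-separated words.

```python
# Hypothetical one-line sample, not the full text.
text = "Title: Director, lorem company"

# Indexing a str returns a single character.
first_char = text[0]

# str.split() with no arguments splits on any run of whitespace.
words = text.split()
```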
My other thought is to open the links, take a screenshot of each page, and then run a picture-to-text (OCR) library in Python on the screenshots, if such a library exists, and go from there.
Thank you!