Extracting data from text (in CSV file) with Python

Question

I'm helping a dog rescue group analyzing their incoming application for adoption. All the applications come in through an online system, and each application is given an automatically generated Form ID. The applications will then be assigned to different volunteers to process.

Most of the information is straightforward and I can use pandas to process with ease. Part of the analysis is about the processing time for each application. That means from the date the application is created (form status as "Submitted") to a dog is adopted (form status as "Adopted"). When I exported the form data, the status changes and the general comments are mixed under one header called "Comments (inline)."

Here is an abridged example of what the status changes/comments look like for one application.

The text follows some basic patterns.

General comment: CURRENT_PERSON(first name) wrote on DATE: text

Change in volunteers: CURRENT_PERSON(first name) wrote on DATE: Assigned form to NEW_PERSON(first last)

Form status change: CURRENT_PERSON(first name) wrote on DATE: Changed Status from CURRENT_STATUS to NEW_STATUS

I'm new to python (~3mos). The first thing that came to mind is using python and regular expression to parse the text, extract the data, and separate them into two groups (one for general comments and one for status changes). This will take some time since I'm still learning but it seems doable. And the end result will be something like this.

The same status can be assigned multiple times so I'd need to give them a number. Then I can calculate the processing time as the days between Adopted-01 and Submitted-01.

However, from what I've learned so far, it seems that using this type of iteration (for loop) is slow and not recommended. I have about 1500 forms so far and the number will only go up.

Should I go ahead with python and re? or is there a better way to get what I described? I strongly believe there is a much better way I just don't know enough yet.

Suggestions are greatly appreciated.

UPDATE

Sample data here in CSV file, here in XML file

You'll find four names in there: Jenny White, Rose Burk, Kerry James, Henry Woods.

I realize that the text CSV file is not as clean as I thought it would be. There is no space in between the comments or the status change. Even though there are patterns, it's not always consistent(first name only vs full name). I updated the patterns. With exporting into CSV, you'll see the following

Henry wrote on 9/22/2020: Assigned form to Rose Burk Rose wrote on 9/22/2020: Sent intro email.

became

Henry wrote on 9/22/2020: Assigned form to Rose BurkRose wrote on 9/22/2020: Sent intro email.

Now I need to look at cleaning up the data much more as well. Thanks for taking a look.

Could you provide an example of 2-3 jobs worth of comments/status changes so we can best assist with the issue. - Note: In text and not images! :) — PacketLoss, Sep 29 '20 at 00:21
please check out: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — David Erickson, Sep 29 '20 at 00:31
Hi all, please see the link for the csv file. I can also export into XLS or XML. I was working in Jupyter, CSV seems like an easy format to work with. — fuzzcats, Sep 29 '20 at 04:01
Is the XML format any better? That might make it easier to identify/parse some of the data. Is it possible to make any changes upstream to improve the format? — AMC, Sep 30 '20 at 22:31
I think I'm in way over my head! Maybe there is a reason why the status change is not automatically tracked in the system because it is messy. :P The XML looks nicer as each entry is separated by an empty line (when I opened it with Atom). https://www.dropbox.com/s/sufwvhl7lifhlob/stack_samples0928.xml?dl=0 — fuzzcats, Oct 01 '20 at 17:06

score 1 · Accepted Answer · answered Oct 06 '20 at 21:04

So I figured out how to use the regular expressions to process the comments based on the patterns.

This following one can grab all different dates and all statuses based on the fact that they all have the section "Changed Status from" and I can match the individual group to whether that's an old status or a new status.

regex = r"(\d{1,2}\/\d{1,2}\/\d{4})\:\sChanged\sStatus\sfrom\s([A-Z][a-z]+(\s[a-z]+)?(\s[A-Z][a-z]+)?)\sto\s([A-Z][a-z]+(\s[a-z]+)?(\s[A-Z][a-z]+)?)"

And this one can grab just the date I need if I only focus on Adopted.

regex = r"(\d{1,2}\/\d{1,2}\/\d{4})\:\sChanged\sStatus\sfrom\s([A-Z][a-z]+(\s[a-z]+)?(\s[A-Z][a-z]+)?)\sto\sAdopted(?!\sElsewhere)

Also, I do not need to use the for loop. In the particular case focusing on the date for Adopted, I can just use the following to add the adopted date to my dataframe in Jupyter notebook.

df['Adopted']=df['Comments (inline)'].str.extract(r'(\d{1,2}\/\d{1,2}\/\d{4})\:\sChanged\sStatus\sfrom\s[A-Z][a-z]+(\s[a-z]+)?(\s[A-Z][a-z]+)?\sto\sAdopted(?!\sElsewhere)')

This also means that I can just use the original CSV file.

Extracting data from text (in CSV file) with Python

UPDATE

1 Answers1