Removing characters from docx file

Question

I have a really big docx file(700 pages) it has a log format

[15/09/2014, 15:30:21] Stijn: Nice

I am looking to remove the time and let it look like this

[15/09/2014] Stijn: Nice

Im pretty sure that this can be done in python but just haven't figured out the exact way. I should be using something like this?

line.replace(char,'')

Its a whatsapp log file little it looks like this(Some text use 2 lines)

[15/09/2014, 15:53:39] Dylan: Beste selfie ever 
[15/09/2014, 15:53:52] Sipke: Ja 
[15/09/2014, 15:54:05] ‎You changed this group's icon

Help would be apreciated :)

I am not sure how your data is set up. If what you have in square brackets is a list on its own you can use .pop() to remove the last item from each list. — Julian Silvestri, Sep 26 '18 at 19:17
I think the quickest would be to zip open the docx and change the main xml file with a pattern replace. — Anton vBR, Sep 26 '18 at 19:18
.docx files are saved in a proprietary format that is not easy to change outside of word (try opening in notepad to see what i mean), you will have a lot of trouble using python to parse them. Is there any way you can use another format or even .txt files instead? — Njord, Sep 26 '18 at 19:20
My bad, I try to avoid working with .docx and the like, maybe this package could do what youre looking for: https://github.com/python-openxml/python-docx — Njord, Sep 26 '18 at 19:28
Are you asking about how to edit each line? How about `line = line[:11] + line[22:]` or similar — Nearoo, Sep 26 '18 at 19:46
Your question is very imprecise. Please tell us what exactly you're looking for. Is it how to open a word file? Is it how to change each line to your format? What aspect do you need help with? — Nearoo, Sep 26 '18 at 19:47

score 1 · Accepted Answer · answered Sep 26 '18 at 20:06

1

If you know how to use regular expressions, this can be done very easily. You want to:

1) Read the file line by line

2) Substitute the time stamp with blank text.

Here's a sample python code I whipped up for you:

#!/usr/bin/python
import re

text = "[15/09/2014, 15:30:21] Stijn: Nice"

# Capture time stamp and substitute it with blank
new = re.sub(r'(, [0-9]{2}:[0-9]{2}:[0-9]{2})', "", text)    
print new

This will yield:

[15/09/2014] Stijn: Nice

If you want to fiddle with/understand the usage of regex expression I used here, follow this link- https://regexr.com/406sc

answered Sep 26 '18 at 20:06

The Next Programmer

83
8

Oh and also @Barkork, (depending on your data) you will have to modify the regex to make sure it doesn't perform a _greedy match_. In other words if a line item looks like: `[15/09/2014, 15:30:21] Stijn: Nice, I timed my lap, 02:01:99 and you?` the current pattern might remove ", 02:01:99" because it looked for every instance of that pattern in the given line. – The Next Programmer Sep 26 '18 at 20:14
`.docx` isn't just plain text, it's a zipped set of `.xml` files. There's more to this question than you realise. – Peter Wood Sep 27 '18 at 08:45
@PeterWood I am aware about the docx format. However, it is not that tough to export a docx to a plain text file. Since in this case the data is just a chat log, it would seem that retaining the text formatting after clean up is redundant. But I guess I should have mentioned that in the answer as well, and not assume things. Some may find this link useful as well: https://stackoverflow.com/a/35871416/4701044 – The Next Programmer Sep 27 '18 at 14:34

Removing characters from docx file

1 Answers1