-1

I have a really big docx file(700 pages) it has a log format

[15/09/2014, 15:30:21] Stijn: Nice

I am looking to remove the time and let it look like this

[15/09/2014] Stijn: Nice

Im pretty sure that this can be done in python but just haven't figured out the exact way. I should be using something like this?

line.replace(char,'')

Its a whatsapp log file little it looks like this(Some text use 2 lines)

[15/09/2014, 15:53:39] Dylan: Beste selfie ever 
[15/09/2014, 15:53:52] Sipke: Ja 
[15/09/2014, 15:54:05] ‎You changed this group's icon

Help would be apreciated :)

Barkork
  • 55
  • 5
  • I am not sure how your data is set up. If what you have in square brackets is a list on its own you can use .pop() to remove the last item from each list. – Julian Silvestri Sep 26 '18 at 19:17
  • I think the quickest would be to zip open the docx and change the main xml file with a pattern replace. – Anton vBR Sep 26 '18 at 19:18
  • .docx files are saved in a proprietary format that is not easy to change outside of word (try opening in notepad to see what i mean), you will have a lot of trouble using python to parse them. Is there any way you can use another format or even .txt files instead? – Njord Sep 26 '18 at 19:20
  • @Njord `.docx` is `XML`, hence the `x`. – Peter Wood Sep 26 '18 at 19:22
  • My bad, I try to avoid working with .docx and the like, maybe this package could do what youre looking for: https://github.com/python-openxml/python-docx – Njord Sep 26 '18 at 19:28
  • Are you asking about how to edit each line? How about `line = line[:11] + line[22:]` or similar – Nearoo Sep 26 '18 at 19:46
  • Your question is very imprecise. Please tell us what exactly you're looking for. Is it how to open a word file? Is it how to change each line to your format? What aspect do you need help with? – Nearoo Sep 26 '18 at 19:47

1 Answers1

1

If you know how to use regular expressions, this can be done very easily. You want to:

1) Read the file line by line

2) Substitute the time stamp with blank text.

Here's a sample python code I whipped up for you:

#!/usr/bin/python
import re

text = "[15/09/2014, 15:30:21] Stijn: Nice"

# Capture time stamp and substitute it with blank
new = re.sub(r'(, [0-9]{2}:[0-9]{2}:[0-9]{2})', "", text)    
print new

This will yield:

[15/09/2014] Stijn: Nice

If you want to fiddle with/understand the usage of regex expression I used here, follow this link- https://regexr.com/406sc

  • Oh and also @Barkork, (depending on your data) you will have to modify the regex to make sure it doesn't perform a _greedy match_. In other words if a line item looks like: `[15/09/2014, 15:30:21] Stijn: Nice, I timed my lap, 02:01:99 and you?` the current pattern might remove ", 02:01:99" because it looked for every instance of that pattern in the given line. – The Next Programmer Sep 26 '18 at 20:14
  • `.docx` isn't just plain text, it's a zipped set of `.xml` files. There's more to this question than you realise. – Peter Wood Sep 27 '18 at 08:45
  • @PeterWood I am aware about the docx format. However, it is not that tough to export a docx to a plain text file. Since in this case the data is just a chat log, it would seem that retaining the text formatting after clean up is redundant. But I guess I should have mentioned that in the answer as well, and not assume things. Some may find this link useful as well: https://stackoverflow.com/a/35871416/4701044 – The Next Programmer Sep 27 '18 at 14:34