Extracting sentences from an XML file using a script in Linux

Asked Oct 30 '20 at 12:49

Active Oct 30 '20 at 13:11


I am trying to write a script named letter.sh that unzips a Word file, extracts the text and images, and saves them in a directory. I think I managed to unzip the file and extract the images, but I am struggling to extract the text from the document.xml file.

The sentences I want to extract are in this format:

<w:t>text</w:t>

I have tried using grep, but it doesn't work.

grep "<w:t>*</w:t>" ~word/document.xml < touch letter.txt

I would appreciate it if anyone could guide me onto the right path.

Thank you.

edited Oct 30 '20 at 13:11

Cyrus

asked Oct 30 '20 at 12:49

robe320

grep isn't a good choice for parsing XML, but in your attempt you were missing a `.` before the `*` (also, use single quotes instead of double quotes; see https://mywiki.wooledge.org/Quotes) – Sundeep Oct 30 '20 at 12:51
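
A minimal sketch of the grep fix suggested in this comment, assuming GNU grep and a sed that supports `-E`. The function name `extract_wt_grep` is hypothetical. The pattern uses `[^<]*` rather than `.*` because document.xml may be one long line, where a greedy `.*` would span from the first `<w:t>` to the last `</w:t>`. This is a quick approximation rather than a real XML parse: it does not decode entities such as `&amp;`.

```shell
# Hypothetical helper: print the text of every <w:t> run, one per line.
# Assumes GNU grep (-o) and that the run text itself contains no '<'.
extract_wt_grep() {
    # Single quotes keep the shell from interpreting the pattern.
    # ( [^>]*)? allows attributes such as xml:space="preserve".
    grep -oE '<w:t( [^>]*)?>[^<]*</w:t>' "$1" |
        # Strip the opening and closing tags, leaving only the text.
        sed -E -e 's/^<w:t( [^>]*)?>//' -e 's/<\/w:t>$//'
}
```

For example, `extract_wt_grep word/document.xml > letter.txt` would write the result to letter.txt, assuming the archive was unzipped into the current directory. Output redirection uses `>`, and the shell creates the file, so no `touch` is needed.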

[Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus Oct 30 '20 at 13:12
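
A sketch of the parser-based approach this comment recommends, assuming xmlstarlet is installed. The function name `extract_wt_xml` and the variable `W_NS` are hypothetical; the value of `W_NS` is the WordprocessingML namespace URI that the `w:` prefix is bound to in document.xml.

```shell
# Hypothetical helper: print the text of each <w:t> element, one per line.
# Assumes xmlstarlet is installed and on PATH.
W_NS='http://schemas.openxmlformats.org/wordprocessingml/2006/main'

extract_wt_xml() {
    # -T requests plain-text output, -N binds the w: prefix to its namespace,
    # -m iterates over every //w:t match, -v . prints the element's text,
    # and -n emits a newline after each match.
    xmlstarlet sel -T -N "w=$W_NS" -t -m '//w:t' -v . -n "$1"
}
```

Called as `extract_wt_xml word/document.xml > letter.txt`, this would collect the text runs into letter.txt. Unlike a regex, it matches on the element's namespace rather than its spelling, so attributes on `<w:t>` do not affect it.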

0 Answers