How to extract words from bytes using regex in python?

Question

I have a bytes object:

b'\n\x1b\t\xff\xff\xff\x7f@^\x8a?\x11\x00\x00\x00@\xe8HL\xbf\x19\x00\x00\x00\x00\x95\xb0\xd9?\x127\r\xc9\xd5"=\x15\xc9\xd5"=\x1a+\x1a)\n\x1e\x12\x1c\n\x0fMovies"\x07Bollard0\x01\x11\x00\x00\x00\x00\x00\x00\xf0?' b'\n\x1b\t\x01\x00\x00\x00\xa4\x9b\xb0\xbf\x11\x01\x00\x00\xc0/\xe3\x90?\x19\x01\x00\x00\xa0U\xc4\xef?\x127\r|\x934=\x15|\x934=\x1a+\x1a)\n\x1e\x12\x1c\n\x0fMovies"\x07TV Series0\x01\x11\x00\x00\x00\x00\x00\x00\xf0?'

Using regex, I want to extract the words (in this case "Movies", "Movies" and "TV Series").

What I tried:

Extract word from string Using python regex

Extracting words from a string, removing punctuation and returning a list with separated words

Python regex for finding all words in a string

It is not clear what you are doing and why you expect just `Movies` and `TV Series`. Please show your code and explain what does not work. — Wiktor Stribiżew, Jul 01 '20 at 09:12

score 0 · Accepted Answer · answered Jul 01 '20 at 09:25

Usually you would convert bytes into a string using the .decode() method. However, your bytes contain values such as \xff that are not valid ASCII or UTF-8, so decoding with either codec raises an error.
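To illustrate the decoding problem, here is a minimal sketch on a shortened, made-up sample built from a few of the bytes in the question (the variable names are only for illustration). It shows the strict UTF-8 decode failing and a lossy ASCII decode that silently drops the offending bytes:

```python
# Shortened sample (assumption: a few bytes picked from the question's data).
raw = b'\xff\xff\x7fMovies"\x07Bollard0\x01'

failed = False
try:
    raw.decode("utf-8")  # strict decoding rejects the 0xff bytes
except UnicodeDecodeError as exc:
    failed = True
    print("decode failed:", exc.reason)

# errors="ignore" drops every byte that is not valid ASCII.
lossy = raw.decode("ascii", errors="ignore")
print(repr(lossy))
```

The lossy variant still keeps ASCII control characters such as \x07, so some filtering is needed afterwards either way.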

My suggestion is to go through each byte and turn it into the character with the same code point using chr(). For values below 128 that is the ASCII character:

raw= b'\n\x1b\t\xff\xff\xff\x7f@^\x8a?\x11\x00\x00\x00@\xe8HL\xbf\x19\x00\x00\x00\x00\x95\xb0\xd9?\x127\r\xc9\xd5"=\x15\xc9\xd5"=\x1a+\x1a)\n\x1e\x12\x1c\n\x0fMovies"\x07Bollard0\x01\x11\x00\x00\x00\x00\x00\x00\xf0?' b'\n\x1b\t\x01\x00\x00\x00\xa4\x9b\xb0\xbf\x11\x01\x00\x00\xc0/\xe3\x90?\x19\x01\x00\x00\xa0U\xc4\xef?\x127\r|\x934=\x15|\x934=\x1a+\x1a)\n\x1e\x12\x1c\n\x0fMovies"\x07TV Series0\x01\x11\x00\x00\x00\x00\x00\x00\xf0?'
# Map every byte (0-255) to the character with the same code point.
string = ""
for b in raw:
    string += chr(b)
print(string)
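As a side note, the loop above should give the same result as decoding with the latin-1 codec, which maps each byte value to the code point of the same number. A small sketch on a shortened, made-up sample (not the full data from the question):

```python
# Shortened sample (assumption: a few bytes picked from the question's data).
raw = b'\xff\x7f@\xe8HL\xbfMovies"\x07Bollard0\x01'

# Byte-by-byte conversion, as in the loop above.
looped = ""
for b in raw:
    looped += chr(b)

# latin-1 never fails: every byte value 0-255 has a character.
decoded = raw.decode("latin-1")

print(looped == decoded)
```

The one-line decode avoids building the string character by character, which matters for large inputs.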

After that, you can use a regex to find words. It's usually a good idea to require a minimum word length, so that short runs of letters that come from random bytes are filtered out.

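import re
# Raw string, so that \W reaches the regex engine unchanged.
for word in re.split(r'\W', string):
    if len(word) > 3:  # keep only words of 4 or more characters
        print(word)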

That will give you:

Movies
Bollard0
Movies
Series0

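You did not mention "Bollard" in your expected output, but I assume that was an oversight. The trailing "0" in "Bollard0" and "Series0" is the byte 0x30 that directly follows the text in the raw data.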

If you want spaces to be part of a word, you'll need to adapt the regex. \W matches any non-word character, and a space is one of them, so "TV Series" is split into "TV" (which the length filter then drops) and "Series0".
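One possible adaptation, as a sketch rather than a definitive solution: search the bytes directly with a bytes pattern that accepts runs of ASCII letters joined by single spaces. The sample below is a shortened excerpt of the data from the question, and the names WORD and MIN_LENGTH are made up for this example.

```python
import re

# Shortened excerpt of the bytes from the question (assumption: the full
# value only adds more non-text bytes around the same words).
raw = (
    b'\x1a)\n\x1e\x12\x1c\n\x0fMovies"\x07Bollard0\x01\x11\x00\xf0?'
    b'\x1a)\n\x1e\x12\x1c\n\x0fMovies"\x07TV Series0\x01\x11\x00\xf0?'
)

# Runs of ASCII letters, optionally joined by single spaces.
WORD = re.compile(rb'[A-Za-z]+(?: [A-Za-z]+)*')

MIN_LENGTH = 4  # hypothetical threshold; tune it for your data

words = [m.decode("ascii") for m in WORD.findall(raw) if len(m) >= MIN_LENGTH]
print(words)
```

Because digits are not in the character class, the trailing "0" is no longer picked up. The trade-off is that words that legitimately contain digits would be cut short, so the class may need adjusting for other data.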
