0

I have been trying to extract a particular substring from each line of a multi-line string in Python.

The string runs to 400 lines and contains non-English characters as well (Chinese, for instance).

For example, my_string = '''\n1. Up in the air: 悬而未决\n2. Out of the woods: 摆脱困境\n3. Not all there: 智商掉线\n''' and so on, all the way to 400. Born to the purple: 出身显赫.

What I want to do: extract only the English part of each line and put the results in a list:

['Up in the air', 'Out of the woods', 'Not all there']

Here is my way of doing it.

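import re
# Split on everything from the colon to the end of each line
my_list = re.split(r':.*\n', my_string[1:])

# The last element is the empty string left after the final newline, so skip it
for line in my_list[:-1]:
    # Strip the leading item number, the dot and any spaces
    olist = re.sub(r'^\d+\.\s*', '', line)
    print(olist)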

Is it possible to do this in one line?

Thank you

JamesNEW
  • Remember to use a raw string for regular expressions: https://stackoverflow.com/questions/12871066/what-exactly-is-a-raw-string-regex-and-how-can-you-use-it – Barmar Dec 16 '21 at 17:18
  • Is it safe to assume that each line starts with English/ascii chars up to the colon `:` and then Chinese? – buran Dec 16 '21 at 17:19
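
A small sketch of why the raw-string advice above matters. The `\b` pattern is only an illustration and does not come from the question; the question's own `'\d'` happens to work because Python leaves unrecognized escapes alone, though recent versions warn about it.

```python
import re

my_string = "\n1. Up in the air: 悬而未决\n2. Out of the woods: 摆脱困境\n"

# In a normal string literal, "\b" is a backspace character,
# so the regex engine never sees the word-boundary escape.
plain = "\bUp\b"
# In a raw string the backslash survives and reaches the regex engine.
raw = r"\bUp\b"

plain_matches = re.findall(plain, my_string)
raw_matches = re.findall(raw, my_string)

print(len(plain), len(raw))
print(plain_matches)
print(raw_matches)
```

The two patterns look identical in source but differ in length, and only the raw one matches the word.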

3 Answers

2
" ".join(re.findall(r'[a-zA-Z]+', my_string))

> 'Up in the air Out of the woods Not all there'
cazman
1

If you wanted 3 elements in your list (or 400 with your full input):

re.findall(r"\d\. (.*):", my_string)

Gives:

['Up in the air', 'Out of the woods', 'Not all there']
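
A slightly stricter variant, as a sketch only: it assumes every entry starts its own line with a number and a dot, and that the English phrase ends at the first ASCII colon. Anchoring with `re.MULTILINE` and excluding colons from the capture keeps the match from running on if the translation itself contains a colon. The names `pattern` and `phrases` are made up for this example.

```python
import re

my_string = '''
1. Up in the air: 悬而未决
2. Out of the woods: 摆脱困境
3. Not all there: 智商掉线
400. Born to the purple: 出身显赫
'''

# ^\d+\.\s*   line start, the whole item number, the dot, optional spaces
# ([^:\n]+)   everything up to (not including) the first colon on that line
pattern = re.compile(r"^\d+\.\s*([^:\n]+):", re.MULTILINE)

phrases = [p.strip() for p in pattern.findall(my_string)]
print(phrases)
```

With the sample above this should print the four English phrases, one list element per line of input.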
Chris J
0

If you want the English words in a list, you can do something like this:

import re
line='''\n1. Up in the air: 悬而未决\n2. Out of the woods: 摆脱困境\n3. Not all there: 智商掉线\n'''
# Drop everything except ASCII letters and whitespace
line = re.sub(r"[^A-Za-z\s]", "", line.strip())
# split() already returns the list of words
eng_list = line.split()
print(eng_list)

OUTPUT:

['Up', 'in', 'the', 'air', 'Out', 'of', 'the', 'woods', 'Not', 'all', 'there']

Otherwise, if you want the English words joined into a single string inside a list, you can use this:

import re
line='''\n1. Up in the air: 悬而未决\n2. Out of the woods: 摆脱困境\n3. Not all there: 智商掉线\n'''
# Drop everything except ASCII letters and whitespace
line = re.sub(r"[^A-Za-z\s]", "", line.strip())
# Collapse runs of whitespace and join the words with single spaces
eng_list = [" ".join(line.split())]
print(eng_list)

OUTPUT

['Up in the air Out of the woods Not all there']
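
If the goal is one list entry per phrase, as in the question, the same character filter could be applied line by line instead of to the whole string. This is a sketch under the assumption that the phrases contain only ASCII letters and spaces; apostrophes or hyphens inside an idiom would be dropped. The name `phrases` is made up for this example.

```python
import re

my_string = '''\n1. Up in the air: 悬而未决\n2. Out of the woods: 摆脱困境\n3. Not all there: 智商掉线\n'''

# Filter each non-empty line separately so the phrases stay apart:
# keep only ASCII letters and spaces, then trim the leftover padding.
phrases = [
    re.sub(r"[^A-Za-z ]", "", ln).strip()
    for ln in my_string.splitlines()
    if ln.strip()
]
print(phrases)
```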
loopassembly