1

I have a txt file containing one sentence per line, and there are lines containing numbers attached to letters. For instance:

The boy3 was strolling on the beach while four seagulls appeared flying.
There were 3 women sunbathing as well.
All children were playing happily.

I would like remove lines like the first one (i.e. having numbers stuck to words) but not lines like the second which are properly written.

Has anybody got a slight idea?

  • 1
    You can start by separating the string into words using the `split` method. Then you can use a loop to check if the `word` is a number by using `isdigit()` method, if is a number, then you can ignore it, if not you will need to check if the word has any numbers by entering to a second loop – EnriqueBet May 29 '22 at 10:31
  • 2
    There is probably a smart way to check this by using regex, but you might need to dig deeper into that – EnriqueBet May 29 '22 at 10:32
  • I have made this regex `[A-ZÁÉÍÓÚÜÑa-záéíóúúñ][0-9]+|[0-9]+[A-ZÁÉÍÓÚÜÑa-záéíóúúñ]` but it kind of ugly haha. Note it must be useful for Spanish, this is why I introduce accents and ñ. – Javier Saldaña Martínez May 29 '22 at 10:39
  • 1
    This might help: https://stackoverflow.com/questions/6314614/match-any-unicode-letter Please post an answer once you find a solution you're happy with. – Joooeey May 29 '22 at 10:44
  • Hello @Joooeey. I will finally be using the one I posted. It works although it is ugly!! – Javier Saldaña Martínez May 29 '22 at 10:58
  • Do you at least have the code to read a text file and split into lines, or remove lines? Please provide a [example] and do some research here, e.g. [`[python] words containing numbers`](https://stackoverflow.com/search?q=%5Bpython%5D%20words%20containing%20numbers) – hc_dev May 29 '22 at 11:43

2 Answers2

1

You can use a simple regex pattern. We start with [0-9]+. This pattern detects any number 0-9 an indefinite amounts of times. Meaning 6, or 56, or 56790 works. If you want to detect sentences that have numbers attached to a string you could use something like this: ([a-zA-Z][0-9]+)|([0-9]+[a-zA-Z]) This regex string matches a string with a letter before a number or after a number. You can search strings using:

import re

lines = [
    'The boy3 was strolling on the beach while 4 seagulls appeared flying.',
    'There were 3 women sunbathing as well.',
]

for line in lines:
    res = re.search("([a-zA-Z][0-9]+)|([0-9]+[a-zA-Z])", line)
    if res is None:
        # remove line

However you can add more characters to the allowed letters if your sentences can include special characters and such.

Viktor VN
  • 235
  • 2
  • 7
  • 1
    He want to remove string that matches so condition should be `if res is not None` or just `if re.search("([a-zA-Z][0-9]+)|([0-9]+[a-zA-Z])", line):` – azro May 29 '22 at 11:35
  • 1
    Also could be this `re.search(r"([a-z]\d)|(\d[a-z])", line, flags=re.IGNORECASE)` – azro May 29 '22 at 11:35
0

Suppose, your input text is stored in file in.txt, you can use following code:

import re

with open("in.txt", "r") as f:
    for line in f:
        if not(re.search(r'(?!\d)[\w]\d|\d(?!\d)[\w]', line, flags=re.UNICODE)):
               print(line, end="")

The pattern (?!\d)[\w] looks for word characters (\w) excluding digits. The idea is stolen from https://stackoverflow.com/a/12349464/2740367

leu
  • 2,051
  • 2
  • 12
  • 25