0

I have text data to be cleaned using regex. However, some words in the text are immediately followed by numbers which I want to remove.

For example, one row of the text is:

Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes terminology10 Lessons learnt from the RUPES project12 Payment for environmental service and it potential and example in Vietnam16 Chapter Integrating payment for ecosystem service into Vietnams policy and programmes17 Chapter Creating incentive for Tri An watershed protection20 Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong district of Hoa Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang Bay Marine Protected Area Vietnam28 Synthesis and Recommendations30 References32

The first word in the above text should be 'Preface' instead of 'Preface2', and so on.

line = re.sub(r"[A-Za-z]+(\d+)", "", line)

This, however, removes the words as well, as seen:

Pes Lessons learnt from the RUPES Payment for environmental service and it potential and example in Chapter Integrating payment for ecosystem service into Vietnams policy and Chapter Creating incentive for Tri An watershed Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong district of Hoa Binh province Chapter 5 Local revenue sharing Nha Trang Bay Marine Protected Area Synthesis and

How can I capture only the numbers that immediately follow words?

BoJack Horseman

5 Answers

1

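You could try a lookbehind assertion to check for a word character before your numbers, and a word boundary (\b) at the end to force the regex to match only numbers at the end of a word. Python's re module only supports fixed-width lookbehinds, so the assertion checks a single character:

re.sub(r'(?<=\w)\d+\b', '', line)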

Hope this helps

EDIT: Sorry about the glitch mentioned in the comments, where this also matches numbers that are NOT preceded by words. That is because (sorry again) \w matches alphanumeric characters, not only alphabetic ones. Depending on what you would like to delete, you can use the positive version

re.sub(r'(?<=[a-zA-Z])\d+\b', '', line)

to only check for English alphabetic characters (you can add characters to the [a-zA-Z] list) preceding your number, or the negative version

re.sub(r'(?<![\d\s])\d+\b', '', line)

to match anything that is NOT \d (a digit) or \s (whitespace) before your desired number. This will also remove numbers that directly follow punctuation marks, though.
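
To see how these variants differ, here is a minimal sketch that runs them on a shortened sample. The `sample` string and its `page-7` token are made up for illustration and are not part of the original data.

```python
import re

# Hypothetical shortened sample; "page-7" is added to show the punctuation case.
sample = "Preface2 National Park 24 Chapter 5 Vietnam26 page-7"

# \w also matches digits, so the "4" in "24" counts as preceded by a word character.
any_word_char = re.sub(r"(?<=\w)\d+\b", "", sample)

# Positive lookbehind: the number must directly follow an ASCII letter.
after_letter = re.sub(r"(?<=[a-zA-Z])\d+\b", "", sample)

# Negative lookbehind: the number must not follow a digit or whitespace.
not_after_digit_or_space = re.sub(r"(?<![\d\s])\d+\b", "", sample)

print(any_word_char)             # Preface National Park 2 Chapter 5 Vietnam page-7
print(after_letter)              # Preface National Park 24 Chapter 5 Vietnam page-7
print(not_after_digit_or_space)  # Preface National Park 24 Chapter 5 Vietnam page-
```

Only the [a-zA-Z] version leaves both the standalone numbers and the number after the hyphen untouched.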

Schorsch
1

You can capture the text part and replace the whole match with that captured part. It is simply:

re.sub(r"([A-Za-z]+)\d+", r"\1", line)
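
As a quick check, here is a small sketch of that substitution on a shortened sample (the `sample` string is an excerpt chosen for illustration). Standalone numbers such as 24 and 5 are left alone because the pattern requires at least one letter directly before the digits.

```python
import re

# Shortened, illustrative excerpt of the row from the question.
sample = "Preface2 Contributors4 National Park 24 Chapter 5 Vietnam26"

# The letters are captured as group 1; the whole match (letters + digits)
# is replaced by group 1 alone, which drops the trailing digits.
cleaned = re.sub(r"([A-Za-z]+)\d+", r"\1", sample)

print(cleaned)  # Preface Contributors National Park 24 Chapter 5 Vietnam
```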
Serge Ballesta
0

Try this:

# Use only one of the following, depending on what you want to keep.
line = re.sub(r"([A-Za-z]+)(\d+)", "\\2", line)  # keep just the number
line = re.sub(r"([A-Za-z]+)(\d+)", "\\1", line)  # keep just the word
line = re.sub(r"([A-Za-z]+)(\d+)", r"\2", line)  # same as the first one
line = re.sub(r"([A-Za-z]+)(\d+)", r"\1", line)  # same as the second one

\\1 refers to the captured word and \\2 to the captured number. See: How to use python regex to replace using captured group?
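
A short sketch of the two backreferences side by side, assuming a two-word sample invented for illustration. It also shows that the escaped form and the raw-string form are the same replacement string.

```python
import re

# Illustrative two-word sample.
sample = "Preface2 Contributors4"
pattern = r"([A-Za-z]+)(\d+)"

# Group 1 is the word, group 2 is the number that follows it.
words_only = re.sub(pattern, r"\1", sample)
numbers_only = re.sub(pattern, r"\2", sample)

# An escaped backslash in a normal string equals the raw-string spelling.
same_replacement = ("\\1" == r"\1")

print(words_only)        # Preface Contributors
print(numbers_only)      # 2 4
print(same_replacement)  # True
```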

ESCE
0

Below, I'm proposing a working sample of code that might solve your problem.

Here's the snippet:

import re

# I will write a function that takes the test data as input and returns the
# desired result as stated in your question.

def transform(data):
    """Strip the trailing digits from words that end with a number."""
    # First, let's construct a pattern matching the words we're looking for
    pattern1 = r"([A-Za-z]+\d+)"

    # Let's construct another pattern matching the trailing digits that will
    # be stripped from each of those words.
    pattern2 = r"\d+$"

    # Let's find all matching words
    matches = re.findall(pattern1, data)

    # Let's construct a list of replacements, one for each word
    replacements = []
    for match in matches:
        replacements.append(re.sub(pattern2, '', match))

    # Intermediate variable to construct tuples of (word, replacement) for
    # use in the string method 'replace'
    changers = zip(matches, replacements)

    # We now iteratively change every matched word. Strings are immutable,
    # so the result of 'replace' must be assigned back.
    output = data
    for changer in changers:
        output = output.replace(*changer)

    # The work is done, we can return the result
    return output

For testing purposes, we run the above function with your test data:

data = """
Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes terminology10 Lessons     
learnt from the RUPES project12 Payment for environmental service and it potential and 
example in Vietnam16 Chapter Integrating payment for ecosystem service into Vietnams 
policy and programmes17 Chapter Creating incentive for Tri An watershed protection20 
Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter 
Building payment mechanism for carbon sequestration in forestry a pilot project in Cao 
Phong district of Hoa Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang 
Bay Marine Protected Area Vietnam28 Synthesis and Recommendations30 References32
"""

result = transform(data)

print(result)

And the result looks like this:

Preface Contributors Abrreviations Acknowledgements Pes terminology Lessons learnt from 
the RUPES project Payment for environmental service and it potential and example in 
Vietnam Chapter Integrating payment for ecosystem service into Vietnams policy and 
programmes Chapter Creating incentive for Tri An watershed protection Chapter 
Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building 
payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong 
district of Hoa Binh province Vietnam Chapter 5 Local revenue sharing Nha Trang Bay 
Marine Protected Area Vietnam Synthesis and Recommendations References
eapetcho
-1

You can use a character range for digits as well. Note that this removes every digit, including standalone numbers such as 24:

re.sub(r"[0-9]", "", line)
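
For comparison, a minimal sketch of what this pattern does on a shortened sample (the `sample` string is made up for illustration). Because `[0-9]` has no condition on what precedes the digit, the standalone 24 and 5 disappear too, leaving double spaces behind.

```python
import re

# Illustrative sample with both attached and standalone numbers.
sample = "Preface2 Park 24 Chapter 5 Vietnam26"

# Every digit is removed, regardless of what comes before it.
stripped = re.sub(r"[0-9]", "", sample)

print(repr(stripped))  # 'Preface Park  Chapter  Vietnam'
```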
hqkhan