Exclude \n when reading input from file

Question

I am trying to extract user location data while scraping Twitter. I am having trouble with the regex: specifically, I want to exclude "\n" from the output.

Current regex:

import re

data = open("user_locations.txt", "r")
valid_ex = re.compile(r'([A-Z][a-z]+), ([A-Za-z]+[^\n])')

user_locations.txt:

California, USA
You are your own ExclusiveLogo
Around The World
Galatasaray
★DM 4 PROMO / CONTENT REMOVAL★
Glasgow, Scotland
United States
Berlin, Germany
Global

Expected output:

['California, USA', 'Glasgow, Scotland', 'Berlin, Germany']

Actual output:

['California, USA\n', 'Glasgow, Scotland\n', 'Berlin, Germany\n']

Another possible reason for the discrepancy between the expected and actual output is the way I use search() when building the list. That is:

locations_list = []
for line in data:
    result = valid_ex.search(line)
    if result:
        locations_list.append(line)
print(locations_list)

Thank you, any help would be greatly appreciated! :)

"\n" isn't a part of a regex match unless you do a multiline search with "DOTALL". The \n wasn't in the regex match, but it is in the original line and that's what you saved. You could do `line.strip()`. — tdelaney, May 12 '18 at 00:35
You don't need a regex, and this is just a generic thing when reading input from file. — smci, May 13 '18 at 09:46
Curious which other answers you saw that weren't a solution? SO is full of variants of this question going back a decade. If anything there are too many duplicates and we need to close some in favor of others. — smci, May 13 '18 at 09:56

user3483203 · Accepted Answer · 2018-05-12T00:43:47.783

When you find a match, you call locations_list.append(line). This appends the entire line (including the newline character), not just what was matched.
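A minimal sketch of that difference, assuming the pattern from the question and a single illustrative line: the match object holds only the text the pattern consumed, while `line` still carries its trailing newline.

```python
import re

# Pattern copied from the question
valid_ex = re.compile(r'([A-Z][a-z]+), ([A-Za-z]+[^\n])')

line = "California, USA\n"  # a line as file iteration would yield it
result = valid_ex.search(line)

print(repr(line))             # the whole line, newline included
print(repr(result.group(0)))  # only the text the pattern matched
```

So appending `result.group(0)` instead of `line` would also avoid the newline.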

Here are a couple of options to get your desired result:

Option 1

Change locations_list.append(line) to locations_list.append(line.strip())
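As a rough sketch, the loop from the question with that one change might look like this. The `io.StringIO` object is only a stand-in for the opened file so the snippet runs on its own, and the sample lines are a subset of the question's data.

```python
import io
import re

valid_ex = re.compile(r'[A-Z][a-z]+, [A-Za-z]+')

# Stand-in for open("user_locations.txt"); iterating it yields lines ending in "\n"
data = io.StringIO("California, USA\nAround The World\nGlasgow, Scotland\nGlobal\n")

locations_list = []
for line in data:
    if valid_ex.search(line):
        # strip() removes the trailing newline (and any surrounding whitespace)
        locations_list.append(line.strip())

print(locations_list)  # ['California, USA', 'Glasgow, Scotland']
```

Note that `search()` matches anywhere in the line, so the whole stripped line is stored even when the pattern covers only part of it.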

Option 2

Grab the result of the desired match instead:

import re

with open('user_locations.txt') as f:
    print(re.findall(r'[A-Z][a-z]+, [A-Za-z]+', f.read()))

Output:

['California, USA', 'Glasgow, Scotland', 'Berlin, Germany']
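One detail worth noting: the pattern in Option 2 has no capture groups, unlike the one in the question. The sketch below, using an illustrative in-memory string rather than the file, shows why that matters for `re.findall`, which returns tuples of groups when the pattern contains more than one group.

```python
import re

# Illustrative text standing in for the file contents
text = "California, USA\nGlobal\nBerlin, Germany\n"

# With two capture groups, findall returns one tuple per match
with_groups = re.findall(r'([A-Z][a-z]+), ([A-Za-z]+)', text)

# Without groups, findall returns the full matched strings
without_groups = re.findall(r'[A-Z][a-z]+, [A-Za-z]+', text)

print(with_groups)     # [('California', 'USA'), ('Berlin', 'Germany')]
print(without_groups)  # ['California, USA', 'Berlin, Germany']
```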

score 0 · Answer 2 · answered May 12 '18 at 00:31

Have you considered using str.strip() to remove the trailing newlines?
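For reference, a small sketch of how `str.strip()` differs from `str.rstrip("\n")`; the padded sample line is made up for illustration. If surrounding spaces should be preserved and only the newline removed, `rstrip("\n")` may be the better fit.

```python
line = "  Glasgow, Scotland \n"  # hypothetical line with extra spaces

stripped = line.strip()          # removes whitespace on both ends
newline_only = line.rstrip("\n") # removes only trailing newline characters

print(repr(stripped))      # 'Glasgow, Scotland'
print(repr(newline_only))  # '  Glasgow, Scotland '
```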

Shamus


score 0 · Answer 3 · answered May 12 '18 at 00:39

A simple solution would be to replace every run of contiguous whitespace characters with a single space.

text = re.sub(r'\s+', ' ', text)
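To see what this substitution does, here is a short sketch on an illustrative string standing in for the file contents. It removes the newlines, but the result is a single string rather than the list the question expects, so a further step would be needed to split it back into locations.

```python
import re

# Illustrative text standing in for the file contents
text = "California, USA\nGlasgow, Scotland\nBerlin, Germany\n"

# Every run of whitespace (including each newline) becomes one space
text = re.sub(r'\s+', ' ', text)

print(repr(text))  # 'California, USA Glasgow, Scotland Berlin, Germany '
```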

James

