I am attempting to return user location data while scraping Twitter. I am having trouble with the regex: specifically, I wish to exclude "\n" from the output.

Current regex:

import re

data = open("user_locations.txt", "r")
valid_ex = re.compile(r'([A-Z][a-z]+), ([A-Za-z]+[^\n])')

user_locations.txt:

California, USA
You are your own ExclusiveLogo
Around The World
Galatasaray
★DM 4 PROMO / CONTENT REMOVAL★
Glasgow, Scotland
United States
Berlin, Germany
Global

Expected output:

['California, USA', 'Glasgow, Scotland', 'Berlin, Germany']

Actual output:

['California, USA\n', 'Glasgow, Scotland\n', 'Berlin, Germany\n']

Another possible reason for the discrepancy between the expected and actual output may be the way I am using search() when building the list. That is:

locations_list = []
for line in data:
    result = valid_ex.search(line)
    if result:
        locations_list.append(line)
print(locations_list)

Thank you, any help would be greatly appreciated! :)

Darcy
  • "\n" isn't a part of a regex match unless you do a multiline search with "DOTALL". The \n wasn't in the regex match, but it is in the original line and that's what you saved. You could do `line.strip()`. – tdelaney May 12 '18 at 00:35
  • You don't need a regex, and this is just a generic thing when reading input from file. – smci May 13 '18 at 09:46
  • Curious which other answers you saw that weren't a solution? SO is full of variants of this question going back a decade. If anything there are too many duplicates and we need to close some in favor of others. – smci May 13 '18 at 09:56

3 Answers

When you find a match, you call locations_list.append(line). This appends the entire line (including the newline character), not just what was matched.
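To see the difference, here is a minimal sketch that uses the regex from the question on a single in-memory line (the sample string is an assumption standing in for one line read from the file). It prints the whole line next to the text the pattern actually matched:

```python
import re

# Pattern copied from the question.
valid_ex = re.compile(r'([A-Z][a-z]+), ([A-Za-z]+[^\n])')

# A line as it would come out of the file, trailing newline included.
line = "California, USA\n"
result = valid_ex.search(line)

print(repr(line))             # the whole line, newline and all
print(repr(result.group(0)))  # only the text matched by the pattern
```

The match itself never contains the newline, since `[^\n]` cannot consume it; the newline only shows up because the full line is what gets stored.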

Here are a couple of options to get your desired result:

Option 1

Change locations_list.append(line) to locations_list.append(line.strip())
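As a rough sketch, the loop from the question with that one change could look like the following. To keep it self-contained, `io.StringIO` stands in for the opened file and `sample` is an assumed copy of the lines from user_locations.txt:

```python
import io
import re

valid_ex = re.compile(r'([A-Z][a-z]+), ([A-Za-z]+[^\n])')

# Assumed stand-in for open("user_locations.txt"): same lines, held in memory.
sample = (
    "California, USA\n"
    "You are your own ExclusiveLogo\n"
    "Around The World\n"
    "Galatasaray\n"
    "Glasgow, Scotland\n"
    "United States\n"
    "Berlin, Germany\n"
    "Global\n"
)
data = io.StringIO(sample)

locations_list = []
for line in data:
    if valid_ex.search(line):
        # strip() drops the trailing newline (and any surrounding whitespace)
        locations_list.append(line.strip())

print(locations_list)
```

With a real file, replacing the `io.StringIO(sample)` line with a `with open(...)` block should behave the same way.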

Option 2

Grab the result of the desired match instead:

import re

with open('test.txt') as f:
    print(re.findall(r'[A-Z][a-z]+, [A-Za-z]+', f.read()))

Output:

['California, USA', 'Glasgow, Scotland', 'Berlin, Germany']
user3483203

Have you considered using str.strip() to remove the trailing newlines?
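A small illustration of what that does, on a made-up line with leading spaces (the sample string is only an assumption for the demo). Note that `strip()` removes whitespace at both ends, while `rstrip("\n")` removes only trailing newlines:

```python
line = "  Glasgow, Scotland\n"

stripped = line.strip()          # removes whitespace on both ends
no_newline = line.rstrip("\n")   # removes only trailing newline characters

print(repr(stripped))
print(repr(no_newline))
```

Which one is preferable depends on whether leading or trailing spaces in the location text should be kept.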

Shamus

A simple solution would be to replace all contiguous whitespace characters with a single space.

text = re.sub(r'\s+', ' ', text)
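For illustration, here is a sketch of that substitution applied to a few of the sample lines (the `text` value is an assumption standing in for the file contents read with `read()`). Since newlines are whitespace too, all lines end up joined into one string, so the locations still have to be pulled out afterwards, for example with `re.findall`:

```python
import re

# Assumed stand-in for the result of f.read() on the locations file.
text = "California, USA\nGlasgow, Scotland\nBerlin, Germany\n"

# Every run of whitespace, newlines included, becomes a single space.
flattened = re.sub(r'\s+', ' ', text)
print(repr(flattened))

# The individual locations can then be extracted from the flattened string.
locations = re.findall(r'[A-Z][a-z]+, [A-Za-z]+', flattened)
print(locations)
```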
James