
I have downloaded a lot of song lyrics from Genius for a project (in Python) and now need to clean them. As an example, here is a snippet of one song's lyrics:

lyric = '[Letra de "La Jeepeta"]\n\n[Intro: Nio García & Juanka El Problematik]\nNio García\nBrray\nJuanka\nLas Air Force son brand new\nLas moña\' verde\' como mi Sea-Doo\nUnas prendas que me\u2005cambian\u2005la actitú\'\nEsta noche\u2005no queremo\' revolú\n\n[Coro: Nio García & Juanka El Problematik]\nArrebata\'o, dando vuelta en\u2005la jeepeta (Dando vuelta en la jeepeta)\nAl la\'o mío tengo una rubia que tiene grande\' las

In the lyrics I want to:

  1. Remove square brackets and everything between them. I do that with the following:
re.sub(r"[\[].*?[\]]", "", lyric)
  2. Remove line breaks (\n). I do that with the following:
re.sub(r"[\n]"," ",lyric)

However, if there is no \n in the lyric, I get an error.

  3. Remove \u. I am not sure why this appears in some songs.
re.sub(r"\[\u]", " ", lyric)

However, I get the following error: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 15-16: truncated \uXXXX escape

So first of all, can you help me with the errors I'm getting? And secondly, is there a way to combine several regular expressions into one so I don't need to run several commands?

Thanks in advance! :-)

andKaae
  • There is no need to remove `\uXXXX`, these are legit chars. See [what your string looks like](https://ideone.com/Zf0FAj). All you need is the two first patterns. `\n` can be removed without regex, `s.replace('\n', ' ')`. So use `re.sub(r"\[[^][]*]", "", s.replace('\n', ' '))` – Wiktor Stribiżew Nov 26 '20 at 12:28
  • Or `re.sub(r"\[[^][]*]", "", re.sub(r'\n+', ' ', s))` or `re.sub(r"\s*(?:\[[^][]*]|[\r\n])+", " ", s)` – Wiktor Stribiżew Nov 26 '20 at 12:33
  • I have downloaded the songs from Genius. If you take a look at the original lyrics, these characters are not part of the lyrics. The link to this specific song is here: [song](https://genius.com/Nio-garcia-brray-and-juanka-la-jeepeta-lyrics). So I think I need to remove them to do proper text analysis. – andKaae Nov 26 '20 at 12:34
  • Basically, your code without Step 3 already yields the expected results. Just check the string you get with `print(lyric)` – Wiktor Stribiżew Nov 26 '20 at 12:40
  • @WiktorStribiżew but the `\u` is not a part of the actual lyrics. So if I want to do text analysis I will have to remove this part. – andKaae Nov 26 '20 at 12:42

1 Answer


The \u2005 you see in the output is a U+2005 FOUR-PER-EM SPACE (Zs) character.
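One way to confirm this is to inspect the character with the standard-library `unicodedata` module. This is a minimal sketch; the `sample` string is just a fragment of the lyric from the question.

```python
import unicodedata

sample = "me\u2005cambian"  # fragment of the lyric from the question

ch = sample[2]  # the character between "me" and "cambian"
print(repr(ch))                  # repr() shows the escape form, as in the question
print(unicodedata.name(ch))      # FOUR-PER-EM SPACE
print(unicodedata.category(ch))  # Zs (space separator)
print(ch.isspace())              # True, so \s matches it in a str pattern
```

The `\u2005` text only appears when the string is shown via `repr()`; the string itself contains a single whitespace character at that position.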

You might consider a regexp to replace all Unicode whitespace with a single space instead:

re.sub(r"\s+", " ", lyric, flags=re.UNICODE)
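Combined with the bracket removal from the question, the whole cleanup might look like the sketch below. The helper name `clean_lyric` and the shortened sample string are made up for illustration. Note that `re.sub` returns the string unchanged when nothing matches, so a lyric without `\n` should not raise an error by itself.

```python
import re

def clean_lyric(lyric: str) -> str:
    """Hypothetical helper: drop [bracketed] headers, then normalise whitespace."""
    # Replace section headers such as [Intro: ...] with a space.
    # [^\[\]]* keeps each match inside a single pair of brackets.
    without_headers = re.sub(r"\[[^\[\]]*\]", " ", lyric)
    # Collapse every run of whitespace (\n, \u2005, plain spaces) into one space.
    return re.sub(r"\s+", " ", without_headers).strip()

# Shortened sample based on the lyric in the question.
lyric = "[Intro: Nio García]\nUnas prendas que me\u2005cambian\u2005la actitú'\n\n[Coro]\nArrebata'o"
print(clean_lyric(lyric))
# Unas prendas que me cambian la actitú' Arrebata'o
```

Two passes are used here rather than one combined pattern because the order is then explicit: headers are removed first, and the whitespace they leave behind is collapsed afterwards.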
AKX