Issue matching accented characters with Perl

Question

This code:

perl -pe 's/^(\D\w+ \w+)( word )/\1;word/gi'

doesn't work when the input has words with accented or special characters such as á or Ș.

Clarifications:

I have this code to count the files per artist.

find /PATH/ -type f -exec basename "{}" + 2>/dev/null |

perl -pe 's/ - .*//g' | LC_ALL=C  sort -f | uniq -c -i|

gsed -e 's/$/;/'|

awk '{numero=$1;$1=""}{print $0,numero}'|

perl -pe 's/^(\D\w+ \w+)( & )/\1;&/g' | 
perl -pe 's/^(\D\w+ \w+ \w+)( & >)/\1;&/g' | 
perl -pe 's/^(\D\w+ \w+ \w+ \w+)( & )/\1;&/g' | 
perl -pe >'s/^(\D\w+ \w+ \w+ \w+ \w+)( & )/\1;&/g' |

perl -pe 's/^(\D\w+ \w+)( Con )/\1;Con/gi' | 
perl -pe 's/^(\D\w+ \w+ >\w+)( Con )/\1;Con/gi' | 
perl -pe 's/^(\D\w+ \w+ \w+ \w+)( Con >)/\1;Con/gi' |  
perl -pe 's/^(\D\w+ \w+ \w+ \w+ \w+)( Con )/\1;Con/gi'|

perl -pe 's/^(\D\w+ \w+)( Și )/\1;Și/gi' | 
perl -pe 's/^(\D\w+ \w+ \w+)( >Și )/\1;Și/gi' | 
perl -pe 's/^(\D\w+ \w+ \w+ \w+)( Și )/\1;Și/gi' | 
perl >-pe 's/^(\D\w+ \w+ \w+ \w+ \w+)( Și )/\1;Și/gi'| > /PATH/File.txt

I’ve these files:

Betty Curtis & Orchestra - Song Title
Betty Curtis Con Johnny Dorelli - Song Title
Betty Curtis - Song Title
Margareta Pâslaru - Song Title
Margareta Pâslaru & Grup - Song Title
Margareta Pâslaru Și Sincron - Song Title
Matilde Sánchez - Song Title
Matilde Sánchez Con El Mariachi Vargas De Tecalitlán - Song Title

The desired output would be:

Betty Curtis; 3
Margareta Pâslaru; 3
Matilde Sánchez; 2

The output I get instead is:

Betty Curtis; 3
Margareta Pâslaru; 1
Margareta Pâslaru & Grup; 1
Margareta Pâslaru Și Sincron; 1
Matilde Sánchez; 1
Matilde Sánchez Con El Mariachi Vargas De Tecalitlán; 1

Exactly, the code is very complicated (the entire script runs to nineteen lines...). The rule is to truncate the name if there are conjunctions or parentheses, except if the name is composed of a single word. If there are no conjunctions or parentheses, the name is saved in full.

e.g.: “Gervis Quebodeaux Rayne Serenaders” remains “Gervis Quebodeaux Rayne Serenaders;”

I'd like to compact the "perl -pe" section: repeating (\D\w+ \w+), (\D\w+ \w+ \w+), etc. is tedious. But I do not know how to do it.

I had to find a balance between summarizing enough to make the count and keeping as much information as possible.

At the moment I have 30 cases (rules) in addition to “&”: “ With ”, “ Con ”, “ e ”, “ Y ”, “ Et ”, “ Und ”, etc., in many languages of the world.

The script works fine, but it does not work with names that contain accented or special letters.

The script works like this:

For example, I have many files of Duke Ellington, with many different historical headers.

Duke Ellington: 2 files
Duke Ellington & Cotton Club O.: 3
Duke Ellington & His Famous O.: 7
Duke Ellington & His Famous O.;(Ft. Ben Webster): 4
Duke Ellington & His Famous O.;(Ft. Johnny Hodges): 3
Duke Ellington & His O.: 129 
Duke Ellington & His O. (ft. Ben Webster): 14
Duke Ellington & His O. (Ft. Johnny Hodges): 8
Duke Ellington & His O. (pn.): 2
Duke Ellington &His O. (v. Al Hibble): 1
Duke Ellington &His O. (v. Al Hibbler): 1
Duke Ellington &His O. (v. Herb Jeffries): 9
Duke Ellington &His O. (v. Ozzie Bailey): 1
Duke Ellington &His O. (v. Ozzie Bailey, Ray Nance Vln.): 1
Duke Ellington &His O.;(v. Ray Nance?): 1
Duke Ellington &His O.;(v.M): 1
Duke Ellington (Ft. Rhythm Boys (2°c Bing Crosby, Al Rinker, & Harry Barris)): 1
Duke Ellington (Ft. Rhythm Boys (Bing Crosby, Al Rinker, & Harry Barris)): 1
Duke Ellington (v. Dick Robertson): 1
Duke Ellington w Count Basie: 3
Duke Ellington w Gerald Wilson: 13
Duke Ellington’s Spacemen: 1
Duke Ellington’s Washingtonians: 1

Through the work of the script that produces this file

Duke Ellington; 2
Duke Ellington;&Cotton Club O.; 3
Duke Ellington;&His Famous O.; 7
Duke Ellington;&His Famous O.;(Ft. Ben Webster); 4
Duke Ellington;&His Famous O.;(Ft. Johnny Hodges); 3
Duke Ellington;&His O.; 129
Duke Ellington;&His O.;(ft. Ben Webster); 14
Duke Ellington;&His O.;(Ft. Johnny Hodges); 8
Duke Ellington;&His O.;(pn.); 2
Duke Ellington;&His O.;(v. Al Hibble); 1
Duke Ellington;&His O.;(v. Al Hibbler); 1
Duke Ellington;&His O.;(v. Herb Jeffries); 9
Duke Ellington;&His O.;(v. Ozzie Bailey); 1
Duke Ellington;&His O.;(v. Ozzie Bailey, Ray Nance Vln.); 1
Duke Ellington;&His O.;(v. Ray Nance?); 1
Duke Ellington;&His O.;(v.M); 1
Duke Ellington;(Ft. Rhythm Boys (2°c Bing Crosby, Al Rinker, & Harry Barris)); 1
Duke Ellington;(Ft. Rhythm Boys (Bing Crosby, Al Rinker, & Harry Barris)); 1
Duke Ellington;(v. Dick Robertson); 1
Duke Ellington;w Count Basie; 3
Duke Ellington;w Gerald Wilson; 13
Duke Ellington; Spacemen; 1
Duke Ellington; Washingtonians; 1

This is the output:

Duke Ellington: 208

Complete code: https://www.sendspace.com/file/dlep9q

What do you mean by "_doesn't work_"? If that line is your complete code it's not set up to do unicode at all. Also, how do you get input? — zdim, Jul 21 '19 at 02:04
Thank you for the update. A question: the text after "_I've these files:_" --- is that a file, and you want the count of artist from it? Is that the whole job? (The code that you show is way, way too complicated.) — zdim, Jul 21 '19 at 09:54
Is it always the first two words on the line that make up the name? — zdim, Jul 21 '19 at 09:55
I'll reply here. I've updated the question because I needed more characters than a comment allows — manub, Jul 21 '19 at 15:42
Thank you; that's a _much_ more difficult problem. (1) Getting your code to do unicode properly isn't hard, which was the original question and what my answer addressed (2) Parsing _names_ in natural languages correctly is so hard that it really isn't possible to solve in general. But here you've got criteria for how to truncate the string, so that helps. However, I don't understand your rules well so I offer only a basic framework for it. see edit. — zdim, Jul 21 '19 at 19:20

zdim · Answer 1 · 2021-12-15T08:33:47.807

6

The shown one-liner doesn't enable any unicode support.^† You'd want, at least, to set up input/output streams for it, and in a script I'd recommend

use open qw(:std :encoding(UTF-8));

In a one-liner there are switches; see what combination you need in perlrun, under -C. For example

echo "á, Ș." | perl -CASD -wnE'@m = /\w+/g; say for @m'

prints

á
Ș

so the accented characters are understood.

Additionally, you may need \X (instead of \w) to match an extended grapheme cluster.
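
To illustrate why \X can matter: the same accented letter can be stored either as one precomposed code point or as a base letter followed by a combining mark, and file names on some systems (macOS, for instance) are often in the decomposed form. The following is a minimal sketch, not part of the original answer; the helper names count_code_points and count_clusters are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers: count code points versus extended grapheme clusters.
sub count_code_points { my ($s) = @_; return length $s; }
sub count_clusters    { my ($s) = @_; my $n = () = $s =~ /\X/g; return $n; }

my $precomposed = "\x{E1}";      # "á" as a single code point (U+00E1)
my $decomposed  = "a\x{301}";    # "a" followed by U+0301 COMBINING ACUTE ACCENT

for my $s ($precomposed, $decomposed) {
    printf "code points: %d, clusters: %d\n",
        count_code_points($s), count_clusters($s);
}
```

This prints "code points: 1, clusters: 1" for the precomposed form and "code points: 2, clusters: 1" for the decomposed one, so \X sees one character in both cases even though the strings differ in length.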

^† This post may be relevant, with a comforting first part but scary (and informative) rest.

Literature: perlunitut, perlunifaq, perluniintro (with its Unicode I/O for example), and perlunicode. Have perluniprops handy. There is also a cookbook of sorts, perlunicook (see Standard preamble for starters), and there's Encode.

Note that the regex per se is unicode aware.

The question got edited, with additions of code, example input and its processing, and a link to a complete program. Some clarification on how names are decided was added, for example:

The rule is to truncate the name if there are conjunctions or parentheses, except if the name is composed of a single word. If there are no conjunctions or parentheses, the name is saved in full

which means that the truncated name needs to be at least two words long, or the string shouldn't be truncated (as clarified in comments). This bypasses almost completely the very difficult problem of parsing names in natural languages, since the "conjunctions" are meant to be provided.

Using a few from that list (from a program linked in the question), for a demo

use warnings;
use strict;
use feature 'say';

use utf8;                            # for utf8 characters in this script
use open qw(:std :encoding(UTF-8));  # for standard streams

sub extract_name {
    my ($line) = @_;
    # Rule for extracting the name:
    #   Truncate at $cutoff phrase if there are at least two words before it
    #   (incomplete list of alternations for a demo, from linked program)
    my $cutoff = qr{\s+(?:-|&|And|Con|Și)(?:\s+|\z)};  # with spaces
    my $parens = qr{\s+\(};                            # no space after

    # If there is a cut-off phrase on the line, extract what's before it
    # If that is at least two words long, return it;
    #   otherwise, return the whole line 
    if ( my ($name) = $line =~ /(.*?)(?:$cutoff|$parens)/ ) {
        return $name if split(' ', $name) >= 2;
    }
    return $line;
}

my $file = shift // die "Usage: $0 file\n";

open my $fh, '<', $file or die "Can't open $file: $!";

my %name_count;
while (my $line = <$fh>) { 
    chomp $line;
    ++$name_count{ extract_name($line) };
}

say "$_; $name_count{$_}" for sort keys %name_count;

The regex pattern for a "conjunction" (cutoff phrase) is formed using the qr operator for easier work. It is simply an alternation (|) of given conjunctions, here a few picked from the linked program. I separate those that don't need a trailing space into another pattern, here only for parentheses.
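
With some 30 phrases, typing the alternation by hand gets error-prone, so the pattern could instead be assembled from a list. The sketch below is an illustration under assumptions, not the answer's code: build_cutoff and truncate_name are hypothetical names, the list is a sample rather than the asker's full set, and matching is case-sensitive, unlike the /i in the question's pipeline.

```perl
use strict;
use warnings;
use utf8;                                # the list below contains non-ASCII text
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical sample list; the real one would hold all of the cutoff phrases.
my @conjunctions = ('-', '&', 'And', 'Con', 'Și', 'With', 'Et', 'Und');

# Build one alternation from the list, escaping regex metacharacters.
sub build_cutoff {
    my @phrases = @_;
    my $alternation = join '|', map { quotemeta($_) } @phrases;
    return qr{\s+(?:$alternation)(?:\s+|\z)};
}

# Same rule as in the program above: cut at the first phrase, but only if
# at least two words precede it; otherwise keep the whole line.
sub truncate_name {
    my ($line, $cutoff) = @_;
    if ( my ($name) = $line =~ /(.*?)$cutoff/ ) {
        return $name if split(' ', $name) >= 2;
    }
    return $line;
}

my $cutoff = build_cutoff(@conjunctions);
print truncate_name($_, $cutoff), "\n"
    for 'Betty Curtis Con Johnny Dorelli - Song Title',
        'Margareta Pâslaru Și Sincron - Song Title',
        'Johnny & The Hurricanes';
```

This prints "Betty Curtis", "Margareta Pâslaru" and "Johnny & The Hurricanes". Adding a rule then means adding one string to the list rather than another stage to a pipeline.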

It is a good idea to sort reports as they are printed so I do this even though sort with cmp may produce incorrect results with unicode; please see this post for how to correctly sort with utf8.
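
One common way to get a more natural order, which may or may not be what the linked post recommends, is the core module Unicode::Collate. The following is a minimal sketch assuming its default (not language-specific) collation; sort_names is a hypothetical helper name.

```perl
use strict;
use warnings;
use utf8;                                # for the accented literals below
use Unicode::Collate;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: sort with the default Unicode Collation Algorithm.
sub sort_names {
    my @names = @_;
    return Unicode::Collate->new->sort(@names);
}

my @names = ('Zoltán', 'Álvaro', 'Betty');
print "cmp:     @{[ sort @names ]}\n";          # plain code point order
print "collate: @{[ sort_names(@names) ]}\n";   # accent-aware order
```

The plain sort gives "Betty Zoltán Álvaro", because Á has a higher code point than Z, while the collator gives "Álvaro Betty Zoltán". In the program above this would amount to replacing sort keys %name_count with a call to the collator's sort method.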

I test this with the input shown in the question, to which I add lines

Johnny & The Hurricanes
An Awesome Band (Unknown)

so as to be able to test the finer points of the criteria for the name. It prints

An Awesome Band; 1
Betty Curtis; 3
Johnny & The Hurricanes; 1
Margareta Pâslaru; 3
Matilde Sánchez; 2

I strongly advise against a "one"-liner for a job of this complexity (I could barely get the above sub to parse and work correctly when packed into a command-line).

If this program needs to work with lines piped into it let me know and I can add that.

edited Dec 15 '21 at 08:33

answered Jul 21 '19 at 02:13

zdim


About the code [perl -CSD -wnE' ++$name_count{ (/(\w+\s+\w+)/)[0] }; END { say "$_; $name_count{$_}" for sort keys %name_count } ' input]: I cannot understand how it could be substitute the code [perl -pe 's/^(\D\X+ \X+)( Con )/\1;Con/gi' | perl -pe 's/^(\D\X+ \X+ \X+)( Con )/\1;Con/gi'| perl -pe 's/^(\D\X+ \X+ \X+ \X+)( Con )/\1;Con/gi'| perl -pe 's/^(\D\X+ \X+ \X+ \X+ \X+)( Con )/\1;Con/gi'|] – manub Jul 21 '19 at 17:01
@manub "_I cannot understand how it could be substitute_" --- I didn't try to substitute for your code. The question wants to extract names from text, so i wrote code for that. I can't imagine that you really need code as complicated as shown, so I've tried to offer you a simpler approach. But I don't know what your rules are for what a "name" is; that is a difficult problem in general. – zdim Jul 21 '19 at 19:05
@manub You say "_remove conjunctions if the name has two or more words._" -- OK, that's clear ... but the question doesn't quite say that (read it again carefully). After "_Through the work of the script that produces this file_" you repeat the text from before? (Or are there details that I am missing?) Then you give the total count for a two-word name ("Duke Ellington"). _I don't understand._ Then, you say "_30 cases (rules) in addition to “&”_" and then reel off unformatted phrases like "_ e _" (single e with spaces) ... is that with an accent of some sort? Please be clearer. – zdim Jul 23 '19 at 01:28
@manub Please recall the question -- "_Issue matching accented characters_" -- that's because you didn't turn on unicode support, and I answered that. All this other stuff is extra. But I'd be happy to help you out with that, too, if you can clarify your criteria. – zdim Jul 23 '19 at 01:31
@manub Now can you please clarify your criteria for this _additional problem_ ? (As I have solved the one with accented characters.) Is it like this: truncate the line at any one of those (30-ish) characters, if it's more than two words long? Is that correct and is that it? – zdim Jul 23 '19 at 01:38
#1 "then another one for three words, then for four, then five ..". Have you downloaded the code from the link at the bottom of the answer? The one where there is "Duke Ellington 208". #2 "After "Through the work of the script that produces this file" you repeat the text from before? (Or are details that I am missing?) " Yes, look at the character ";" after "Ellington". See how it is repeated regularly to cut off the strings. #3 "My first response above got deleted (??)" I have difficulty using this forum. There may be misunderstandings. Please answer the 3 points and then I'll clarify everything – manub Jul 23 '19 at 08:55
@manub (1) yes, I looked at complete code and I see the long pipelines for two words (different from the question?). (2) Ah, the semicolon, I see that now; have no clue what that means though. An intermediate result? (3) I had posted a comment first, which then disappeared. (Never mind, I repeated what mattered) – zdim Jul 23 '19 at 09:05
@manub Thank you for providing the whole program (I cannot afford the time to study that but I see better what you are doing). Those pipelines are not needed -- that can be all done in a line or a few, in a script. If you can state clearly the rules for extracting the name from a line I can help you out with Perl for that. (I asked in a comment above -- is it like: "truncate the line at any one of particular characters/phrases, if the line is more than two words long" ... ?) It's late here and I need to go now but I'll look tomorrow if you can clarify. – zdim Jul 23 '19 at 09:08
#1 "different from the question?" Yes, prompted by the discussion, I solved it. Now with 2 "\w+" it also captures more words. There are also 2 "\X+" because "\w" works on words without accents and "\X" works on words with accents #2 "intermediate result?" Yes. #3 "truncate the line in particular characters / phrases, if the line is more than two words long" My script does just that thanks to perl -pe with \w+ and \X+ – manub Jul 23 '19 at 10:33
@manub Edited my post. Now the second part (crudely) implements your criteria for extracting names from the line; please adjust as needed, and let me know if code isn't clear. – zdim Jul 23 '19 at 18:58
I created a script file and made it executable with chmod +x, but it doesn't work. – manub Jul 25 '19 at 19:46
There are big issues: "line 2: sub: command not found"; "line 3: syntax error near unexpected token `$line'"; "line 3: ` my ($line) = @_;'". I put my path here: my $file = /MYPATH/ "Usage: $0 file\n"; – manub Jul 25 '19 at 20:06
@manub I have no idea what that means. (`sub` command? there is no such thing in my program?) I just copy-pasted this again, to check just in case, and it runs as intended. Did you copy it correctly? (Btw, what is your Perl version?) – zdim Jul 25 '19 at 21:18
I have Perl 5, version 18, subversion 2 (v5.18.2) built for darwin-thread-multi-2level. I tried again and saved a new file with chmod +x. It gives errors, two of which say #1: "line 6: syntax error near unexpected token `('" #2: "line 6: `use open qw(:std :encoding(UTF-8)); # for standard streams'" – manub Jul 26 '19 at 09:05
