I've run into a problem that is very complicated (from my perspective as a newbie), and I'm not sure how to solve it. I can think of the workflow but not the script.

File A looks like the following: Teacher (tab) Student1 (space) Student2 (space) ...

Fiona       Nicole Sherry 
James       Alan Nicole
Michelle    Crystal 
Racheal     Bobby Dan Nicole

Names sometimes have a number appended when two people share the same name (e.g. John1, John2). Students may also appear under more than one teacher if they have more than one advisor.

File B is a file that has groups of teachers together. It looks similar but the values are comma-delimited.

Fiona       Racheal,Jack
Michelle    Racheal
Racheal     Fiona,Michelle
Jack        Fiona

The trend in file B is that a key has multiple values and each value becomes a key as well to easily find who is grouped with who.

The output I would like shows which students are likely to receive a similar education, based on their teachers and teacher groups. So I would like the script to do the following:

  1. Store file A into a hash and close
  2. Open file B and go through each teacher to see if they have students (some may not; the actual list is quite big). So if I take the first teacher, Fiona, the script looks in the stored file A hash to see if there is a Fiona. If there is, add each of her students (in this case, Nicole and Sherry) as a new key in a new hash table.

    while (<Group>) {
        chomp;
        $data=$_;
        $data=~/^(\S+)\s+(.*)$/;
        $TeacherA=$1;
        $group=$2; 
    
  3. Then, look at the group of teachers who are grouped with Fiona (Racheal, Jack). Take 1 person at a time (Racheal)

    if (defined??) {
        while ($list=~/(\w+)(.*)/) {
            $TeacherB=$1;
            $group=$2;
    
  4. Look at file A for Racheal's students.
  5. Fill them as values (comma-delimited) for student keys made from step 2.
  6. Print student-student and teacher-teacher group.

    Nicole  Bobby,Dan,Nicole    Fiona   Racheal
    Sherry  Bobby,Dan,Nicole    Fiona   Racheal
    

    Since the next teacher in Fiona's group, Jack, doesn't have any students, he would not be in these results. If he had one, for example David, the results would be:

    Nicole  Bobby,Dan,Nicole    Fiona   Racheal
    Sherry  Bobby,Dan,Nicole    Fiona   Racheal
    Nicole  David               Fiona   Jack
    Sherry  David               Fiona   Jack
    

I'm so sorry for asking such a complicated and specific question. I hope other people who are doing something like this by any chance may benefit from the answers. Thank you so much for your help and reply. You are my only source of help.

  • I'm confused about the contents of File B. Do all of the names in File B belong to teachers? –  Apr 23 '12 at 07:26
  • Yes, all the names in File B belong to teachers. Sorry everyone, what I'm doing may seem strange. Being a newbie, I did start doing it manually, but the file I have is much larger than the simplified snapshot I've shown here, and it is taking me ages and confusing me at the same time. – absolutenewbie Apr 24 '12 at 01:30
  • It was a good idea to go with perl instead of manual work. Perl is great for that kind of thing. – simbabque Apr 26 '12 at 08:26

2 Answers

This is a rather strange way to look at the data, but I think I got it to work the way you tried. It would be interesting to see why you want the data to be that way. Maybe provide column headings next time. Knowing why you do something in a certain way often makes it a lot easier to think of ways to achieve it, imo.

So here's what I did. Don't get confused, I put your values from file A and file B into scalars and changed the part about reading them.

my $file_a = qq~Fiona\tNicole Sherry
James\tAlan Nicole
Michelle\tCrystal
Racheal\tBobby Dan Nicole
~;

my $file_b = qq~Fiona\tRacheal,Jack
Michelle\tRacheal
Racheal\tFiona,Michelle
Jack\tFiona
~;

After that, proceed to read the 'files'.

# 1: Store file A in a hash (teacher => space-separated students)
my (%file_a);
foreach my $line_a (split /\n/, $file_a) {
  my @temp = split /\t/, $line_a;
  $file_a{$temp[0]} = $temp[1];
}

# 2: Go through file B
# ($a and $b are reserved for sort, so the loop variables use other names)
foreach my $line (split /\n/, $file_b) {
  my @line_b = split /\t/, $line;
  # Look in stored file A if the teacher is there
  if (exists $file_a{$line_b[0]}) {
    my (%new_hash_table, @teachers);
    # Put all the students of this teacher into a new hash
    $new_hash_table{$_} = '' foreach split / /, $file_a{$line_b[0]};

    # 3: Take one of the group of teachers who are grouped with the 
    # current teacher at a time
    foreach my $teacher (split /,/, $line_b[1]) {
      if (exists $file_a{$teacher}) {
        # 4: This teacher from the group has students listed in file A
        push @teachers, $teacher; # Store the teacher's name for print later
        foreach (keys %new_hash_table) {
          # 5: Fill the students as csv for the student keys from step 2
          # Note: this assignment overwrites the value stored for the
          # previous teacher in the group, so only the last one survives
          $new_hash_table{$_} = join(',', split(/ /, $file_a{$teacher}));
        }
      }
    }
    foreach my $student (keys %new_hash_table) {
      # 6: Print...        
      print join("\t", 
        # Student-student relation
        $student, $new_hash_table{$student}, 
        # Teacher-teacher relation
        $line_b[0], @teachers);
      print "\n";
    }
  }
}

For me that provides the following output:

Sherry  Bobby,Dan,Nicole    Fiona   Racheal
Nicole  Bobby,Dan,Nicole    Fiona   Racheal
Crystal Bobby,Dan,Nicole    Michelle    Racheal
Bobby   Crystal Racheal Fiona   Michelle
Nicole  Crystal Racheal Fiona   Michelle
Dan Crystal Racheal Fiona   Michelle

This is probably weird since I don't have all the values.
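
Part of the oddness in the last three rows seems to come from step 5, which stores only the students of the last teacher in the group while every teacher name is printed. A minimal sketch of one alternative, assuming the goal is one row per teacher pair as in the question's expected output; the variable names and the hard-coded data structures are placeholders for this illustration, not part of the code above:

```perl
use strict;
use warnings;

# Sample data copied from the question (file A as a hash of arrays).
my %students_of = (
    Fiona    => [qw(Nicole Sherry)],
    James    => [qw(Alan Nicole)],
    Michelle => [qw(Crystal)],
    Racheal  => [qw(Bobby Dan Nicole)],
);

# File B, kept as a list so the rows come out in file order.
my @groups = (
    [ Fiona    => [qw(Racheal Jack)] ],
    [ Michelle => [qw(Racheal)] ],
    [ Racheal  => [qw(Fiona Michelle)] ],
    [ Jack     => [qw(Fiona)] ],
);

my @rows;
for my $entry (@groups) {
    my ($teacher_a, $partners) = @$entry;
    my $own = $students_of{$teacher_a} or next;    # teacher A has no students
    for my $teacher_b (@$partners) {
        my $theirs = $students_of{$teacher_b} or next;    # partner has no students
        # One row per student of teacher A and per partner teacher.
        push @rows, join "\t", $_, join(',', @$theirs), $teacher_a, $teacher_b
            for @$own;
    }
}

print "$_\n" for @rows;
```

With the sample data this yields nine rows, and the first two match the rows for Nicole and Sherry shown in the question.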

Anyway, there are a few things to be said about this.

In your example code you used a regex like $data=~/^(\S+)\s+(.*)$/; to get to the values of a simple two-column list. It is a lot easier to use the split operator to do that.
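
As a small illustration of that point, here is a sketch of the same two-column parse done with split. The sample line is taken from file B in the question; the variable names are only placeholders:

```perl
use strict;
use warnings;

# One line of file B: teacher, a tab, then a comma-separated group.
my $line = "Fiona\tRacheal,Jack";

# A limit of 2 splits at the first run of whitespace only,
# so the rest of the line stays together in $group.
my ($teacher, $group) = split /\s+/, $line, 2;

# The group column can then be split again on commas.
my @group = split /,/, $group;

print "$teacher is grouped with: @group\n";
```

The limit matters for file A, where the second column contains spaces itself and should not be broken up by the first split.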

When you read from a file with the <FILEHANDLE> syntax, you can put the scalar you want your lines to go into in the while loop's condition like so:

while (my $data = <GROUP>) {
      chomp $data;
      # ... work with $data here ...
}

Also it is common to write filehandle names in all-caps.
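
All-caps names apply to bareword filehandles. A commenter on this answer recommends lexical filehandles with the three-argument form of open instead. A minimal sketch of that style follows; the in-memory string stands in for a real file and the names `$group_fh` and `%group_of` are made up for this example:

```perl
use strict;
use warnings;

# Stand-in for the contents of the group file. Opening a reference to a
# string gives an in-memory filehandle, which keeps the sketch self-contained.
# With a real file, pass its path as the third argument instead of \$text.
my $text = "Fiona\tRacheal,Jack\nMichelle\tRacheal\n";

open my $group_fh, '<', \$text or die "Cannot open group data: $!";

my %group_of;
while (my $data = <$group_fh>) {
    chomp $data;
    my ($teacher, $group) = split /\t/, $data, 2;
    $group_of{$teacher} = $group;
}
close $group_fh;

print "$_ => $group_of{$_}\n" for sort keys %group_of;
```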

I'd suggest you take a look at the book 'Learning Perl'. The basic concepts of hashes and arrays in there should be enough to tackle tasks like this one. Hope this helps.

simbabque
  • Don't use typeglobs, use lexical filehandles instead. See http://stackoverflow.com/questions/3276674/which-one-is-good-practice-a-lexical-filehandle-or-a-typeglob or http://stackoverflow.com/questions/1479741/why-is-three-argument-open-calls-with-lexical-filehandles-a-perl-best-practice – dgw Apr 23 '12 at 09:42
  • @simbabque-Thanks for tackling my question. I know it sounds weird. I just had a question about changing my two files into scalars. Please forgive me if this is a naive question, but if my file size is much larger than what I've given as an example, would doing that still be the best way to do it? If not, would you be able to suggest anything else? – absolutenewbie Apr 24 '12 at 04:49
  • No, it would not. In fact, I only did it out of laziness so I wouldn't have to create the files. But the way the above program works is just the same regardless of where you take your data from. Of course in a production program you would want your input to be separate from the program. You want it to be passed in as parameters (in this case two filenames) because otherwise you'd have to change the program code each time you receive new data. And always remember, in perl there's more than one way to do it, but there seldom is a best way. ;-) – simbabque Apr 24 '12 at 07:22
  • @simbabque - I've actually put the following lines at the very start: 'open $teacher, '<', 'teacher.txt' or die$!;' 'open $group, '<', 'group.txt' or die$!;' and 'close $teacher; close $group;' at the end. It's not giving me any errors but it's not giving me any output either. Would you be able to suggest why? Thank you so much. – absolutenewbie Apr 26 '12 at 06:29
  • @absolutenewbie: please use `code` syntax for code. You need to actually read from the file, not only open it. Have you changed that, too? Refer to [open](http://perldoc.perl.org/functions/open.html) on perldoc and look at my last example above. Are the files in the working directory (i.e. the same dir) as the script itself? – simbabque Apr 26 '12 at 08:24
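
The points in these comments (pass the two filenames as parameters, and actually read from the handles after opening them) can be sketched as follows. This is only an illustration: `read_two_columns` is a hypothetical helper, and the file names in the usage line are the ones mentioned in the comment above.

```perl
use strict;
use warnings;

# Hypothetical helper: read a two-column, tab-separated file into a hash
# that maps the first column to the rest of the line.
sub read_two_columns {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my %column;
    while (my $line = <$fh>) {    # opening alone reads nothing; this loop does
        chomp $line;
        next unless length $line;    # skip empty lines
        my ($key, $value) = split /\t/, $line, 2;
        $column{$key} = $value;
    }
    close $fh;
    return \%column;
}

# Run as: perl script.pl teacher.txt group.txt
if (@ARGV == 2) {
    my ($students_of, $group_of) = map { read_two_columns($_) } @ARGV;
    printf "%d teachers with students, %d teachers with groups\n",
        scalar(keys %$students_of), scalar(keys %$group_of);
}
```

Relative file names are resolved against the current working directory, so the script has to be started from the directory that holds the two files, or be given full paths.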

I can't imagine why you would want this redundant data when you could just look at file A to get a good idea of who was getting a similar education ... but here is a way of doing it in perl all the same.

use strict;
use warnings;

my $data = {};

# pull in students: "Teacher<TAB>Student1 Student2 ..."
open(my $students_fh, '<', 'students.txt') or die "Cannot open students.txt: $!";
while (my $line = <$students_fh>) {
  chomp($line);
  my ($teacher, @students) = split(/\s+/, $line);
  $data->{$teacher}->{students} = \@students;
}
close $students_fh;

# pull in teachers: "Teacher<TAB>Supporter1,Supporter2,..."
open(my $teachers_fh, '<', 'teachers.txt') or die "Cannot open teachers.txt: $!";
while (my $line = <$teachers_fh>) {
  chomp($line);
  my ($teacher, $supporters) = split(/\s+/, $line);
  my @supporters = split(/,/, $supporters);
  $data->{$teacher}->{supporters} = \@supporters;
}
close $teachers_fh;

# make the output
foreach my $teacher (keys %{$data}) {
  # a teacher may appear in only one of the two files, so default to empty lists
  foreach my $teacher_student (@{ $data->{$teacher}->{students} || [] }) {
    foreach my $supporter (@{ $data->{$teacher}->{supporters} || [] }) {
      my @supporter_students = @{ $data->{$supporter}->{students} || [] };
      if (@supporter_students) {
        print "$teacher_student\t" .
              join(",", @supporter_students) .
              "\t$teacher\t$supporter\n";
      }
    }
  }
}

When run on the data listed in the question it returns:

Crystal Bobby,Dan,Nicole    Michelle    Racheal
Nicole  Bobby,Dan,Nicole    Fiona   Racheal
Sherry  Bobby,Dan,Nicole    Fiona   Racheal
Bobby   Nicole,Sherry   Racheal Fiona
Bobby   Crystal Racheal Michelle
Dan Nicole,Sherry   Racheal Fiona
Dan Crystal Racheal Michelle
Nicole  Nicole,Sherry   Racheal Fiona
Nicole  Crystal Racheal Michelle
zortacon
  • use `for` instead of `foreach` and `open(IN, "<","filename.txt")` – gaussblurinc Apr 23 '12 at 10:43
  • @zortacon-first, thank you so much for tackling my question. I think I understand how you were doing it. Unfortunately, it's giving me an error saying: 'Use of uninitialized value in array dereference at match_student.pl line24.' and 'Use of uninitialized value in addition (+) at match_student.pl line24.' It's also not giving me the full list I'm after. I think it's only comparing the first match (the Fiona-Racheal pair). – absolutenewbie Apr 24 '12 at 04:45
  • the error means that the teacher had no students set while they were being read from the students text file. It is dereferencing the link into an array and finding nothing there. That should just be a warning and not stop it from running. – zortacon Apr 24 '12 at 05:50
  • Yes, it's working for the example set I provided but not for the actual file I have. I wonder what's causing the difference.. – absolutenewbie Apr 24 '12 at 07:11
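
The situation described in these comments, a supporter with no student list, can be shown in a few lines. The sketch below uses a made-up miniature of the `$data` structure in which Jack never appeared in the students file, and falls back to an empty list before dereferencing. The `%count` hash exists only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical miniature of the data structure: Jack is listed as a
# supporter but has no "students" entry.
my $data = {
    Fiona   => { students => [qw(Nicole Sherry)], supporters => [qw(Racheal Jack)] },
    Racheal => { students => [qw(Bobby Dan Nicole)] },
    Jack    => { supporters => [qw(Fiona)] },
};

my %count;
for my $supporter (@{ $data->{Fiona}{supporters} }) {
    # Fall back to an empty list when no students were recorded,
    # instead of dereferencing an undefined value.
    my @students = @{ $data->{$supporter}{students} || [] };
    $count{$supporter} = scalar @students;
}

print "$_: $count{$_}\n" for sort keys %count;
```

Without the `|| []` fallback, dereferencing the missing entry produces the "uninitialized value in array dereference" warning quoted above, and under `use strict` it becomes a fatal error.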