python - Perl: counting many words in many strings efficiently -

- July 15, 2010

I find myself needing to count the number of times words appear in a number of text strings. When I do this, I want to know how many times each word, individually, appears in each text string.

I don't believe my approach is efficient, and any help you could give me would be great.

Usually, I write a loop that (1) pulls in the text from a txt file as a text string, (2) executes another loop that loops over the words I want to count, using a regular expression to check how many times a given word appears and pushing the count to an array each time, and (3) prints the array of counts, separated by commas, to a file.

Here is an example:

#create array that holds the list of words I'm looking to count;
my @word_list = qw(word1 word2 word3 word4);

#create array that holds the names of the txt files I want to count;
my $data_loc = "/data/txt_files_for_counting/";
opendir(DIR1, $data_loc) || die "can't open directory";
my @file_names = readdir(DIR1);
closedir(DIR1);

#create a place to save the results;
my $out_path_name = "/output/my_counts.csv";
open(OUT_FILE, ">>", $out_path_name) or die "can't open file: $out_path_name: $!";

#run the loops;
foreach my $file (@file_names) {
    if ($file =~ /^\./) { next; }

    #pull in the text from the txt file;
    my $p_file = $data_loc . "/" . $file;
    open(B, "<", $p_file) or die "can't open file: $p_file: $!";
    my $text_of_txt_file = do { local $/; <B> };
    close B or die "cannot close $p_file: $!";

    #preserve the filename so the counts are interpretable;
    print OUT_FILE $file;

    foreach my $wl_word (@word_list) {
        #use a regular expression to search for the term without context;
        my @finds_p = $text_of_txt_file =~ m/\b$wl_word\b/g;
        my $n_finds = @finds_p;
        print OUT_FILE "," . $n_finds;
    }
    print OUT_FILE ",\n";
}
close(OUT_FILE);

I've found this approach to be inefficient (slow) as the number of txt files and the number of words I want to count grow.

Is there a more efficient way to do this?

Is there a Perl package that does this?

Could this be more efficient in Python? (E.g., is there a Python package that does this?)

Thanks!

Edit: note, I don't want to count the total number of words, but rather the presence of certain words. Thus, the answer in the question "What's the fastest way to count the number of words in a string in Perl?" doesn't quite apply. Unless I'm missing something.

First off - what you're doing with opendir - I wouldn't, and I'd suggest glob instead.

And otherwise - there's another useful trick. Compile a single regex for your "words". The reason this is useful is that, with a variable in a regex, Perl needs to recompile the regex each time, in case the variable has changed. If it's static, it no longer needs to.

use strict;
use warnings;
use autodie;

my @words = ( "word1", "word2", "word3", "word4", "word5 word6" );

# build one alternation of all the words and compile it once
my $words_regex = join( "|", map ( quotemeta, @words ) );
$words_regex = qr/\b($words_regex)\b/;

open( my $output, ">", "/output/my_counts.csv" );

# glob needs a wildcard to return the files inside the directory
foreach my $file ( glob("/data/txt_files_for_counting/*") ) {
    open( my $input, "<", $file );
    my %count_of;
    # read line by line rather than slurping the whole file
    while (<$input>) {
        foreach my $match (m/$words_regex/g) {
            $count_of{$match}++;
        }
    }
    print {$output} $file, "\n";
    foreach my $word (@words) {
        print {$output} $word, " => ", $count_of{$word} // 0, "\n";
    }
    close($input);
}
close($output);

With this approach - you no longer need to 'slurp' the whole file into memory in order to process it. (Which may not be a big advantage, depending on how large the files are).

When fed data of:

word1 word2 word3 word4 word5 word6 word2 word5 word4 word4 word5 word word 45 sdasdfasf word5 word6  sdfasdf sadf

It outputs:

word1 => 1
word2 => 2
word3 => 1
word4 => 3
word5 word6 => 2
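Since the question also asks about Python, here is a minimal sketch of the same idea there: one alternation compiled once, applied line by line. It assumes Python 3 and only the standard `re` and `collections` modules; `count_words` and `sample` are hypothetical names made up for illustration, not something from the original answer.

```python
import re
from collections import Counter


def count_words(lines, words):
    """Count whole-word occurrences of each entry in `words` across `lines`."""
    # One alternation, compiled once; re.escape plays the role of quotemeta.
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")
    counts = Counter()
    for line in lines:
        # With a single capture group, findall returns the matched strings.
        counts.update(pattern.findall(line))
    # Report every requested word, including those that never matched (0).
    return {w: counts[w] for w in words}


words = ["word1", "word2", "word3", "word4", "word5 word6"]
# Hypothetical sample input: the same data as above, as a single line.
sample = [
    "word1 word2 word3 word4 word5 word6 word2 word5 word4 word4 word5 "
    "word word 45 sdasdfasf word5 word6  sdfasdf sadf"
]
print(count_words(sample, words))
```

Because `count_words` accepts any iterable of lines, an open file handle can be passed in directly, so the file never has to be read into memory in one go. Whether this ends up faster than the Perl version would need measuring on real data.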

I'll note - if you have overlapping substrings in your regex, then this won't work as is - it's possible though, you just need a different regex.
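To make the overlap problem concrete, here is a rough Python sketch of one workaround. Note that it swaps the single alternation for one precompiled pattern per word, so it is a simpler substitute and not the "different regex" mentioned above. `compile_patterns` and `count_each` are hypothetical names, and it assumes one scan per word is an acceptable cost.

```python
import re


def compile_patterns(words):
    """Precompile one whole-word pattern per entry; do this once, up front."""
    return {w: re.compile(r"\b" + re.escape(w) + r"\b") for w in words}


def count_each(text, patterns):
    """Count every entry independently, so entries may overlap each other."""
    return {w: sum(1 for _ in p.finditer(text)) for w, p in patterns.items()}


# "word5" is a prefix of "word5 word6", so a single alternation would only
# ever credit one of the two at any given position in the text.
patterns = compile_patterns(["word5", "word5 word6"])
print(count_each("word5 word6 word5", patterns))
# prints {'word5': 2, 'word5 word6': 1}
```

The trade-off is that the text gets scanned once per word again, like the loop in the question, but the patterns are still compiled only once and reused for every file.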

