Hive – regexp_replace function for multiple strings

Hive – regexp_replace function for multiple strings

I am using hive 0.13! I want to find multiple tokens like “hip hop” and “rock music” in my data and replace them with “hiphop” and “rockmusic” – basically replace them without white space. I have used the regexp_replace function in hive. Below is my query and it works great for above 2 examples.
drop table vp_hiphop;
create table vp_hiphop as
select userid, ntext,
regexp_replace(regexp_replace(ntext, ‘hip hop’, ‘hiphop’), ‘rock music’, ‘rockmusic’) as ntext1
from vp_nlp_protext_males
;

But I have 100 such bigrams/ngrams and want to be able to do replace efficiently where I just remove the whitespace. I can pattern match the phrase – hip hop and rock music but in the replace I want to simply trim the white spaces. Below is what I tried. I also tried using trim with regexp_replace but it wants the third argument in the regexp_replace function.
drop table vp_hiphop;
create table vp_hiphop as
select userid, ntext,
regexp_replace(ntext, ‘(hip hop)|(rock music)’) as ntext1
from vp_nlp_protext_males
;

Solutions/Answers:

Solution 1:

You can strip all occurrences of a substring from a string using the TRANSLATE function to replace the substring with the empty string. For your query it would become this:

drop table vp_hiphop;
create table vp_hiphop as
select  userid, ntext,
        translate(ntext, ' ', '') as ntext1
from  vp_nlp_protext_males
;

References

Related:  Regex pattern for underscore or hyphen but not both